Skip to content
113 changes: 113 additions & 0 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# DocumentDB Samples — Copilot Instructions

## Project Overview
Azure DocumentDB code samples for vector search and algorithm selection quickstart articles.

## Repository Structure

```
ai/
├── data/ # Shared data files (Hotels.json, Hotels_Vector.json)
├── vector-search-python/ # Python vector search samples
├── vector-search-typescript/ # TypeScript/Node.js vector search samples
├── vector-search-go/ # Go vector search samples
├── vector-search-java/ # Java vector search samples
├── vector-search-dotnet/ # .NET vector search samples
├── vector-search-agent-go/ # Go agent sample (separate from quickstart)
└── vector-search-agent-typescript/ # TypeScript agent sample (separate from quickstart)
```

### Sample Categories
- **Quickstart samples** (`vector-search-{language}/`): Single algorithm per file, one entry point, uses `MONGO_CLUSTER_NAME` + quickstart env vars
- **Agent samples** (`vector-search-agent-{language}/`): Multi-LLM orchestration, three entry points (upload/agent/cleanup), uses `AZURE_DOCUMENTDB_*` env vars

Each quickstart sample directory contains:
- `src/` — Source files: one per algorithm (`ivf`, `hnsw`, `diskann`) + `utils` + `create_embeddings` + `show_indexes`
- `output/` — Expected output files: `ivf.txt`, `hnsw.txt`, `diskann.txt`
- `README.md` — Setup, usage, and troubleshooting documentation
- `.env.example` (Go, Python, TypeScript) or `appsettings.json` (.NET) — Configuration template

## Language Dependencies

### Go
- Go 1.21+
- go.mongodb.org/mongo-driver v1.17+
- github.com/Azure/azure-sdk-for-go/sdk/azidentity
- github.com/Azure/azure-sdk-for-go/sdk/azcore
- github.com/openai/openai-go/v3

### Java
- Java 17+
- MongoDB Driver (mongodb-driver-sync) 5.3+
- Azure Identity (azure-identity) 1.15+
- Azure AI OpenAI (azure-ai-openai)
- Maven 3.8+

### Python
- Python 3.10+
- pymongo >= 4.7
- azure-identity
- openai

### TypeScript/Node.js
- Node.js 20+
- mongodb 6.12+
- @azure/identity
- openai

### .NET
- .NET 8+
- MongoDB.Driver 3.0+
- Azure.Identity
- Azure.AI.OpenAI

## Consistent Variable Values

All samples MUST use these environment variable names and defaults:

| Variable | Default | Purpose |
|----------|---------|---------|
| MONGO_CLUSTER_NAME | (required) | DocumentDB cluster name (passwordless auth) |
| MONGO_CONNECTION_STRING | (none) | Full connection string (connection string auth) |
| AZURE_OPENAI_EMBEDDING_ENDPOINT | (required) | Azure OpenAI endpoint |
| AZURE_OPENAI_EMBEDDING_MODEL | (required) | Embedding model deployment name |
| AZURE_OPENAI_EMBEDDING_API_VERSION | 2023-05-15 | Azure OpenAI API version |
| DATA_FILE_WITH_VECTORS | ../data/Hotels_Vector.json | Path to data file with embeddings |
| EMBEDDED_FIELD | DescriptionVector | Vector field name in documents |
| EMBEDDING_DIMENSIONS | 1536 | Vector dimensions |
| LOAD_SIZE_BATCH | 100 | Batch size for document insertion |
| EMBEDDING_SIZE_BATCH | 16 | Batch size for embedding generation |
| AZURE_DOCUMENTDB_DATABASENAME | Hotels | Database name |
| SIMILARITY | (varies) | Similarity metric (COS, euclidean, ip) |
| ALGORITHM | (varies) | Algorithm (ivf, hnsw, diskann) |

## Consistent Algorithm Parameters

### IVF
- numLists: 1 *(quickstart samples; agent samples use `IVF_NUM_LISTS=10` for production-like config)*
- nProbes: 1

### HNSW
- m: 16
- efConstruction: 64
- efSearch: 40

### DiskANN
- maxDegree: 20
- lBuild: 10
- lSearch: 40

## Rules

1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references.
2. **Vector field name is DescriptionVector.** Never default to "contentVector".
3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS` which defaults to `../data/Hotels_Vector.json` (the shared data location). .NET copies data locally to `data/Hotels_Vector.json` in the build output.
4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants.
5. **Database name variable is AZURE_DOCUMENTDB_DATABASENAME.** Do not use MONGO_DB_NAME or other variants.
6. **.NET uses appsettings.json** with configuration sections: `AzureOpenAI`, `DataFiles`, `Embedding`, `MongoDB`, `VectorSearch`. Environment variables override config using `Section__Key` format (e.g., `AzureOpenAI__Endpoint`).
7. **Similarity metric is COS.** All vector index definitions use `"similarity": "COS"` (cosine similarity).
8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes.
9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability.
10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration.
11. **Collection naming:** `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Index naming: `vectorIndex_{algorithm}`.
12. **Vector search uses k=5.** All samples return top 5 results. Do not parameterize k unless explicitly required.
89 changes: 89 additions & 0 deletions .github/instructions/agent-samples.instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
---
applyTo: "ai/vector-search-agent-*/**"
---
# Agent Samples (Multi-LLM Convention)

Agent samples (`vector-search-agent-*`) use a **different convention** from quickstart samples. They orchestrate multiple LLM deployments and use a distinct set of environment variables. Do NOT mix agent conventions with quickstart conventions.

## Architecture: Planner → Synthesizer

Agent samples use a two-agent pipeline with three Azure OpenAI deployments:

| Deployment | Role | Temperature | Purpose |
|------------|------|-------------|---------|
| Embedding | Vector search | — | Same as quickstart samples |
| Planner | Tool-calling agent | 0.0 | Transforms user query → tool call → retrieves search results |
| Synthesizer | Response generation | 0.3 | Takes search results + query → produces natural language recommendation |

The planner invokes a `search_hotels_collection` tool that performs the vector search. The synthesizer receives the search results and generates a comparative hotel recommendation.

## Agent Entry Points

Agent samples have three separate entry points (not a single main file):

| Entry Point | Purpose |
|-------------|---------|
| `upload` | Load hotel data, create embeddings, insert into DocumentDB, create vector index |
| `agent` | Run planner → synthesizer pipeline against an existing collection |
| `cleanup` | Drop the database |

## Agent Environment Variables

Agent samples use `AZURE_DOCUMENTDB_*` and `AZURE_OPENAI_*` prefixes consistently. These differ from quickstart variable names.

| Agent Variable | Quickstart Equivalent | Notes |
|---------------|----------------------|-------|
| `AZURE_OPENAI_ENDPOINT` | `AZURE_OPENAI_EMBEDDING_ENDPOINT` | Single endpoint for all 3 deployments |
| `AZURE_OPENAI_API_KEY` | — | For API key auth (not used in quickstarts) |
| `AZURE_DOCUMENTDB_CLUSTER` | `MONGO_CLUSTER_NAME` | Cluster name for passwordless auth |
| `AZURE_DOCUMENTDB_CONNECTION_STRING` | `MONGO_CONNECTION_STRING` | Full connection string |
| `AZURE_DOCUMENTDB_COLLECTION` | — | Collection name (agents parameterize this) |
| `AZURE_DOCUMENTDB_INDEX_NAME` | — | Vector index name (agents parameterize this) |
| `VECTOR_INDEX_ALGORITHM` | `ALGORITHM` | Default: `vector-ivf` |
| `VECTOR_SIMILARITY` | `SIMILARITY` | Default: `COS` |
| `USE_PASSWORDLESS` | — | `true`/`false` toggle for auth mode |
| `DEBUG` | — | `true`/`false` verbose logging |
| `QUERY` | — | Default: `"quintessential lodging near running trails, eateries, retail"` |
| `NEAREST_NEIGHBORS` | — | Default: `5` |

**Agent-only variables (no quickstart equivalent):**

| Variable | Default | Purpose |
|----------|---------|---------|
| `AZURE_OPENAI_EMBEDDING_DEPLOYMENT` | (required) | Embedding model deployment name |
| `AZURE_OPENAI_EMBEDDING_API_VERSION` | 2024-06-01 (Go), 2023-05-15 (TS) | Embedding API version |
| `AZURE_OPENAI_PLANNER_DEPLOYMENT` / `AZURE_OPENAI_PLANNER_MODEL` | (required) | Planner LLM deployment |
| `AZURE_OPENAI_PLANNER_API_VERSION` | (required) | Planner API version |
| `AZURE_OPENAI_SYNTH_DEPLOYMENT` / `AZURE_OPENAI_SYNTH_MODEL` | (required) | Synthesizer LLM deployment |
| `AZURE_OPENAI_SYNTH_API_VERSION` | (required) | Synthesizer API version |
| `IVF_NUM_LISTS` | 10 | IVF numLists (⚠️ differs from quickstart default of 1) |
| `HNSW_M` | 16 | HNSW m parameter |
| `HNSW_EF_CONSTRUCTION` | 64 | HNSW efConstruction parameter |
| `DISKANN_MAX_DEGREE` | 20 | DiskANN maxDegree parameter |
| `DISKANN_L_BUILD` | 10 | DiskANN lBuild parameter |

## Agent Authentication

Agents support passwordless (OIDC) and API key auth, toggled by `USE_PASSWORDLESS`.

**OIDC scopes:**
- DocumentDB: `https://ossrdbms-aad.database.windows.net/.default`
- Azure OpenAI: `https://cognitiveservices.azure.com/.default`

**MongoDB URI (passwordless):** `mongodb+srv://{cluster}.global.mongocluster.cosmos.azure.com/`
- Auth mechanism: `MONGODB-OIDC` with machine callback

## Language-Specific SDK Stacks

| Language | MongoDB | OpenAI | Agent Framework |
|----------|---------|--------|-----------------|
| Go | `go.mongodb.org/mongo-driver` (raw) | `github.com/openai/openai-go/v3` (raw) | Manual tool-calling loop |
| TypeScript | `mongodb` (cleanup only) | `@langchain/openai` | `langchain` + `@langchain/azure-cosmosdb` + `zod` |

**TypeScript agents use LangChain** — the `@langchain/azure-cosmosdb` package manages the vector store, and `langchain`'s `createAgent` handles tool orchestration. This is a fundamentally different SDK stack from the quickstart TypeScript samples which use the raw MongoDB driver.

**Go agents use raw SDKs** — both MongoDB driver and OpenAI SDK are used directly, with manual tool-calling implementation.

## IVF numLists Discrepancy

Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, select-algorithm) hardcode `numLists=1`. This is intentional — agent samples are designed for tunable, production-like configurations while quickstart samples use minimal values for simplicity.
138 changes: 138 additions & 0 deletions .github/instructions/cli-examples.instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
---
applyTo: "ai/**"
---
# Running Samples — CLI Invocation

Environment variables are passed inline with the run command. Do NOT use `.env` files. Each example below shows the required variables for a vector-search quickstart sample.

> **Note:** Examples show `ivf` but the same pattern applies to all algorithms — replace `ivf` with `hnsw` or `diskann` in file/class names.

## Go

**Bash:**
```bash
MONGO_CLUSTER_NAME=myCluster \
AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
go run ./src/ivf.go
```

**PowerShell:**
```powershell
$env:MONGO_CLUSTER_NAME="myCluster"
$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
go run ./src/ivf.go
```

## Python

**Bash:**
```bash
MONGO_CLUSTER_NAME=myCluster \
AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
python src/ivf.py
```

**PowerShell:**
```powershell
$env:MONGO_CLUSTER_NAME="myCluster"
$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
python src/ivf.py
```

## TypeScript/Node.js

**Bash:**
```bash
MONGO_CLUSTER_NAME=myCluster \
AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
npx tsx src/ivf.ts
```

**PowerShell:**
```powershell
$env:MONGO_CLUSTER_NAME="myCluster"
$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
npx tsx src/ivf.ts
```

## Java

**Bash:**
```bash
MONGO_CLUSTER_NAME=myCluster \
AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
```

**PowerShell:**
```powershell
$env:MONGO_CLUSTER_NAME="myCluster"
$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
```

## .NET

.NET uses `appsettings.json` for configuration, but environment variables can override:

**Bash:**
```bash
DocumentDB__ClusterName=myCluster \
AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \
AzureOpenAI__DeploymentName=text-embedding-ada-002 \
dotnet run
```

**PowerShell:**
```powershell
$env:DocumentDB__ClusterName="myCluster"
$env:AzureOpenAI__Endpoint="https://myendpoint.openai.azure.com/"
$env:AzureOpenAI__DeploymentName="text-embedding-ada-002"
dotnet run
```

## Agent Samples (Multi-LLM)

Agent samples require more variables for the planner and synthesizer deployments:

**Bash:**
```bash
AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \
AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \
AZURE_OPENAI_EMBEDDING_API_VERSION=2024-06-01 \
AZURE_OPENAI_PLANNER_DEPLOYMENT=gpt-4o \
AZURE_OPENAI_PLANNER_API_VERSION=2024-06-01 \
AZURE_OPENAI_SYNTH_DEPLOYMENT=gpt-4o \
AZURE_OPENAI_SYNTH_API_VERSION=2024-06-01 \
AZURE_DOCUMENTDB_CLUSTER=myCluster \
AZURE_DOCUMENTDB_DATABASENAME=Hotels \
AZURE_DOCUMENTDB_COLLECTION=hotels \
AZURE_DOCUMENTDB_INDEX_NAME=vectorIndex \
USE_PASSWORDLESS=true \
go run ./cmd/agent/main.go
```

**PowerShell:**
```powershell
$env:AZURE_OPENAI_ENDPOINT="https://myendpoint.openai.azure.com/"
$env:AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-ada-002"
$env:AZURE_OPENAI_EMBEDDING_API_VERSION="2024-06-01"
$env:AZURE_OPENAI_PLANNER_DEPLOYMENT="gpt-4o"
$env:AZURE_OPENAI_PLANNER_API_VERSION="2024-06-01"
$env:AZURE_OPENAI_SYNTH_DEPLOYMENT="gpt-4o"
$env:AZURE_OPENAI_SYNTH_API_VERSION="2024-06-01"
$env:AZURE_DOCUMENTDB_CLUSTER="myCluster"
$env:AZURE_DOCUMENTDB_DATABASENAME="Hotels"
$env:AZURE_DOCUMENTDB_COLLECTION="hotels"
$env:AZURE_DOCUMENTDB_INDEX_NAME="vectorIndex"
$env:USE_PASSWORDLESS="true"
go run ./cmd/agent/main.go
```
53 changes: 53 additions & 0 deletions .github/instructions/execution-patterns.instructions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
---
applyTo: "ai/vector-search-*/**"
---
# Sample Execution Patterns

## Authentication

All samples support two authentication modes. **Passwordless (OIDC) is preferred.**

### Passwordless Authentication (Recommended)
- Uses `DefaultAzureCredential` / OIDC with `MONGO_CLUSTER_NAME`
- Connection URI format: `mongodb+srv://{clusterName}.global.mongocluster.cosmos.azure.com/`
- OIDC token scope: `https://ossrdbms-aad.database.windows.net/.default`
- Each language implements a utility function pair: `getClients()` and `getClientsPasswordless()`

### Connection String Authentication
- Uses `MONGO_CONNECTION_STRING` with username/password
- Format: `mongodb+srv://username:password@{cluster}.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000`

> **Note:** `mongocluster.cosmos.azure.com` is the current DocumentDB hostname — this is NOT a Cosmos DB reference.

## Sample Execution Pattern

All vector search samples follow this consistent lifecycle:

1. **Initialize clients** — Create MongoDB and Azure OpenAI clients (passwordless preferred)
2. **Drop collection** — Drop the algorithm-specific collection if it exists (clean start)
3. **Create collection** — Create a fresh collection
4. **Load data** — Read `Hotels_Vector.json` and batch-insert documents
5. **Create vector index** — Create algorithm-specific vector index using `createIndexes` command with `cosmosSearch` key type
6. **Generate query embedding** — Embed the search query text using Azure OpenAI
7. **Perform vector search** — Run `$search` aggregation pipeline with `cosmosSearch` operator
8. **Print results** — Display `HotelName` and `score` for top results
9. **Cleanup** — Drop the collection in a `finally`/`defer` block

### Naming Conventions
- **Collection names:** `hotels_{algorithm}` — e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`
- **Index names:** `vectorIndex_{algorithm}` — e.g., `vectorIndex_ivf`, `vectorIndex_hnsw`, `vectorIndex_diskann`
- **Database name:** `Hotels` (hardcoded, matches `AZURE_DOCUMENTDB_DATABASENAME` default)

### Standard Search Query
All samples use the same query text: `"quintessential lodging near running trails, eateries, retail"`

### Vector Search Pipeline Structure
All languages use the same aggregation pipeline structure:
```
[
{ "$search": { "cosmosSearch": { "vector": <queryEmbedding>, "path": "DescriptionVector", "k": 5 } } },
{ "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } }
]
```

> **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference.