From 6632d9dc2a9124982afb3545c59e4500c7e48f1c Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 09:36:17 -0700 Subject: [PATCH 1/8] chore: add .github/copilot-instructions.md for project conventions Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/copilot-instructions.md | 78 +++++++++++++++++++++++++++++++++ 1 file changed, 78 insertions(+) create mode 100644 .github/copilot-instructions.md diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..bdc2ae9 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,78 @@ +# DocumentDB Samples — Copilot Instructions + +## Project Overview +Azure DocumentDB code samples for vector search and algorithm selection quickstart articles. + +## Language Dependencies + +### Go +- Go 1.21+ +- go.mongodb.org/mongo-driver v1.17+ +- github.com/Azure/azure-sdk-for-go/sdk/azidentity +- github.com/Azure/azure-sdk-for-go/sdk/azcore + +### Java +- Java 17+ +- MongoDB Driver 5.3+ +- Azure Identity 1.15+ +- Maven 3.8+ + +### Python +- Python 3.10+ +- pymongo >= 4.7 +- azure-identity +- openai + +### TypeScript/Node.js +- Node.js 20+ +- mongodb 6.12+ +- @azure/identity +- openai + +### .NET +- .NET 8+ +- MongoDB.Driver 3.2+ +- Azure.Identity + +## Consistent Variable Values + +All samples MUST use these environment variable names and defaults: + +| Variable | Default | Purpose | +|----------|---------|---------| +| MONGO_CLUSTER_NAME | (required) | DocumentDB cluster name | +| AZURE_OPENAI_EMBEDDING_ENDPOINT | (required) | Azure OpenAI endpoint | +| AZURE_OPENAI_EMBEDDING_MODEL | (required) | Embedding model deployment | +| DATA_FILE_WITH_VECTORS | ./Hotels_Vector.json | Path to data file | +| EMBEDDED_FIELD | DescriptionVector | Vector field name in documents | +| EMBEDDING_DIMENSIONS | 1536 | Vector dimensions | +| LOAD_SIZE_BATCH | 100 | Batch size for document insertion | +| EMBEDDING_SIZE_BATCH | 16 | Batch size for 
embedding generation | +| AZURE_DOCUMENTDB_DATABASENAME | Hotels | Database name | +| SIMILARITY | (varies) | Similarity metric (cosine, euclidean, ip) | +| ALGORITHM | (varies) | Algorithm (ivf, hnsw, diskann) | + +## Consistent Algorithm Parameters + +### IVF +- numLists: 1 +- nProbes: 1 + +### HNSW +- m: 16 +- efConstruction: 64 +- efSearch: 40 + +### DiskANN +- maxDegree: 20 +- lBuild: 10 +- lSearch: 40 + +## Rules + +1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". +2. **Vector field name is DescriptionVector.** Never default to "contentVector". +3. **Data file is shared.** All samples reference `../data/Hotels_Vector.json`. READMEs instruct users to copy it locally. +4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants. +5. **Database name variable is AZURE_DOCUMENTDB_DATABASENAME.** Do not use MONGO_DB_NAME or other variants. +6. **.NET uses appsettings.json** with same variable names under a "DocumentDB" section. 
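
As an illustration of the variable table added in this patch, the following is a minimal Python sketch of how a sample might read those names and apply the listed defaults. The `SampleConfig` class and `load_config` function are hypothetical names chosen for this sketch and do not come from the repository. The data-file path is left out because its default is changed by the next patch in the series.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleConfig:
    """Hypothetical container for the environment variables in the table."""
    cluster_name: str
    embedding_endpoint: str
    embedding_model: str
    embedded_field: str
    embedding_dimensions: int
    load_size_batch: int
    embedding_size_batch: int
    database_name: str


def load_config(env=None) -> SampleConfig:
    """Read settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env

    def required(name: str) -> str:
        # Variables marked "(required)" in the table have no default.
        value = env.get(name)
        if not value:
            raise ValueError(f"Missing required environment variable: {name}")
        return value

    return SampleConfig(
        cluster_name=required("MONGO_CLUSTER_NAME"),
        embedding_endpoint=required("AZURE_OPENAI_EMBEDDING_ENDPOINT"),
        embedding_model=required("AZURE_OPENAI_EMBEDDING_MODEL"),
        # Defaults below are copied from the table in this patch.
        embedded_field=env.get("EMBEDDED_FIELD", "DescriptionVector"),
        embedding_dimensions=int(env.get("EMBEDDING_DIMENSIONS", "1536")),
        load_size_batch=int(env.get("LOAD_SIZE_BATCH", "100")),
        embedding_size_batch=int(env.get("EMBEDDING_SIZE_BATCH", "16")),
        database_name=env.get("AZURE_DOCUMENTDB_DATABASENAME", "Hotels"),
    )
```

Accepting the environment as a parameter keeps the function testable. A sample would call `load_config()` once at startup and fail fast if a required variable is missing.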
From dcaa7f70566c7221a9ab89efe6f78eb23c58b968 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 09:54:29 -0700 Subject: [PATCH 2/8] fix: update copilot-instructions with missing deps, correct data path, and code patterns - Add OpenAI SDK dependencies for Go, Java, and .NET - Add python-dotenv and godotenv dependencies - Fix DATA_FILE_WITH_VECTORS default from ./Hotels_Vector.json to ../data/Hotels_Vector.json - Add AZURE_OPENAI_EMBEDDING_API_VERSION and MONGO_CONNECTION_STRING to env var table - Add Authentication section documenting passwordless (OIDC) and connection string auth - Add Sample Execution Pattern section with consistent lifecycle, naming conventions, standard search query, and vector search pipeline structure - Add Repository Structure overview - Clarify Rule 1 exceptions (mongocluster.cosmos.azure.com, cosmosSearch, VS Code extension) - Fix Rule 3 to reference env var with shared data path default - Fix Rule 6 to list actual .NET config sections - Add rules for COS similarity, output files, and index type availability Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/copilot-instructions.md | 99 +++++++++++++++++++++++++++++---- 1 file changed, 89 insertions(+), 10 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index bdc2ae9..b8a4db8 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -3,6 +3,26 @@ ## Project Overview Azure DocumentDB code samples for vector search and algorithm selection quickstart articles. 
+## Repository Structure + +``` +ai/ +├── data/ # Shared data files (Hotels.json, Hotels_Vector.json) +├── vector-search-python/ # Python vector search samples +├── vector-search-typescript/ # TypeScript/Node.js vector search samples +├── vector-search-go/ # Go vector search samples +├── vector-search-java/ # Java vector search samples +├── vector-search-dotnet/ # .NET vector search samples +├── vector-search-agent-go/ # Go agent sample (separate from quickstart) +└── vector-search-agent-typescript/ # TypeScript agent sample (separate from quickstart) +``` + +Each vector-search sample directory contains: +- `src/` — Source files: one per algorithm (`ivf`, `hnsw`, `diskann`) + `utils` + `create_embeddings` + `show_indexes` +- `output/` — Expected output files: `ivf.txt`, `hnsw.txt`, `diskann.txt` +- `README.md` — Setup, usage, and troubleshooting documentation +- `.env.example` (Go, Python, TypeScript) or `appsettings.json` (.NET) — Configuration template + ## Language Dependencies ### Go @@ -10,11 +30,14 @@ Azure DocumentDB code samples for vector search and algorithm selection quicksta - go.mongodb.org/mongo-driver v1.17+ - github.com/Azure/azure-sdk-for-go/sdk/azidentity - github.com/Azure/azure-sdk-for-go/sdk/azcore +- github.com/openai/openai-go/v3 +- github.com/joho/godotenv ### Java - Java 17+ -- MongoDB Driver 5.3+ -- Azure Identity 1.15+ +- MongoDB Driver (mongodb-driver-sync) 5.3+ +- Azure Identity (azure-identity) 1.15+ +- Azure AI OpenAI (azure-ai-openai) - Maven 3.8+ ### Python @@ -22,6 +45,7 @@ Azure DocumentDB code samples for vector search and algorithm selection quicksta - pymongo >= 4.7 - azure-identity - openai +- python-dotenv ### TypeScript/Node.js - Node.js 20+ @@ -31,8 +55,9 @@ Azure DocumentDB code samples for vector search and algorithm selection quicksta ### .NET - .NET 8+ -- MongoDB.Driver 3.2+ +- MongoDB.Driver 3.0+ - Azure.Identity +- Azure.AI.OpenAI ## Consistent Variable Values @@ -40,16 +65,18 @@ All samples MUST use these environment 
variable names and defaults: | Variable | Default | Purpose | |----------|---------|---------| -| MONGO_CLUSTER_NAME | (required) | DocumentDB cluster name | +| MONGO_CLUSTER_NAME | (required) | DocumentDB cluster name (passwordless auth) | +| MONGO_CONNECTION_STRING | (none) | Full connection string (connection string auth) | | AZURE_OPENAI_EMBEDDING_ENDPOINT | (required) | Azure OpenAI endpoint | -| AZURE_OPENAI_EMBEDDING_MODEL | (required) | Embedding model deployment | -| DATA_FILE_WITH_VECTORS | ./Hotels_Vector.json | Path to data file | +| AZURE_OPENAI_EMBEDDING_MODEL | (required) | Embedding model deployment name | +| AZURE_OPENAI_EMBEDDING_API_VERSION | 2023-05-15 | Azure OpenAI API version | +| DATA_FILE_WITH_VECTORS | ../data/Hotels_Vector.json | Path to data file with embeddings | | EMBEDDED_FIELD | DescriptionVector | Vector field name in documents | | EMBEDDING_DIMENSIONS | 1536 | Vector dimensions | | LOAD_SIZE_BATCH | 100 | Batch size for document insertion | | EMBEDDING_SIZE_BATCH | 16 | Batch size for embedding generation | | AZURE_DOCUMENTDB_DATABASENAME | Hotels | Database name | -| SIMILARITY | (varies) | Similarity metric (cosine, euclidean, ip) | +| SIMILARITY | (varies) | Similarity metric (COS, euclidean, ip) | | ALGORITHM | (varies) | Algorithm (ivf, hnsw, diskann) | ## Consistent Algorithm Parameters @@ -68,11 +95,63 @@ All samples MUST use these environment variable names and defaults: - lBuild: 10 - lSearch: 40 +## Authentication + +All samples support two authentication modes. 
**Passwordless (OIDC) is preferred.** + +### Passwordless Authentication (Recommended) +- Uses `DefaultAzureCredential` / OIDC with `MONGO_CLUSTER_NAME` +- Connection URI format: `mongodb+srv://{clusterName}.global.mongocluster.cosmos.azure.com/` +- OIDC token scope: `https://ossrdbms-aad.database.windows.net/.default` +- Each language implements a utility function pair: `getClients()` and `getClientsPasswordless()` + +### Connection String Authentication +- Uses `MONGO_CONNECTION_STRING` with username/password +- Format: `mongodb+srv://username:password@{cluster}.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000` + +> **Note:** `mongocluster.cosmos.azure.com` is the current DocumentDB hostname — this is NOT a Cosmos DB reference. + +## Sample Execution Pattern + +All vector search samples follow this consistent lifecycle: + +1. **Initialize clients** — Create MongoDB and Azure OpenAI clients (passwordless preferred) +2. **Drop collection** — Drop the algorithm-specific collection if it exists (clean start) +3. **Create collection** — Create a fresh collection +4. **Load data** — Read `Hotels_Vector.json` and batch-insert documents +5. **Create vector index** — Create algorithm-specific vector index using `createIndexes` command with `cosmosSearch` key type +6. **Generate query embedding** — Embed the search query text using Azure OpenAI +7. **Perform vector search** — Run `$search` aggregation pipeline with `cosmosSearch` operator +8. **Print results** — Display `HotelName` and `score` for top results +9. 
**Cleanup** — Drop the collection in a `finally`/`defer` block + +### Naming Conventions +- **Collection names:** `hotels_{algorithm}` — e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann` +- **Index names:** `vectorIndex_{algorithm}` — e.g., `vectorIndex_ivf`, `vectorIndex_hnsw`, `vectorIndex_diskann` +- **Database name:** `Hotels` (hardcoded, matches `AZURE_DOCUMENTDB_DATABASENAME` default) + +### Standard Search Query +All samples use the same query text: `"quintessential lodging near running trails, eateries, retail"` + +### Vector Search Pipeline Structure +All languages use the same aggregation pipeline structure: +``` +[ + { "$search": { "cosmosSearch": { "vector": , "path": , "k": 5 } } }, + { "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } } +] +``` + +> **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference. + ## Rules -1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". +1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references. 2. **Vector field name is DescriptionVector.** Never default to "contentVector". -3. **Data file is shared.** All samples reference `../data/Hotels_Vector.json`. READMEs instruct users to copy it locally. +3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS` which defaults to `../data/Hotels_Vector.json` (the shared data location). .NET copies data locally to `data/Hotels_Vector.json` in the build output. 4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants. 5. 
**Database name variable is AZURE_DOCUMENTDB_DATABASENAME.** Do not use MONGO_DB_NAME or other variants. -6. **.NET uses appsettings.json** with same variable names under a "DocumentDB" section. +6. **.NET uses appsettings.json** with configuration sections: `AzureOpenAI`, `DataFiles`, `Embedding`, `MongoDB`, `VectorSearch`. +7. **Similarity metric is COS.** All vector index definitions use `"similarity": "COS"` (cosine similarity). +8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes. +9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability. From 870709f51d720fa214c876f279b19ac0b91f8288 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 10:03:30 -0700 Subject: [PATCH 3/8] docs: add agent samples multi-LLM convention to copilot-instructions Document the agent sample patterns as a separate convention section covering: - Planner/Synthesizer two-agent architecture with 3 LLM deployments - Agent-specific env vars vs quickstart env var mapping - Entry point pattern (upload/agent/cleanup) - Language-specific SDK stacks (Go raw vs TS LangChain) - OIDC authentication scopes - IVF numLists discrepancy (10 vs 1) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/copilot-instructions.md | 89 ++++++++++++++++++++++++++++++++- 1 file changed, 88 insertions(+), 1 deletion(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index b8a4db8..37bd396 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -144,9 +144,96 @@ All languages use the same aggregation pipeline structure: > **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference. 
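
The index-creation and search steps described in the instructions can be sketched as plain Python dictionaries. This is a hedged sketch and not the samples' actual code: the function names are hypothetical, and the `cosmosSearchOptions` field and the `vector-hnsw` and `vector-diskann` kind strings are assumptions that do not appear in the patch text (only `vector-ivf` does). The build-time parameter values, naming conventions, `COS` similarity, and pipeline shape are taken from the instructions file.

```python
# Build-time parameters from "Consistent Algorithm Parameters".
# The "kind" strings for HNSW and DiskANN are assumptions of this sketch.
ALGORITHM_OPTIONS = {
    "ivf": {"kind": "vector-ivf", "numLists": 1},
    "hnsw": {"kind": "vector-hnsw", "m": 16, "efConstruction": 64},
    "diskann": {"kind": "vector-diskann", "maxDegree": 20, "lBuild": 10},
}


def build_index_command(algorithm, embedded_field="DescriptionVector",
                        dimensions=1536):
    """Return a createIndexes command document for one algorithm."""
    options = dict(ALGORITHM_OPTIONS[algorithm])
    options["similarity"] = "COS"      # Rule 7: cosine similarity
    options["dimensions"] = dimensions
    return {
        "createIndexes": f"hotels_{algorithm}",       # collection naming rule
        "indexes": [
            {
                "name": f"vectorIndex_{algorithm}",   # index naming rule
                "key": {embedded_field: "cosmosSearch"},
                "cosmosSearchOptions": options,       # assumed field name
            }
        ],
    }


def build_search_pipeline(query_vector, embedded_field="DescriptionVector",
                          k=5):
    """Return the two-stage aggregation pipeline shown in the instructions."""
    return [
        {"$search": {"cosmosSearch": {
            "vector": query_vector,
            "path": embedded_field,
            "k": k,
        }}},
        {"$project": {"score": {"$meta": "searchScore"},
                      "document": "$$ROOT"}},
    ]
```

With PyMongo, the first dictionary would typically be passed to `Database.command` and the second to `Collection.aggregate`. Search-time parameters (`nProbes`, `efSearch`, `lSearch`) are left out because the instructions do not show where they belong in the query.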
+## Agent Samples (Multi-LLM Convention) + +Agent samples (`vector-search-agent-*`) use a **different convention** from quickstart samples. They orchestrate multiple LLM deployments and use a distinct set of environment variables. Do NOT mix agent conventions with quickstart conventions. + +### Architecture: Planner → Synthesizer + +Agent samples use a two-agent pipeline with three Azure OpenAI deployments: + +| Deployment | Role | Temperature | Purpose | +|------------|------|-------------|---------| +| Embedding | Vector search | — | Same as quickstart samples | +| Planner | Tool-calling agent | 0.0 | Transforms user query → tool call → retrieves search results | +| Synthesizer | Response generation | 0.3 | Takes search results + query → produces natural language recommendation | + +The planner invokes a `search_hotels_collection` tool that performs the vector search. The synthesizer receives the search results and generates a comparative hotel recommendation. + +### Agent Entry Points + +Agent samples have three separate entry points (not a single main file): + +| Entry Point | Purpose | +|-------------|---------| +| `upload` | Load hotel data, create embeddings, insert into DocumentDB, create vector index | +| `agent` | Run planner → synthesizer pipeline against an existing collection | +| `cleanup` | Drop the database | + +### Agent Environment Variables + +Agent samples use `AZURE_DOCUMENTDB_*` and `AZURE_OPENAI_*` prefixes consistently. These differ from quickstart variable names. 
+ +| Agent Variable | Quickstart Equivalent | Notes | +|---------------|----------------------|-------| +| `AZURE_OPENAI_ENDPOINT` | `AZURE_OPENAI_EMBEDDING_ENDPOINT` | Single endpoint for all 3 deployments | +| `AZURE_OPENAI_API_KEY` | — | For API key auth (not used in quickstarts) | +| `AZURE_DOCUMENTDB_CLUSTER` | `MONGO_CLUSTER_NAME` | Cluster name for passwordless auth | +| `AZURE_DOCUMENTDB_CONNECTION_STRING` | `MONGO_CONNECTION_STRING` | Full connection string | +| `AZURE_DOCUMENTDB_COLLECTION` | — | Collection name (agents parameterize this) | +| `AZURE_DOCUMENTDB_INDEX_NAME` | — | Vector index name (agents parameterize this) | +| `VECTOR_INDEX_ALGORITHM` | `ALGORITHM` | Default: `vector-ivf` | +| `VECTOR_SIMILARITY` | `SIMILARITY` | Default: `COS` | +| `USE_PASSWORDLESS` | — | `true`/`false` toggle for auth mode | +| `DEBUG` | — | `true`/`false` verbose logging | +| `QUERY` | — | Default: `"quintessential lodging near running trails, eateries, retail"` | +| `NEAREST_NEIGHBORS` | — | Default: `5` | + +**Agent-only variables (no quickstart equivalent):** + +| Variable | Default | Purpose | +|----------|---------|---------| +| `AZURE_OPENAI_EMBEDDING_DEPLOYMENT` | (required) | Embedding model deployment name | +| `AZURE_OPENAI_EMBEDDING_API_VERSION` | 2024-06-01 (Go), 2023-05-15 (TS) | Embedding API version | +| `AZURE_OPENAI_PLANNER_DEPLOYMENT` / `AZURE_OPENAI_PLANNER_MODEL` | (required) | Planner LLM deployment | +| `AZURE_OPENAI_PLANNER_API_VERSION` | (required) | Planner API version | +| `AZURE_OPENAI_SYNTH_DEPLOYMENT` / `AZURE_OPENAI_SYNTH_MODEL` | (required) | Synthesizer LLM deployment | +| `AZURE_OPENAI_SYNTH_API_VERSION` | (required) | Synthesizer API version | +| `IVF_NUM_LISTS` | 10 | IVF numLists (⚠️ differs from quickstart default of 1) | +| `HNSW_M` | 16 | HNSW m parameter | +| `HNSW_EF_CONSTRUCTION` | 64 | HNSW efConstruction parameter | +| `DISKANN_MAX_DEGREE` | 20 | DiskANN maxDegree parameter | +| `DISKANN_L_BUILD` | 10 | DiskANN lBuild 
parameter | + +### Agent Authentication + +Agents support passwordless (OIDC) and API key auth, toggled by `USE_PASSWORDLESS`. + +**OIDC scopes:** +- DocumentDB: `https://ossrdbms-aad.database.windows.net/.default` +- Azure OpenAI: `https://cognitiveservices.azure.com/.default` + +**MongoDB URI (passwordless):** `mongodb+srv://{cluster}.global.mongocluster.cosmos.azure.com/` +- Auth mechanism: `MONGODB-OIDC` with machine callback + +### Language-Specific SDK Stacks + +| Language | MongoDB | OpenAI | Agent Framework | +|----------|---------|--------|-----------------| +| Go | `go.mongodb.org/mongo-driver` (raw) | `github.com/openai/openai-go/v3` (raw) | Manual tool-calling loop | +| TypeScript | `mongodb` (cleanup only) | `@langchain/openai` | `langchain` + `@langchain/azure-cosmosdb` + `zod` | + +**TypeScript agents use LangChain** — the `@langchain/azure-cosmosdb` package manages the vector store, and `langchain`'s `createAgent` handles tool orchestration. This is a fundamentally different SDK stack from the quickstart TypeScript samples which use the raw MongoDB driver. + +**Go agents use raw SDKs** — both MongoDB driver and OpenAI SDK are used directly, with manual tool-calling implementation. + +### IVF numLists Discrepancy + +Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, select-algorithm) hardcode `numLists=1`. This is intentional — agent samples are designed for tunable, production-like configurations while quickstart samples use minimal values for simplicity. + ## Rules -1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references. +1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". 
Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references. 2. **Vector field name is DescriptionVector.** Never default to "contentVector". 3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS` which defaults to `../data/Hotels_Vector.json` (the shared data location). .NET copies data locally to `data/Hotels_Vector.json` in the build output. 4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants. From 0046a4e3b447dc437a58d1d99d45dd3985caa58c Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 10:07:45 -0700 Subject: [PATCH 4/8] docs: replace dotenv with CLI env var invocation examples - Remove python-dotenv and godotenv from dependency lists - Add Rule 10: no dotenv libraries - Add 'Running Samples' section with per-language CLI examples - Include agent sample multi-LLM invocation example - Add Windows PowerShell note Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/copilot-instructions.md | 76 ++++++++++++++++++++++++++++++++- 1 file changed, 74 insertions(+), 2 deletions(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 37bd396..584c6e6 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -31,7 +31,6 @@ Each vector-search sample directory contains: - github.com/Azure/azure-sdk-for-go/sdk/azidentity - github.com/Azure/azure-sdk-for-go/sdk/azcore - github.com/openai/openai-go/v3 -- github.com/joho/godotenv ### Java - Java 17+ @@ -45,7 +44,6 @@ Each vector-search sample directory contains: - pymongo >= 4.7 - azure-identity - openai -- python-dotenv ### TypeScript/Node.js - Node.js 20+ @@ -242,3 +240,77 @@ Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, 7. 
**Similarity metric is COS.** All vector index definitions use `"similarity": "COS"` (cosine similarity). 8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes. 9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability. +10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration. + +## Running Samples — CLI Invocation + +Environment variables are passed inline with the run command. Do NOT use `.env` files. Each example below shows the required variables for a vector-search quickstart sample. + +### Go + +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +go run ./src/ivf.go +``` + +### Python + +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +python src/ivf.py +``` + +### TypeScript/Node.js + +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +npx tsx src/ivf.ts +``` + +### Java + +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF" +``` + +### .NET + +.NET uses `appsettings.json` for configuration, but environment variables can override: + +```bash 
+DocumentDB__ClusterName=myCluster \ +AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \ +AzureOpenAI__DeploymentName=text-embedding-ada-002 \ +dotnet run +``` + +### Agent Samples (Multi-LLM) + +Agent samples require more variables for the planner and synthesizer deployments: + +```bash +AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \ +AZURE_OPENAI_EMBEDDING_API_VERSION=2024-06-01 \ +AZURE_OPENAI_PLANNER_DEPLOYMENT=gpt-4o \ +AZURE_OPENAI_PLANNER_API_VERSION=2024-06-01 \ +AZURE_OPENAI_SYNTH_DEPLOYMENT=gpt-4o \ +AZURE_OPENAI_SYNTH_API_VERSION=2024-06-01 \ +AZURE_DOCUMENTDB_CLUSTER=myCluster \ +AZURE_DOCUMENTDB_DATABASENAME=Hotels \ +AZURE_DOCUMENTDB_COLLECTION=hotels \ +AZURE_DOCUMENTDB_INDEX_NAME=vectorIndex \ +USE_PASSWORDLESS=true \ +go run ./cmd/agent/main.go +``` + +> **Windows (PowerShell):** Use `$env:VAR_NAME="value";` syntax or set variables beforehand with `$env:MONGO_CLUSTER_NAME="myCluster"` then run the command separately. From 9fec92f1cdd5fa5efe817ddce8f097d590b5867f Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 10:12:58 -0700 Subject: [PATCH 5/8] docs: add PowerShell examples alongside bash for all CLI invocations Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .github/copilot-instructions.md | 63 ++++++++++++++++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md index 584c6e6..c5a575e 100644 --- a/.github/copilot-instructions.md +++ b/.github/copilot-instructions.md @@ -248,6 +248,7 @@ Environment variables are passed inline with the run command. 
Do NOT use `.env` ### Go +**Bash:** ```bash MONGO_CLUSTER_NAME=myCluster \ AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ @@ -255,8 +256,17 @@ AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ go run ./src/ivf.go ``` +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +go run ./src/ivf.go +``` + ### Python +**Bash:** ```bash MONGO_CLUSTER_NAME=myCluster \ AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ @@ -264,8 +274,17 @@ AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ python src/ivf.py ``` +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +python src/ivf.py +``` + ### TypeScript/Node.js +**Bash:** ```bash MONGO_CLUSTER_NAME=myCluster \ AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ @@ -273,8 +292,17 @@ AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ npx tsx src/ivf.ts ``` +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +npx tsx src/ivf.ts +``` + ### Java +**Bash:** ```bash MONGO_CLUSTER_NAME=myCluster \ AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ @@ -282,10 +310,19 @@ AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF" ``` +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF" +``` + ### .NET .NET 
uses `appsettings.json` for configuration, but environment variables can override: +**Bash:** ```bash DocumentDB__ClusterName=myCluster \ AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \ @@ -293,10 +330,19 @@ AzureOpenAI__DeploymentName=text-embedding-ada-002 \ dotnet run ``` +**PowerShell:** +```powershell +$env:DocumentDB__ClusterName="myCluster" +$env:AzureOpenAI__Endpoint="https://myendpoint.openai.azure.com/" +$env:AzureOpenAI__DeploymentName="text-embedding-ada-002" +dotnet run +``` + ### Agent Samples (Multi-LLM) Agent samples require more variables for the planner and synthesizer deployments: +**Bash:** ```bash AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \ AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \ @@ -313,4 +359,19 @@ USE_PASSWORDLESS=true \ go run ./cmd/agent/main.go ``` -> **Windows (PowerShell):** Use `$env:VAR_NAME="value";` syntax or set variables beforehand with `$env:MONGO_CLUSTER_NAME="myCluster"` then run the command separately. 
+**PowerShell:** +```powershell +$env:AZURE_OPENAI_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-ada-002" +$env:AZURE_OPENAI_EMBEDDING_API_VERSION="2024-06-01" +$env:AZURE_OPENAI_PLANNER_DEPLOYMENT="gpt-4o" +$env:AZURE_OPENAI_PLANNER_API_VERSION="2024-06-01" +$env:AZURE_OPENAI_SYNTH_DEPLOYMENT="gpt-4o" +$env:AZURE_OPENAI_SYNTH_API_VERSION="2024-06-01" +$env:AZURE_DOCUMENTDB_CLUSTER="myCluster" +$env:AZURE_DOCUMENTDB_DATABASENAME="Hotels" +$env:AZURE_DOCUMENTDB_COLLECTION="hotels" +$env:AZURE_DOCUMENTDB_INDEX_NAME="vectorIndex" +$env:USE_PASSWORDLESS="true" +go run ./cmd/agent/main.go +``` From 89e7ae4aa3a3365f562d846c3e84b190394b4100 Mon Sep 17 00:00:00 2001 From: "Dina Berry (She/her)" Date: Fri, 8 May 2026 10:23:23 -0700 Subject: [PATCH 6/8] refactor: split copilot-instructions into scoped instruction files - Main file trimmed from 377 to 156 lines (core rules, env vars, patterns) - .github/instructions/cli-examples.instructions.md: Bash/PowerShell invocation examples, scoped to ai/** paths - .github/instructions/agent-samples.instructions.md: Multi-LLM agent convention, scoped to ai/vector-search-agent-*/** paths Follows GitHub Copilot guidance to keep copilot-instructions.md concise and use path-specific .instructions.md files for detailed/scoped content. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md               | 221 ------------------
 .../agent-samples.instructions.md             |  89 +++++++
 .../instructions/cli-examples.instructions.md | 136 +++++++++++
 3 files changed, 225 insertions(+), 221 deletions(-)
 create mode 100644 .github/instructions/agent-samples.instructions.md
 create mode 100644 .github/instructions/cli-examples.instructions.md

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index c5a575e..732fca5 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -142,93 +142,6 @@ All languages use the same aggregation pipeline structure:

 > **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference.

-## Agent Samples (Multi-LLM Convention)
-
-Agent samples (`vector-search-agent-*`) use a **different convention** from quickstart samples. They orchestrate multiple LLM deployments and use a distinct set of environment variables. Do NOT mix agent conventions with quickstart conventions.
-
-### Architecture: Planner → Synthesizer
-
-Agent samples use a two-agent pipeline with three Azure OpenAI deployments:
-
-| Deployment | Role | Temperature | Purpose |
-|------------|------|-------------|---------|
-| Embedding | Vector search | — | Same as quickstart samples |
-| Planner | Tool-calling agent | 0.0 | Transforms user query → tool call → retrieves search results |
-| Synthesizer | Response generation | 0.3 | Takes search results + query → produces natural language recommendation |
-
-The planner invokes a `search_hotels_collection` tool that performs the vector search. The synthesizer receives the search results and generates a comparative hotel recommendation.
-
-### Agent Entry Points
-
-Agent samples have three separate entry points (not a single main file):
-
-| Entry Point | Purpose |
-|-------------|---------|
-| `upload` | Load hotel data, create embeddings, insert into DocumentDB, create vector index |
-| `agent` | Run planner → synthesizer pipeline against an existing collection |
-| `cleanup` | Drop the database |
-
-### Agent Environment Variables
-
-Agent samples use `AZURE_DOCUMENTDB_*` and `AZURE_OPENAI_*` prefixes consistently. These differ from quickstart variable names.
-
-| Agent Variable | Quickstart Equivalent | Notes |
-|---------------|----------------------|-------|
-| `AZURE_OPENAI_ENDPOINT` | `AZURE_OPENAI_EMBEDDING_ENDPOINT` | Single endpoint for all 3 deployments |
-| `AZURE_OPENAI_API_KEY` | — | For API key auth (not used in quickstarts) |
-| `AZURE_DOCUMENTDB_CLUSTER` | `MONGO_CLUSTER_NAME` | Cluster name for passwordless auth |
-| `AZURE_DOCUMENTDB_CONNECTION_STRING` | `MONGO_CONNECTION_STRING` | Full connection string |
-| `AZURE_DOCUMENTDB_COLLECTION` | — | Collection name (agents parameterize this) |
-| `AZURE_DOCUMENTDB_INDEX_NAME` | — | Vector index name (agents parameterize this) |
-| `VECTOR_INDEX_ALGORITHM` | `ALGORITHM` | Default: `vector-ivf` |
-| `VECTOR_SIMILARITY` | `SIMILARITY` | Default: `COS` |
-| `USE_PASSWORDLESS` | — | `true`/`false` toggle for auth mode |
-| `DEBUG` | — | `true`/`false` verbose logging |
-| `QUERY` | — | Default: `"quintessential lodging near running trails, eateries, retail"` |
-| `NEAREST_NEIGHBORS` | — | Default: `5` |
-
-**Agent-only variables (no quickstart equivalent):**
-
-| Variable | Default | Purpose |
-|----------|---------|---------|
-| `AZURE_OPENAI_EMBEDDING_DEPLOYMENT` | (required) | Embedding model deployment name |
-| `AZURE_OPENAI_EMBEDDING_API_VERSION` | 2024-06-01 (Go), 2023-05-15 (TS) | Embedding API version |
-| `AZURE_OPENAI_PLANNER_DEPLOYMENT` / `AZURE_OPENAI_PLANNER_MODEL` | (required) | Planner LLM deployment |
-| `AZURE_OPENAI_PLANNER_API_VERSION` | (required) | Planner API version |
-| `AZURE_OPENAI_SYNTH_DEPLOYMENT` / `AZURE_OPENAI_SYNTH_MODEL` | (required) | Synthesizer LLM deployment |
-| `AZURE_OPENAI_SYNTH_API_VERSION` | (required) | Synthesizer API version |
-| `IVF_NUM_LISTS` | 10 | IVF numLists (⚠️ differs from quickstart default of 1) |
-| `HNSW_M` | 16 | HNSW m parameter |
-| `HNSW_EF_CONSTRUCTION` | 64 | HNSW efConstruction parameter |
-| `DISKANN_MAX_DEGREE` | 20 | DiskANN maxDegree parameter |
-| `DISKANN_L_BUILD` | 10 | DiskANN lBuild parameter |
-
-### Agent Authentication
-
-Agents support passwordless (OIDC) and API key auth, toggled by `USE_PASSWORDLESS`.
-
-**OIDC scopes:**
-- DocumentDB: `https://ossrdbms-aad.database.windows.net/.default`
-- Azure OpenAI: `https://cognitiveservices.azure.com/.default`
-
-**MongoDB URI (passwordless):** `mongodb+srv://{cluster}.global.mongocluster.cosmos.azure.com/`
-- Auth mechanism: `MONGODB-OIDC` with machine callback
-
-### Language-Specific SDK Stacks
-
-| Language | MongoDB | OpenAI | Agent Framework |
-|----------|---------|--------|-----------------|
-| Go | `go.mongodb.org/mongo-driver` (raw) | `github.com/openai/openai-go/v3` (raw) | Manual tool-calling loop |
-| TypeScript | `mongodb` (cleanup only) | `@langchain/openai` | `langchain` + `@langchain/azure-cosmosdb` + `zod` |
-
-**TypeScript agents use LangChain** — the `@langchain/azure-cosmosdb` package manages the vector store, and `langchain`'s `createAgent` handles tool orchestration. This is a fundamentally different SDK stack from the quickstart TypeScript samples which use the raw MongoDB driver.
-
-**Go agents use raw SDKs** — both MongoDB driver and OpenAI SDK are used directly, with manual tool-calling implementation.
-
-### IVF numLists Discrepancy
-
-Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, select-algorithm) hardcode `numLists=1`. This is intentional — agent samples are designed for tunable, production-like configurations while quickstart samples use minimal values for simplicity.
-
 ## Rules

 1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references.
@@ -241,137 +154,3 @@ Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search,
 8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes.
 9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability.
 10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration.
-
-## Running Samples — CLI Invocation
-
-Environment variables are passed inline with the run command. Do NOT use `.env` files. Each example below shows the required variables for a vector-search quickstart sample.
-
-### Go
-
-**Bash:**
-```bash
-MONGO_CLUSTER_NAME=myCluster \
-AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
-AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
-go run ./src/ivf.go
-```
-
-**PowerShell:**
-```powershell
-$env:MONGO_CLUSTER_NAME="myCluster"
-$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
-$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
-go run ./src/ivf.go
-```
-
-### Python
-
-**Bash:**
-```bash
-MONGO_CLUSTER_NAME=myCluster \
-AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
-AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
-python src/ivf.py
-```
-
-**PowerShell:**
-```powershell
-$env:MONGO_CLUSTER_NAME="myCluster"
-$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
-$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
-python src/ivf.py
-```
-
-### TypeScript/Node.js
-
-**Bash:**
-```bash
-MONGO_CLUSTER_NAME=myCluster \
-AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
-AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
-npx tsx src/ivf.ts
-```
-
-**PowerShell:**
-```powershell
-$env:MONGO_CLUSTER_NAME="myCluster"
-$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
-$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
-npx tsx src/ivf.ts
-```
-
-### Java
-
-**Bash:**
-```bash
-MONGO_CLUSTER_NAME=myCluster \
-AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
-AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
-mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
-```
-
-**PowerShell:**
-```powershell
-$env:MONGO_CLUSTER_NAME="myCluster"
-$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
-$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
-mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
-```
-
-### .NET
-
-.NET uses `appsettings.json` for configuration, but environment variables can override:
-
-**Bash:**
-```bash
-DocumentDB__ClusterName=myCluster \
-AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \
-AzureOpenAI__DeploymentName=text-embedding-ada-002 \
-dotnet run
-```
-
-**PowerShell:**
-```powershell
-$env:DocumentDB__ClusterName="myCluster"
-$env:AzureOpenAI__Endpoint="https://myendpoint.openai.azure.com/"
-$env:AzureOpenAI__DeploymentName="text-embedding-ada-002"
-dotnet run
-```
-
-### Agent Samples (Multi-LLM)
-
-Agent samples require more variables for the planner and synthesizer deployments:
-
-**Bash:**
-```bash
-AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \
-AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \
-AZURE_OPENAI_EMBEDDING_API_VERSION=2024-06-01 \
-AZURE_OPENAI_PLANNER_DEPLOYMENT=gpt-4o \
-AZURE_OPENAI_PLANNER_API_VERSION=2024-06-01 \
-AZURE_OPENAI_SYNTH_DEPLOYMENT=gpt-4o \
-AZURE_OPENAI_SYNTH_API_VERSION=2024-06-01 \
-AZURE_DOCUMENTDB_CLUSTER=myCluster \
-AZURE_DOCUMENTDB_DATABASENAME=Hotels \
-AZURE_DOCUMENTDB_COLLECTION=hotels \
-AZURE_DOCUMENTDB_INDEX_NAME=vectorIndex \
-USE_PASSWORDLESS=true \
-go run ./cmd/agent/main.go
-```
-
-**PowerShell:**
-```powershell
-$env:AZURE_OPENAI_ENDPOINT="https://myendpoint.openai.azure.com/"
-$env:AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-ada-002"
-$env:AZURE_OPENAI_EMBEDDING_API_VERSION="2024-06-01"
-$env:AZURE_OPENAI_PLANNER_DEPLOYMENT="gpt-4o"
-$env:AZURE_OPENAI_PLANNER_API_VERSION="2024-06-01"
-$env:AZURE_OPENAI_SYNTH_DEPLOYMENT="gpt-4o"
-$env:AZURE_OPENAI_SYNTH_API_VERSION="2024-06-01"
-$env:AZURE_DOCUMENTDB_CLUSTER="myCluster"
-$env:AZURE_DOCUMENTDB_DATABASENAME="Hotels"
-$env:AZURE_DOCUMENTDB_COLLECTION="hotels"
-$env:AZURE_DOCUMENTDB_INDEX_NAME="vectorIndex"
-$env:USE_PASSWORDLESS="true"
-go run ./cmd/agent/main.go
-```
diff --git a/.github/instructions/agent-samples.instructions.md b/.github/instructions/agent-samples.instructions.md
new file mode 100644
index 0000000..5c94b42
--- /dev/null
+++ b/.github/instructions/agent-samples.instructions.md
@@ -0,0 +1,89 @@
+---
+applyTo: "ai/vector-search-agent-*/**"
+---
+# Agent Samples (Multi-LLM Convention)
+
+Agent samples (`vector-search-agent-*`) use a **different convention** from quickstart samples. They orchestrate multiple LLM deployments and use a distinct set of environment variables. Do NOT mix agent conventions with quickstart conventions.
+
+## Architecture: Planner → Synthesizer
+
+Agent samples use a two-agent pipeline with three Azure OpenAI deployments:
+
+| Deployment | Role | Temperature | Purpose |
+|------------|------|-------------|---------|
+| Embedding | Vector search | — | Same as quickstart samples |
+| Planner | Tool-calling agent | 0.0 | Transforms user query → tool call → retrieves search results |
+| Synthesizer | Response generation | 0.3 | Takes search results + query → produces natural language recommendation |
+
+The planner invokes a `search_hotels_collection` tool that performs the vector search. The synthesizer receives the search results and generates a comparative hotel recommendation.
+
+## Agent Entry Points
+
+Agent samples have three separate entry points (not a single main file):
+
+| Entry Point | Purpose |
+|-------------|---------|
+| `upload` | Load hotel data, create embeddings, insert into DocumentDB, create vector index |
+| `agent` | Run planner → synthesizer pipeline against an existing collection |
+| `cleanup` | Drop the database |
+
+## Agent Environment Variables
+
+Agent samples use `AZURE_DOCUMENTDB_*` and `AZURE_OPENAI_*` prefixes consistently. These differ from quickstart variable names.
+
+| Agent Variable | Quickstart Equivalent | Notes |
+|---------------|----------------------|-------|
+| `AZURE_OPENAI_ENDPOINT` | `AZURE_OPENAI_EMBEDDING_ENDPOINT` | Single endpoint for all 3 deployments |
+| `AZURE_OPENAI_API_KEY` | — | For API key auth (not used in quickstarts) |
+| `AZURE_DOCUMENTDB_CLUSTER` | `MONGO_CLUSTER_NAME` | Cluster name for passwordless auth |
+| `AZURE_DOCUMENTDB_CONNECTION_STRING` | `MONGO_CONNECTION_STRING` | Full connection string |
+| `AZURE_DOCUMENTDB_COLLECTION` | — | Collection name (agents parameterize this) |
+| `AZURE_DOCUMENTDB_INDEX_NAME` | — | Vector index name (agents parameterize this) |
+| `VECTOR_INDEX_ALGORITHM` | `ALGORITHM` | Default: `vector-ivf` |
+| `VECTOR_SIMILARITY` | `SIMILARITY` | Default: `COS` |
+| `USE_PASSWORDLESS` | — | `true`/`false` toggle for auth mode |
+| `DEBUG` | — | `true`/`false` verbose logging |
+| `QUERY` | — | Default: `"quintessential lodging near running trails, eateries, retail"` |
+| `NEAREST_NEIGHBORS` | — | Default: `5` |
+
+**Agent-only variables (no quickstart equivalent):**
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `AZURE_OPENAI_EMBEDDING_DEPLOYMENT` | (required) | Embedding model deployment name |
+| `AZURE_OPENAI_EMBEDDING_API_VERSION` | 2024-06-01 (Go), 2023-05-15 (TS) | Embedding API version |
+| `AZURE_OPENAI_PLANNER_DEPLOYMENT` / `AZURE_OPENAI_PLANNER_MODEL` | (required) | Planner LLM deployment |
+| `AZURE_OPENAI_PLANNER_API_VERSION` | (required) | Planner API version |
+| `AZURE_OPENAI_SYNTH_DEPLOYMENT` / `AZURE_OPENAI_SYNTH_MODEL` | (required) | Synthesizer LLM deployment |
+| `AZURE_OPENAI_SYNTH_API_VERSION` | (required) | Synthesizer API version |
+| `IVF_NUM_LISTS` | 10 | IVF numLists (⚠️ differs from quickstart default of 1) |
+| `HNSW_M` | 16 | HNSW m parameter |
+| `HNSW_EF_CONSTRUCTION` | 64 | HNSW efConstruction parameter |
+| `DISKANN_MAX_DEGREE` | 20 | DiskANN maxDegree parameter |
+| `DISKANN_L_BUILD` | 10 | DiskANN lBuild parameter |
+
+## Agent Authentication
+
+Agents support passwordless (OIDC) and API key auth, toggled by `USE_PASSWORDLESS`.
+
+**OIDC scopes:**
+- DocumentDB: `https://ossrdbms-aad.database.windows.net/.default`
+- Azure OpenAI: `https://cognitiveservices.azure.com/.default`
+
+**MongoDB URI (passwordless):** `mongodb+srv://{cluster}.global.mongocluster.cosmos.azure.com/`
+- Auth mechanism: `MONGODB-OIDC` with machine callback
+
+## Language-Specific SDK Stacks
+
+| Language | MongoDB | OpenAI | Agent Framework |
+|----------|---------|--------|-----------------|
+| Go | `go.mongodb.org/mongo-driver` (raw) | `github.com/openai/openai-go/v3` (raw) | Manual tool-calling loop |
+| TypeScript | `mongodb` (cleanup only) | `@langchain/openai` | `langchain` + `@langchain/azure-cosmosdb` + `zod` |
+
+**TypeScript agents use LangChain** — the `@langchain/azure-cosmosdb` package manages the vector store, and `langchain`'s `createAgent` handles tool orchestration. This is a fundamentally different SDK stack from the quickstart TypeScript samples which use the raw MongoDB driver.
+
+**Go agents use raw SDKs** — both MongoDB driver and OpenAI SDK are used directly, with manual tool-calling implementation.
+
+## IVF numLists Discrepancy
+
+Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, select-algorithm) hardcode `numLists=1`. This is intentional — agent samples are designed for tunable, production-like configurations while quickstart samples use minimal values for simplicity.
diff --git a/.github/instructions/cli-examples.instructions.md b/.github/instructions/cli-examples.instructions.md
new file mode 100644
index 0000000..a85c2e9
--- /dev/null
+++ b/.github/instructions/cli-examples.instructions.md
@@ -0,0 +1,136 @@
+---
+applyTo: "ai/**"
+---
+# Running Samples — CLI Invocation
+
+Environment variables are passed inline with the run command. Do NOT use `.env` files. Each example below shows the required variables for a vector-search quickstart sample.
+
+## Go
+
+**Bash:**
+```bash
+MONGO_CLUSTER_NAME=myCluster \
+AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
+AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
+go run ./src/ivf.go
+```
+
+**PowerShell:**
+```powershell
+$env:MONGO_CLUSTER_NAME="myCluster"
+$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
+$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
+go run ./src/ivf.go
+```
+
+## Python
+
+**Bash:**
+```bash
+MONGO_CLUSTER_NAME=myCluster \
+AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
+AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
+python src/ivf.py
+```
+
+**PowerShell:**
+```powershell
+$env:MONGO_CLUSTER_NAME="myCluster"
+$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
+$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
+python src/ivf.py
+```
+
+## TypeScript/Node.js
+
+**Bash:**
+```bash
+MONGO_CLUSTER_NAME=myCluster \
+AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
+AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
+npx tsx src/ivf.ts
+```
+
+**PowerShell:**
+```powershell
+$env:MONGO_CLUSTER_NAME="myCluster"
+$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
+$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
+npx tsx src/ivf.ts
+```
+
+## Java
+
+**Bash:**
+```bash
+MONGO_CLUSTER_NAME=myCluster \
+AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \
+AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \
+mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
+```
+
+**PowerShell:**
+```powershell
+$env:MONGO_CLUSTER_NAME="myCluster"
+$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/"
+$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002"
+mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF"
+```
+
+## .NET
+
+.NET uses `appsettings.json` for configuration, but environment variables can override:
+
+**Bash:**
+```bash
+DocumentDB__ClusterName=myCluster \
+AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \
+AzureOpenAI__DeploymentName=text-embedding-ada-002 \
+dotnet run
+```
+
+**PowerShell:**
+```powershell
+$env:DocumentDB__ClusterName="myCluster"
+$env:AzureOpenAI__Endpoint="https://myendpoint.openai.azure.com/"
+$env:AzureOpenAI__DeploymentName="text-embedding-ada-002"
+dotnet run
+```
+
+## Agent Samples (Multi-LLM)
+
+Agent samples require more variables for the planner and synthesizer deployments:
+
+**Bash:**
+```bash
+AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \
+AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \
+AZURE_OPENAI_EMBEDDING_API_VERSION=2024-06-01 \
+AZURE_OPENAI_PLANNER_DEPLOYMENT=gpt-4o \
+AZURE_OPENAI_PLANNER_API_VERSION=2024-06-01 \
+AZURE_OPENAI_SYNTH_DEPLOYMENT=gpt-4o \
+AZURE_OPENAI_SYNTH_API_VERSION=2024-06-01 \
+AZURE_DOCUMENTDB_CLUSTER=myCluster \
+AZURE_DOCUMENTDB_DATABASENAME=Hotels \
+AZURE_DOCUMENTDB_COLLECTION=hotels \
+AZURE_DOCUMENTDB_INDEX_NAME=vectorIndex \
+USE_PASSWORDLESS=true \
+go run ./cmd/agent/main.go
+```
+
+**PowerShell:**
+```powershell
+$env:AZURE_OPENAI_ENDPOINT="https://myendpoint.openai.azure.com/"
+$env:AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-ada-002"
+$env:AZURE_OPENAI_EMBEDDING_API_VERSION="2024-06-01"
+$env:AZURE_OPENAI_PLANNER_DEPLOYMENT="gpt-4o"
+$env:AZURE_OPENAI_PLANNER_API_VERSION="2024-06-01"
+$env:AZURE_OPENAI_SYNTH_DEPLOYMENT="gpt-4o"
+$env:AZURE_OPENAI_SYNTH_API_VERSION="2024-06-01"
+$env:AZURE_DOCUMENTDB_CLUSTER="myCluster"
+$env:AZURE_DOCUMENTDB_DATABASENAME="Hotels"
+$env:AZURE_DOCUMENTDB_COLLECTION="hotels"
+$env:AZURE_DOCUMENTDB_INDEX_NAME="vectorIndex"
+$env:USE_PASSWORDLESS="true"
+go run ./cmd/agent/main.go
+```
From 28db9284d6903623c965d2afd5e68fc54b676925 Mon Sep 17 00:00:00 2001
From: "Dina Berry (She/her)"
Date: Fri, 8 May 2026 10:27:39 -0700
Subject: [PATCH 7/8] refactor: extract auth & execution patterns to scoped instruction file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- copilot-instructions.md now 107 lines (from 377 original)
- New: .github/instructions/execution-patterns.instructions.md
  Scoped to ai/vector-search-*/** — covers auth modes, lifecycle, naming conventions, search query, pipeline structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md               | 49 -----------------
 .../execution-patterns.instructions.md        | 53 +++++++++++++++++++
 2 files changed, 53 insertions(+), 49 deletions(-)
 create mode 100644 .github/instructions/execution-patterns.instructions.md

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index 732fca5..e4abf6e 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -93,55 +93,6 @@ All samples MUST use these environment variable names and defaults:
 - lBuild: 10
 - lSearch: 40

-## Authentication
-
-All samples support two authentication modes. **Passwordless (OIDC) is preferred.**
-
-### Passwordless Authentication (Recommended)
-- Uses `DefaultAzureCredential` / OIDC with `MONGO_CLUSTER_NAME`
-- Connection URI format: `mongodb+srv://{clusterName}.global.mongocluster.cosmos.azure.com/`
-- OIDC token scope: `https://ossrdbms-aad.database.windows.net/.default`
-- Each language implements a utility function pair: `getClients()` and `getClientsPasswordless()`
-
-### Connection String Authentication
-- Uses `MONGO_CONNECTION_STRING` with username/password
-- Format: `mongodb+srv://username:password@{cluster}.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000`
-
-> **Note:** `mongocluster.cosmos.azure.com` is the current DocumentDB hostname — this is NOT a Cosmos DB reference.
-
-## Sample Execution Pattern
-
-All vector search samples follow this consistent lifecycle:
-
-1. **Initialize clients** — Create MongoDB and Azure OpenAI clients (passwordless preferred)
-2. **Drop collection** — Drop the algorithm-specific collection if it exists (clean start)
-3. **Create collection** — Create a fresh collection
-4. **Load data** — Read `Hotels_Vector.json` and batch-insert documents
-5. **Create vector index** — Create algorithm-specific vector index using `createIndexes` command with `cosmosSearch` key type
-6. **Generate query embedding** — Embed the search query text using Azure OpenAI
-7. **Perform vector search** — Run `$search` aggregation pipeline with `cosmosSearch` operator
-8. **Print results** — Display `HotelName` and `score` for top results
-9. **Cleanup** — Drop the collection in a `finally`/`defer` block
-
-### Naming Conventions
-- **Collection names:** `hotels_{algorithm}` — e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`
-- **Index names:** `vectorIndex_{algorithm}` — e.g., `vectorIndex_ivf`, `vectorIndex_hnsw`, `vectorIndex_diskann`
-- **Database name:** `Hotels` (hardcoded, matches `AZURE_DOCUMENTDB_DATABASENAME` default)
-
-### Standard Search Query
-All samples use the same query text: `"quintessential lodging near running trails, eateries, retail"`
-
-### Vector Search Pipeline Structure
-All languages use the same aggregation pipeline structure:
-```
-[
-  { "$search": { "cosmosSearch": { "vector": , "path": , "k": 5 } } },
-  { "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } }
-]
-```
-
-> **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference.
-
 ## Rules

 1. **No Cosmos DB references.**Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references.
diff --git a/.github/instructions/execution-patterns.instructions.md b/.github/instructions/execution-patterns.instructions.md
new file mode 100644
index 0000000..91b1b83
--- /dev/null
+++ b/.github/instructions/execution-patterns.instructions.md
@@ -0,0 +1,53 @@
+---
+applyTo: "ai/vector-search-*/**"
+---
+# Sample Execution Patterns
+
+## Authentication
+
+All samples support two authentication modes. **Passwordless (OIDC) is preferred.**
+
+### Passwordless Authentication (Recommended)
+- Uses `DefaultAzureCredential` / OIDC with `MONGO_CLUSTER_NAME`
+- Connection URI format: `mongodb+srv://{clusterName}.global.mongocluster.cosmos.azure.com/`
+- OIDC token scope: `https://ossrdbms-aad.database.windows.net/.default`
+- Each language implements a utility function pair: `getClients()` and `getClientsPasswordless()`
+
+### Connection String Authentication
+- Uses `MONGO_CONNECTION_STRING` with username/password
+- Format: `mongodb+srv://username:password@{cluster}.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000`
+
+> **Note:** `mongocluster.cosmos.azure.com` is the current DocumentDB hostname — this is NOT a Cosmos DB reference.
+
+## Sample Execution Pattern
+
+All vector search samples follow this consistent lifecycle:
+
+1. **Initialize clients** — Create MongoDB and Azure OpenAI clients (passwordless preferred)
+2. **Drop collection** — Drop the algorithm-specific collection if it exists (clean start)
+3. **Create collection** — Create a fresh collection
+4. **Load data** — Read `Hotels_Vector.json` and batch-insert documents
+5. **Create vector index** — Create algorithm-specific vector index using `createIndexes` command with `cosmosSearch` key type
+6. **Generate query embedding** — Embed the search query text using Azure OpenAI
+7. **Perform vector search** — Run `$search` aggregation pipeline with `cosmosSearch` operator
+8. **Print results** — Display `HotelName` and `score` for top results
+9. **Cleanup** — Drop the collection in a `finally`/`defer` block
+
+### Naming Conventions
+- **Collection names:** `hotels_{algorithm}` — e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`
+- **Index names:** `vectorIndex_{algorithm}` — e.g., `vectorIndex_ivf`, `vectorIndex_hnsw`, `vectorIndex_diskann`
+- **Database name:** `Hotels` (hardcoded, matches `AZURE_DOCUMENTDB_DATABASENAME` default)
+
+### Standard Search Query
+All samples use the same query text: `"quintessential lodging near running trails, eateries, retail"`
+
+### Vector Search Pipeline Structure
+All languages use the same aggregation pipeline structure:
+```
+[
+  { "$search": { "cosmosSearch": { "vector": , "path": , "k": 5 } } },
+  { "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } }
+]
+```
+
+> **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference.
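
As a rough illustration of the pipeline structure in the file above, the following Python sketch builds the two-stage aggregation as a plain list. The function name `build_vector_search_pipeline` is hypothetical and not taken from the samples. The `DescriptionVector` default is assumed from the `EMBEDDED_FIELD` default, and `k=5` follows the template.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    path: str = "DescriptionVector",  # assumption: mirrors the EMBEDDED_FIELD default
    k: int = 5,                       # the template fixes k at 5
) -> List[Dict[str, Any]]:
    """Return the $search + $project stages described in the template.

    Hypothetical helper for illustration only; the real samples may
    construct the pipeline inline.
    """
    return [
        # Stage 1: vector search through the cosmosSearch operator.
        {"$search": {"cosmosSearch": {"vector": list(query_vector), "path": path, "k": k}}},
        # Stage 2: expose the similarity score next to the full document.
        {"$project": {"score": {"$meta": "searchScore"}, "document": "$$ROOT"}},
    ]


if __name__ == "__main__":
    # A tiny placeholder vector; real embeddings have EMBEDDING_DIMENSIONS entries.
    for stage in build_vector_search_pipeline([0.1, 0.2, 0.3]):
        print(stage)
```

With pymongo, the returned list could be passed to `collection.aggregate(...)`. That call is left out so the sketch runs without a cluster.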
From 0bd759663d7cda43ce861de0b0d873d0712d0cde Mon Sep 17 00:00:00 2001
From: "Dina Berry (She/her)"
Date: Fri, 8 May 2026 10:35:57 -0700
Subject: [PATCH 8/8] =?UTF-8?q?fix:=20address=20review=20feedback=20?=
 =?UTF-8?q?=E2=80=94=20naming,=20scoping,=20clarity?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Add Sample Categories section (quickstart vs agent distinction)
- Add IVF numLists footnote (quickstart=1, agents=10 intentional)
- Clarify .NET appsettings.json + env var override pattern
- Add Rule 11: collection naming convention (hotels_{algorithm})
- Add Rule 12: k=5 for vector search results
- Make DescriptionVector explicit in pipeline template
- Add note that CLI examples apply to all 3 algorithms

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/copilot-instructions.md                         | 12 +++++++++---
 .github/instructions/cli-examples.instructions.md       |  2 ++
 .github/instructions/execution-patterns.instructions.md |  2 +-
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
index e4abf6e..779dff3 100644
--- a/.github/copilot-instructions.md
+++ b/.github/copilot-instructions.md
@@ -17,7 +17,11 @@ ai/
 └── vector-search-agent-typescript/ # TypeScript agent sample (separate from quickstart)
 ```

-Each vector-search sample directory contains:
+### Sample Categories
+- **Quickstart samples** (`vector-search-{language}/`): Single algorithm per file, one entry point, uses `MONGO_CLUSTER_NAME` + quickstart env vars
+- **Agent samples** (`vector-search-agent-{language}/`): Multi-LLM orchestration, three entry points (upload/agent/cleanup), uses `AZURE_DOCUMENTDB_*` env vars
+
+Each quickstart sample directory contains:
 - `src/` — Source files: one per algorithm (`ivf`, `hnsw`, `diskann`) + `utils` + `create_embeddings` + `show_indexes`
 - `output/` — Expected output files: `ivf.txt`, `hnsw.txt`, `diskann.txt`
 - `README.md` — Setup, usage, and troubleshooting documentation
@@ -80,7 +84,7 @@ All samples MUST use these environment variable names and defaults:
 ## Consistent Algorithm Parameters

 ### IVF
-- numLists: 1
+- numLists: 1 *(quickstart samples; agent samples use `IVF_NUM_LISTS=10` for production-like config)*
 - nProbes: 1

 ### HNSW
@@ -100,8 +104,10 @@ All samples MUST use these environment variable names and defaults:
 3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS` which defaults to `../data/Hotels_Vector.json` (the shared data location). .NET copies data locally to `data/Hotels_Vector.json` in the build output.
 4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants.
 5. **Database name variable is AZURE_DOCUMENTDB_DATABASENAME.** Do not use MONGO_DB_NAME or other variants.
-6. **.NET uses appsettings.json** with configuration sections: `AzureOpenAI`, `DataFiles`, `Embedding`, `MongoDB`, `VectorSearch`.
+6. **.NET uses appsettings.json** with configuration sections: `AzureOpenAI`, `DataFiles`, `Embedding`, `MongoDB`, `VectorSearch`. Environment variables override config using `Section__Key` format (e.g., `AzureOpenAI__Endpoint`).
 7. **Similarity metric is COS.** All vector index definitions use `"similarity": "COS"` (cosine similarity).
 8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes.
 9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability.
 10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration.
+11. **Collection naming:** `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Index naming: `vectorIndex_{algorithm}`.
+12. **Vector search uses k=5.** All samples return top 5 results. Do not parameterize k unless explicitly required.
diff --git a/.github/instructions/cli-examples.instructions.md b/.github/instructions/cli-examples.instructions.md
index a85c2e9..678fba3 100644
--- a/.github/instructions/cli-examples.instructions.md
+++ b/.github/instructions/cli-examples.instructions.md
@@ -5,6 +5,8 @@ applyTo: "ai/**"

 Environment variables are passed inline with the run command. Do NOT use `.env` files. Each example below shows the required variables for a vector-search quickstart sample.

+> **Note:** Examples show `ivf` but the same pattern applies to all algorithms — replace `ivf` with `hnsw` or `diskann` in file/class names.
+
 ## Go

 **Bash:**
diff --git a/.github/instructions/execution-patterns.instructions.md b/.github/instructions/execution-patterns.instructions.md
index 91b1b83..d97db20 100644
--- a/.github/instructions/execution-patterns.instructions.md
+++ b/.github/instructions/execution-patterns.instructions.md
@@ -45,7 +45,7 @@ All samples use the same query text: `"quintessential lodging near running trail
 All languages use the same aggregation pipeline structure:
 ```
 [
-  { "$search": { "cosmosSearch": { "vector": , "path": , "k": 5 } } },
+  { "$search": { "cosmosSearch": { "vector": , "path": "DescriptionVector", "k": 5 } } },
   { "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } }
 ]
 ```
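
The naming rule and the IVF `numLists` split described in this patch can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `sample_names` and `ivf_num_lists` are hypothetical helpers and do not appear in the samples. The values (`hotels_{algorithm}`, `vectorIndex_{algorithm}`, quickstart `numLists=1`, agent default `IVF_NUM_LISTS=10`) come from the instruction files.

```python
import os
from typing import Dict, Mapping, Optional

ALGORITHMS = ("ivf", "hnsw", "diskann")


def sample_names(algorithm: str) -> Dict[str, str]:
    """Derive collection and index names for one algorithm (Rule 11)."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return {
        "collection": f"hotels_{algorithm}",
        "index": f"vectorIndex_{algorithm}",
    }


def ivf_num_lists(is_agent_sample: bool, env: Optional[Mapping[str, str]] = None) -> int:
    """Pick numLists: quickstarts hardcode 1, agents read IVF_NUM_LISTS (default 10)."""
    if not is_agent_sample:
        return 1  # quickstart samples ignore the environment here
    source = os.environ if env is None else env
    return int(source.get("IVF_NUM_LISTS", "10"))


if __name__ == "__main__":
    for name in ALGORITHMS:
        print(name, sample_names(name))
    print("quickstart numLists:", ivf_num_lists(False))
```

Passing `env` explicitly keeps the helper testable without touching the process environment, which fits the rule that variables are supplied on the command line rather than loaded from `.env` files.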