diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..779dff3 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,113 @@ +# DocumentDB Samples — Copilot Instructions + +## Project Overview +Azure DocumentDB code samples for vector search and algorithm selection quickstart articles. + +## Repository Structure + +``` +ai/ +├── data/ # Shared data files (Hotels.json, Hotels_Vector.json) +├── vector-search-python/ # Python vector search samples +├── vector-search-typescript/ # TypeScript/Node.js vector search samples +├── vector-search-go/ # Go vector search samples +├── vector-search-java/ # Java vector search samples +├── vector-search-dotnet/ # .NET vector search samples +├── vector-search-agent-go/ # Go agent sample (separate from quickstart) +└── vector-search-agent-typescript/ # TypeScript agent sample (separate from quickstart) +``` + +### Sample Categories +- **Quickstart samples** (`vector-search-{language}/`): Single algorithm per file, one entry point, uses `MONGO_CLUSTER_NAME` + quickstart env vars +- **Agent samples** (`vector-search-agent-{language}/`): Multi-LLM orchestration, three entry points (upload/agent/cleanup), uses `AZURE_DOCUMENTDB_*` env vars + +Each quickstart sample directory contains: +- `src/` — Source files: one per algorithm (`ivf`, `hnsw`, `diskann`) + `utils` + `create_embeddings` + `show_indexes` +- `output/` — Expected output files: `ivf.txt`, `hnsw.txt`, `diskann.txt` +- `README.md` — Setup, usage, and troubleshooting documentation +- `.env.example` (Go, Python, TypeScript) or `appsettings.json` (.NET) — Configuration template + +## Language Dependencies + +### Go +- Go 1.21+ +- go.mongodb.org/mongo-driver v1.17+ +- github.com/Azure/azure-sdk-for-go/sdk/azidentity +- github.com/Azure/azure-sdk-for-go/sdk/azcore +- github.com/openai/openai-go/v3 + +### Java +- Java 17+ +- MongoDB Driver (mongodb-driver-sync) 5.3+ +- Azure Identity (azure-identity) 1.15+ +- Azure 
AI OpenAI (azure-ai-openai) +- Maven 3.8+ + +### Python +- Python 3.10+ +- pymongo >= 4.7 +- azure-identity +- openai + +### TypeScript/Node.js +- Node.js 20+ +- mongodb 6.12+ +- @azure/identity +- openai + +### .NET +- .NET 8+ +- MongoDB.Driver 3.0+ +- Azure.Identity +- Azure.AI.OpenAI + +## Consistent Variable Values + +All samples MUST use these environment variable names and defaults: + +| Variable | Default | Purpose | +|----------|---------|---------| +| MONGO_CLUSTER_NAME | (required) | DocumentDB cluster name (passwordless auth) | +| MONGO_CONNECTION_STRING | (none) | Full connection string (connection string auth) | +| AZURE_OPENAI_EMBEDDING_ENDPOINT | (required) | Azure OpenAI endpoint | +| AZURE_OPENAI_EMBEDDING_MODEL | (required) | Embedding model deployment name | +| AZURE_OPENAI_EMBEDDING_API_VERSION | 2023-05-15 | Azure OpenAI API version | +| DATA_FILE_WITH_VECTORS | ../data/Hotels_Vector.json | Path to data file with embeddings | +| EMBEDDED_FIELD | DescriptionVector | Vector field name in documents | +| EMBEDDING_DIMENSIONS | 1536 | Vector dimensions | +| LOAD_SIZE_BATCH | 100 | Batch size for document insertion | +| EMBEDDING_SIZE_BATCH | 16 | Batch size for embedding generation | +| AZURE_DOCUMENTDB_DATABASENAME | Hotels | Database name | +| SIMILARITY | (varies) | Similarity metric (COS, euclidean, ip) | +| ALGORITHM | (varies) | Algorithm (ivf, hnsw, diskann) | + +## Consistent Algorithm Parameters + +### IVF +- numLists: 1 *(quickstart samples; agent samples use `IVF_NUM_LISTS=10` for production-like config)* +- nProbes: 1 + +### HNSW +- m: 16 +- efConstruction: 64 +- efSearch: 40 + +### DiskANN +- maxDegree: 20 +- lBuild: 10 +- lSearch: 40 + +## Rules + +1. **No Cosmos DB references.** Never use "Cosmos DB", "cosmosdb", "MongoDB vCore", or "mongo.cosmos.azure.com". Always use "Azure DocumentDB" and "documentdb.azure.com". 
Exception: `mongocluster.cosmos.azure.com` (hostname), `cosmosSearch` (API command), and `ms-azuretools.vscode-cosmosdb` (VS Code extension) are valid and NOT Cosmos references. +2. **Vector field name is DescriptionVector.** Never default to "contentVector". +3. **Data file path from env var.** Code reads `DATA_FILE_WITH_VECTORS` which defaults to `../data/Hotels_Vector.json` (the shared data location). .NET copies data locally to `data/Hotels_Vector.json` in the build output. +4. **Batch size is LOAD_SIZE_BATCH=100.** Do not use BATCH_SIZE or other variants. +5. **Database name variable is AZURE_DOCUMENTDB_DATABASENAME.** Do not use MONGO_DB_NAME or other variants. +6. **.NET uses appsettings.json** with configuration sections: `AzureOpenAI`, `DataFiles`, `Embedding`, `MongoDB`, `VectorSearch`. Environment variables override config using `Section__Key` format (e.g., `AzureOpenAI__Endpoint`). +7. **Similarity metric is COS.** All vector index definitions use `"similarity": "COS"` (cosine similarity). +8. **Output files are committed.** Each sample has an `output/` directory with expected output for each algorithm (`ivf.txt`, `hnsw.txt`, `diskann.txt`). Update these when output format changes. +9. **DocumentDB supports all index types at any dataset size.** IVF, HNSW, and DiskANN are all available — do not imply tier restrictions limit algorithm availability. +10. **No dotenv libraries.** Do NOT use `python-dotenv`, `godotenv`, `dotenv` (npm), or any `.env` file-loading library. Environment variables must be passed via the CLI invocation, not loaded from `.env` files at runtime. This keeps samples explicit and avoids hidden configuration. +11. **Collection naming:** `hotels_{algorithm}` (e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann`). Index naming: `vectorIndex_{algorithm}`. +12. **Vector search uses k=5.** All samples return top 5 results. Do not parameterize k unless explicitly required. 
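
The variable table and rules 4, 5, 10, and 11 in `copilot-instructions.md` can be illustrated with a minimal Python sketch. It reads settings only from the process environment (no `.env` loading), applies the documented defaults, and derives the collection and index names. The helper names `load_quickstart_config` and `resource_names` are hypothetical and are not taken from the samples; treating `MONGO_CLUSTER_NAME` as always required follows the table and assumes passwordless auth.

```python
import os
from typing import Mapping

# Required variables and defaults copied from the table in copilot-instructions.md.
REQUIRED = (
    "MONGO_CLUSTER_NAME",
    "AZURE_OPENAI_EMBEDDING_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_MODEL",
)
DEFAULTS = {
    "AZURE_OPENAI_EMBEDDING_API_VERSION": "2023-05-15",
    "DATA_FILE_WITH_VECTORS": "../data/Hotels_Vector.json",
    "EMBEDDED_FIELD": "DescriptionVector",
    "EMBEDDING_DIMENSIONS": "1536",
    "LOAD_SIZE_BATCH": "100",
    "EMBEDDING_SIZE_BATCH": "16",
    "AZURE_DOCUMENTDB_DATABASENAME": "Hotels",
}
INT_KEYS = ("EMBEDDING_DIMENSIONS", "LOAD_SIZE_BATCH", "EMBEDDING_SIZE_BATCH")


def load_quickstart_config(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: read settings from the environment only (no .env files)."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    # Numeric settings arrive as strings; convert them once here.
    for name in INT_KEYS:
        config[name] = int(config[name])
    return config


def resource_names(algorithm: str) -> tuple[str, str]:
    """Hypothetical helper: collection and index names following rule 11."""
    if algorithm not in ("ivf", "hnsw", "diskann"):
        raise ValueError(f"Unknown algorithm: {algorithm}")
    return f"hotels_{algorithm}", f"vectorIndex_{algorithm}"
```

Passing the environment as a mapping argument keeps the helper easy to exercise with a plain dictionary, while the default still reads whatever the CLI invocation supplied.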
diff --git a/.github/instructions/agent-samples.instructions.md b/.github/instructions/agent-samples.instructions.md new file mode 100644 index 0000000..5c94b42 --- /dev/null +++ b/.github/instructions/agent-samples.instructions.md @@ -0,0 +1,89 @@ +--- +applyTo: "ai/vector-search-agent-*/**" +--- +# Agent Samples (Multi-LLM Convention) + +Agent samples (`vector-search-agent-*`) use a **different convention** from quickstart samples. They orchestrate multiple LLM deployments and use a distinct set of environment variables. Do NOT mix agent conventions with quickstart conventions. + +## Architecture: Planner → Synthesizer + +Agent samples use a two-agent pipeline with three Azure OpenAI deployments: + +| Deployment | Role | Temperature | Purpose | +|------------|------|-------------|---------| +| Embedding | Vector search | — | Same as quickstart samples | +| Planner | Tool-calling agent | 0.0 | Transforms user query → tool call → retrieves search results | +| Synthesizer | Response generation | 0.3 | Takes search results + query → produces natural language recommendation | + +The planner invokes a `search_hotels_collection` tool that performs the vector search. The synthesizer receives the search results and generates a comparative hotel recommendation. + +## Agent Entry Points + +Agent samples have three separate entry points (not a single main file): + +| Entry Point | Purpose | +|-------------|---------| +| `upload` | Load hotel data, create embeddings, insert into DocumentDB, create vector index | +| `agent` | Run planner → synthesizer pipeline against an existing collection | +| `cleanup` | Drop the database | + +## Agent Environment Variables + +Agent samples use `AZURE_DOCUMENTDB_*` and `AZURE_OPENAI_*` prefixes consistently. These differ from quickstart variable names. 
+ +| Agent Variable | Quickstart Equivalent | Notes | +|---------------|----------------------|-------| +| `AZURE_OPENAI_ENDPOINT` | `AZURE_OPENAI_EMBEDDING_ENDPOINT` | Single endpoint for all 3 deployments | +| `AZURE_OPENAI_API_KEY` | — | For API key auth (not used in quickstarts) | +| `AZURE_DOCUMENTDB_CLUSTER` | `MONGO_CLUSTER_NAME` | Cluster name for passwordless auth | +| `AZURE_DOCUMENTDB_CONNECTION_STRING` | `MONGO_CONNECTION_STRING` | Full connection string | +| `AZURE_DOCUMENTDB_COLLECTION` | — | Collection name (agents parameterize this) | +| `AZURE_DOCUMENTDB_INDEX_NAME` | — | Vector index name (agents parameterize this) | +| `VECTOR_INDEX_ALGORITHM` | `ALGORITHM` | Default: `vector-ivf` | +| `VECTOR_SIMILARITY` | `SIMILARITY` | Default: `COS` | +| `USE_PASSWORDLESS` | — | `true`/`false` toggle for auth mode | +| `DEBUG` | — | `true`/`false` verbose logging | +| `QUERY` | — | Default: `"quintessential lodging near running trails, eateries, retail"` | +| `NEAREST_NEIGHBORS` | — | Default: `5` | + +**Agent-only variables (no quickstart equivalent):** + +| Variable | Default | Purpose | +|----------|---------|---------| +| `AZURE_OPENAI_EMBEDDING_DEPLOYMENT` | (required) | Embedding model deployment name | +| `AZURE_OPENAI_EMBEDDING_API_VERSION` | 2024-06-01 (Go), 2023-05-15 (TS) | Embedding API version | +| `AZURE_OPENAI_PLANNER_DEPLOYMENT` / `AZURE_OPENAI_PLANNER_MODEL` | (required) | Planner LLM deployment | +| `AZURE_OPENAI_PLANNER_API_VERSION` | (required) | Planner API version | +| `AZURE_OPENAI_SYNTH_DEPLOYMENT` / `AZURE_OPENAI_SYNTH_MODEL` | (required) | Synthesizer LLM deployment | +| `AZURE_OPENAI_SYNTH_API_VERSION` | (required) | Synthesizer API version | +| `IVF_NUM_LISTS` | 10 | IVF numLists (⚠️ differs from quickstart default of 1) | +| `HNSW_M` | 16 | HNSW m parameter | +| `HNSW_EF_CONSTRUCTION` | 64 | HNSW efConstruction parameter | +| `DISKANN_MAX_DEGREE` | 20 | DiskANN maxDegree parameter | +| `DISKANN_L_BUILD` | 10 | DiskANN lBuild 
parameter | + +## Agent Authentication + +Agents support passwordless (OIDC) and API key auth, toggled by `USE_PASSWORDLESS`. + +**OIDC scopes:** +- DocumentDB: `https://ossrdbms-aad.database.windows.net/.default` +- Azure OpenAI: `https://cognitiveservices.azure.com/.default` + +**MongoDB URI (passwordless):** `mongodb+srv://{cluster}.global.mongocluster.cosmos.azure.com/` +- Auth mechanism: `MONGODB-OIDC` with machine callback + +## Language-Specific SDK Stacks + +| Language | MongoDB | OpenAI | Agent Framework | +|----------|---------|--------|-----------------| +| Go | `go.mongodb.org/mongo-driver` (raw) | `github.com/openai/openai-go/v3` (raw) | Manual tool-calling loop | +| TypeScript | `mongodb` (cleanup only) | `@langchain/openai` | `langchain` + `@langchain/azure-cosmosdb` + `zod` | + +**TypeScript agents use LangChain** — the `@langchain/azure-cosmosdb` package manages the vector store, and `langchain`'s `createAgent` handles tool orchestration. This is a fundamentally different SDK stack from the quickstart TypeScript samples which use the raw MongoDB driver. + +**Go agents use raw SDKs** — both MongoDB driver and OpenAI SDK are used directly, with manual tool-calling implementation. + +## IVF numLists Discrepancy + +Agent samples default to `IVF_NUM_LISTS=10`. Quickstart samples (vector-search, select-algorithm) hardcode `numLists=1`. This is intentional — agent samples are designed for tunable, production-like configurations while quickstart samples use minimal values for simplicity. diff --git a/.github/instructions/cli-examples.instructions.md b/.github/instructions/cli-examples.instructions.md new file mode 100644 index 0000000..678fba3 --- /dev/null +++ b/.github/instructions/cli-examples.instructions.md @@ -0,0 +1,138 @@ +--- +applyTo: "ai/**" +--- +# Running Samples — CLI Invocation + +Environment variables are passed inline with the run command. Do NOT use `.env` files. 
Each example below shows the required variables for a vector-search quickstart sample. + +> **Note:** Examples show `ivf` but the same pattern applies to all algorithms — replace `ivf` with `hnsw` or `diskann` in file/class names. + +## Go + +**Bash:** +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +go run ./src/ivf.go +``` + +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +go run ./src/ivf.go +``` + +## Python + +**Bash:** +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +python src/ivf.py +``` + +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +python src/ivf.py +``` + +## TypeScript/Node.js + +**Bash:** +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +npx tsx src/ivf.ts +``` + +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" +$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +npx tsx src/ivf.ts +``` + +## Java + +**Bash:** +```bash +MONGO_CLUSTER_NAME=myCluster \ +AZURE_OPENAI_EMBEDDING_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_MODEL=text-embedding-ada-002 \ +mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF" +``` + +**PowerShell:** +```powershell +$env:MONGO_CLUSTER_NAME="myCluster" 
+$env:AZURE_OPENAI_EMBEDDING_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-ada-002" +mvn compile exec:java -Dexec.mainClass="com.azure.documentdb.sample.IVF" +``` + +## .NET + +.NET uses `appsettings.json` for configuration, but environment variables can override: + +**Bash:** +```bash +DocumentDB__ClusterName=myCluster \ +AzureOpenAI__Endpoint=https://myendpoint.openai.azure.com/ \ +AzureOpenAI__DeploymentName=text-embedding-ada-002 \ +dotnet run +``` + +**PowerShell:** +```powershell +$env:DocumentDB__ClusterName="myCluster" +$env:AzureOpenAI__Endpoint="https://myendpoint.openai.azure.com/" +$env:AzureOpenAI__DeploymentName="text-embedding-ada-002" +dotnet run +``` + +## Agent Samples (Multi-LLM) + +Agent samples require more variables for the planner and synthesizer deployments: + +**Bash:** +```bash +AZURE_OPENAI_ENDPOINT=https://myendpoint.openai.azure.com/ \ +AZURE_OPENAI_EMBEDDING_DEPLOYMENT=text-embedding-ada-002 \ +AZURE_OPENAI_EMBEDDING_API_VERSION=2024-06-01 \ +AZURE_OPENAI_PLANNER_DEPLOYMENT=gpt-4o \ +AZURE_OPENAI_PLANNER_API_VERSION=2024-06-01 \ +AZURE_OPENAI_SYNTH_DEPLOYMENT=gpt-4o \ +AZURE_OPENAI_SYNTH_API_VERSION=2024-06-01 \ +AZURE_DOCUMENTDB_CLUSTER=myCluster \ +AZURE_DOCUMENTDB_DATABASENAME=Hotels \ +AZURE_DOCUMENTDB_COLLECTION=hotels \ +AZURE_DOCUMENTDB_INDEX_NAME=vectorIndex \ +USE_PASSWORDLESS=true \ +go run ./cmd/agent/main.go +``` + +**PowerShell:** +```powershell +$env:AZURE_OPENAI_ENDPOINT="https://myendpoint.openai.azure.com/" +$env:AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-ada-002" +$env:AZURE_OPENAI_EMBEDDING_API_VERSION="2024-06-01" +$env:AZURE_OPENAI_PLANNER_DEPLOYMENT="gpt-4o" +$env:AZURE_OPENAI_PLANNER_API_VERSION="2024-06-01" +$env:AZURE_OPENAI_SYNTH_DEPLOYMENT="gpt-4o" +$env:AZURE_OPENAI_SYNTH_API_VERSION="2024-06-01" +$env:AZURE_DOCUMENTDB_CLUSTER="myCluster" +$env:AZURE_DOCUMENTDB_DATABASENAME="Hotels" +$env:AZURE_DOCUMENTDB_COLLECTION="hotels" 
+$env:AZURE_DOCUMENTDB_INDEX_NAME="vectorIndex" +$env:USE_PASSWORDLESS="true" +go run ./cmd/agent/main.go +``` diff --git a/.github/instructions/execution-patterns.instructions.md b/.github/instructions/execution-patterns.instructions.md new file mode 100644 index 0000000..d97db20 --- /dev/null +++ b/.github/instructions/execution-patterns.instructions.md @@ -0,0 +1,53 @@ +--- +applyTo: "ai/vector-search-*/**" +--- +# Sample Execution Patterns + +## Authentication + +All samples support two authentication modes. **Passwordless (OIDC) is preferred.** + +### Passwordless Authentication (Recommended) +- Uses `DefaultAzureCredential` / OIDC with `MONGO_CLUSTER_NAME` +- Connection URI format: `mongodb+srv://{clusterName}.global.mongocluster.cosmos.azure.com/` +- OIDC token scope: `https://ossrdbms-aad.database.windows.net/.default` +- Each language implements a utility function pair: `getClients()` and `getClientsPasswordless()` + +### Connection String Authentication +- Uses `MONGO_CONNECTION_STRING` with username/password +- Format: `mongodb+srv://username:password@{cluster}.mongocluster.cosmos.azure.com/?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000` + +> **Note:** `mongocluster.cosmos.azure.com` is the current DocumentDB hostname — this is NOT a Cosmos DB reference. + +## Sample Execution Pattern + +All vector search samples follow this consistent lifecycle: + +1. **Initialize clients** — Create MongoDB and Azure OpenAI clients (passwordless preferred) +2. **Drop collection** — Drop the algorithm-specific collection if it exists (clean start) +3. **Create collection** — Create a fresh collection +4. **Load data** — Read `Hotels_Vector.json` and batch-insert documents +5. **Create vector index** — Create algorithm-specific vector index using `createIndexes` command with `cosmosSearch` key type +6. **Generate query embedding** — Embed the search query text using Azure OpenAI +7. 
**Perform vector search** — Run `$search` aggregation pipeline with `cosmosSearch` operator +8. **Print results** — Display `HotelName` and `score` for top results +9. **Cleanup** — Drop the collection in a `finally`/`defer` block + +### Naming Conventions +- **Collection names:** `hotels_{algorithm}` — e.g., `hotels_ivf`, `hotels_hnsw`, `hotels_diskann` +- **Index names:** `vectorIndex_{algorithm}` — e.g., `vectorIndex_ivf`, `vectorIndex_hnsw`, `vectorIndex_diskann` +- **Database name:** `Hotels` (hardcoded, matches `AZURE_DOCUMENTDB_DATABASENAME` default) + +### Standard Search Query +All samples use the same query text: `"quintessential lodging near running trails, eateries, retail"` + +### Vector Search Pipeline Structure +All languages use the same aggregation pipeline structure: +``` +[ + { "$search": { "cosmosSearch": { "vector": <queryEmbedding>, "path": "DescriptionVector", "k": 5 } } }, + { "$project": { "score": { "$meta": "searchScore" }, "document": "$$ROOT" } } +] +``` + +> **Note:** `cosmosSearch` is a valid MongoDB API command name for DocumentDB — this is NOT a Cosmos DB reference.
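
As a rough illustration of the pipeline structure and the "Print results" step above, the following Python sketch builds the two-stage aggregation as plain dictionaries and formats hits for display. The function names `build_vector_search_pipeline` and `format_results` are hypothetical and do not come from the samples. The `path` and `k` defaults follow the documented `DescriptionVector` field and top-5 convention, and the four-decimal score format is an arbitrary choice.

```python
from typing import Sequence


def build_vector_search_pipeline(
    query_vector: Sequence[float],
    path: str = "DescriptionVector",
    k: int = 5,
) -> list[dict]:
    """Hypothetical helper: return the two-stage pipeline described above."""
    return [
        # Stage 1: vector search against the embedded field.
        {"$search": {"cosmosSearch": {"vector": list(query_vector), "path": path, "k": k}}},
        # Stage 2: expose the similarity score next to the full document.
        {"$project": {"score": {"$meta": "searchScore"}, "document": "$$ROOT"}},
    ]


def format_results(results) -> list[str]:
    """Hypothetical helper: one line per hit, showing HotelName and score."""
    return [f'{hit["document"]["HotelName"]}: {hit["score"]:.4f}' for hit in results]
```

With PyMongo, the returned list would typically be passed to `collection.aggregate(...)`, and the resulting cursor handed to `format_results` for printing.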