diff --git a/.gitignore b/.gitignore
index b2afe2a..84ab1c7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,6 +3,8 @@ tmp/
 node_modules/
 dist/
 data/
+!comparison/data/
+
 *~
 .yarn
diff --git a/CLAUDE-ai-provider-comparison.md b/CLAUDE-ai-provider-comparison.md
index 0b38f84..5f4e750 100644
--- a/CLAUDE-ai-provider-comparison.md
+++ b/CLAUDE-ai-provider-comparison.md
@@ -15,8 +15,7 @@ This work will be done within the existing `expression-shepherd` repository, lev
 ### Access Details:
 - API endpoint: `/wdk-service/record-types/gene/searches/single_record_question_GeneRecordClasses_GeneRecordClass/reports/aiExpression`
-- Proxy setup: Available for bmaccallum* sites (details TBD from user)
-- Local development: Proxy endpoint is `http://localhost:8080/wdk-service/...` when using webpack local server
+- Authentication: Cookie-based authentication using `auth_tkt` cookie (see Appendix for details)
 ## Gene Set
 - Format: VectorBase IDs (e.g., AGAP001234)
@@ -25,33 +24,100 @@ This work will be done within the existing `expression-shepherd` repository, lev
 ## Workflow
-### Phase 1: Trigger Summary Generation & Fetch Results
-1. Read gene list from input file (format TBD by user)
-2. For each gene, make API calls to all three sites:
- - Initial request with `populateIfNotPresent: true` to trigger generation
- - Poll with `populateIfNotPresent: false` until `resultStatus: "present"`
- - See `useAiExpressionSummary` in [AiExpressionSummary.tsx](https://raw.githubusercontent.com/VEuPathDB/web-monorepo/refs/heads/main/packages/sites/genomics-site/webapp/wdkCustomization/js/client/components/records/AiExpressionSummary.tsx) for orchestration details
+### Phase 0: Setup & Configuration
+
+**Gene List (`comparison/input/gene-list.txt`):**
+- Format: One gene ID per line (e.g., `AGAP001234`)
+- Optional: Add gene name/description after ID (space-separated): `AGAP001234 ABC1 transporter protein`
+- Comments: Lines starting with `#` are ignored (useful for temporarily excluding genes)
+
+**Sites Configuration (`comparison/config/sites.json`):**
+- **`analysis_model`**: Configures which AI model performs the comparison analysis (Phases 2-3)
+ - `model_string`: Full model identifier (e.g., `"claude-sonnet-4-20250514"` or `"gpt-4o-2024-11-20"`)
+ - `name`: Short name for output directory organization (e.g., `"claude4"`, `"gpt4o"`, `"gpt5"`)
+ - `platform`: Either `"anthropic"` or `"openai"`
+ - Note: This is independent of the models being compared (which are defined in `sites`)
+- **`sites`**: Array of model endpoints to compare (VEuPathDB dev sites)
+ - Each site has: `name`, `hostname`, `appPath`, `model` (display name)
+ - Set `skip: true` to temporarily exclude a site from processing
+- **`endpoint`**: VEuPathDB API endpoint path
+- **`projectId`**: VEuPathDB project (e.g., `"VectorBase"`)
+
+### Phase 1: Fetch Summary Results
+1. Read gene list from input file
+2. For each gene, make API calls to all three sites with `populateIfNotPresent: true`
+ - Summaries are typically cached server-side from previous generations
+ - If fetch fails (summary not ready), script will report error at end
+ - Simply re-run `yarn comparison:fetch` to retry failed fetches
 3. Save JSON responses locally, organized by gene ID and model
 4. Track progress and any errors
+**Runtime Expectations:**
+- For 20 genes × 3 sites: typically completes quickly when summaries are cached
+- First-time generation may take longer due to:
+ - AI generation time (varies by model and gene complexity)
+ - Backend rate limiting (especially for Anthropic/Claude)
+ - Sequential processing per gene, parallel across sites
+- Failed fetches can be retried by re-running the script
+
 ### Phase 2: AI-Powered Comparison
 1. Create TypeScript scripts that use Anthropic API to compare summaries
-2. Perform pairwise comparisons for each gene:
- - Claude vs GPT-5
- - Claude vs GPT-4o
- - GPT-5 vs GPT-4o
+2. Perform bidirectional pairwise comparisons for each gene (to detect position bias):
+ - Claude vs GPT-5 (both directions)
+ - Claude vs GPT-4o (both directions)
+ - GPT-5 vs GPT-4o (both directions)
 3. Comparison dimensions:
- - Specific biological insights mentioned
+ - Specific biological observations and insights (with only_in_A, only_in_B, in_both categorization)
  - Tone and style
  - Level of technical detail
- - Length and structure
-4. Output: JSON format for each comparison
+ - Structure and organization
+ - Deterministic metrics (word count, topic count, etc.)
+ - Quantitative expression mentions (fold changes, TPM, percentiles)
+4. Output: JSON format for each bidirectional comparison (6 files per gene)
+
+### Phase 2.5: Condensation
+1. Merge bidirectional comparison pairs into condensed summaries
+2. For each model pair:
+ - Calculate biological content averages (observations/insights counts only, not full text)
+ - Calculate quantitative mention averages
+ - Use AI to merge qualitative assessments and detect position bias
+ - Preserve deterministic metrics from both models
+3. Output: Condensed JSON files (3 files per gene: claude4-gpt4o, claude4-gpt5, gpt4o-gpt5)
+4. Result: ~95% reduction in data volume by replacing detailed observation lists with summary counts
 ### Phase 3: Aggregate Analysis
-1. Collect all pairwise comparison JSONs
-2. Feed into second-pass AI summarization
-3. Identify themes and patterns across the gene set
-4. Generate final report on systematic differences
+1. For each model pair (claude4-gpt4o, claude4-gpt5, gpt4o-gpt5):
+ - Collect all 20 condensed comparison files for that pair
+ - Calculate aggregate statistics:
+ - Average unique observations/insights per model across all genes
+ - Average quantitative mentions per model
+ - Average deterministic metrics (word count, topic count, etc.)
+ - Position bias frequency (% of genes with AB/BA merge contradictions)
+ - Feed all qualitative assessments into AI for pattern identification:
+ - Consistent tone/style differences
+ - Consistent technical detail level differences
+ - Consistent organizational approach differences
+ - Any systematic contradictions or themes
+ - Generate model pair report with both quantitative stats and qualitative themes
+2. Output: One aggregate report per model pair (3 reports total)
+3. Model identities remain anonymous through aggregation; only revealed when saving final reports
+4. Optional future step: Super-aggregation comparing all 3 pairwise reports
+
+### Phase 4: HTML Report Generation
+1. Generate individual gene summary HTML pages for each gene × model combination
+ - Display gene headline, one-paragraph summary, and topics with expandable experiment details
+ - Sanitize AI-generated HTML to allow only safe formatting tags (``, ``, `