
Add incremental embeddings mode to reduce costs#737

Open
mishig25 wants to merge 16 commits into main from incremental-embeddings

mishig25 commented Jan 27, 2026

Summary

  • Adds an incremental mode to the populate-search-engine and add-gradio-docs commands
  • Processes only new/changed documents by tracking document IDs in a Hugging Face dataset
  • Reduces embedding inference costs by not reprocessing unchanged content
  • The workflow now defaults to incremental mode, with an option for a full rebuild when needed

How it works

Document IDs are generated deterministically: {library}-{page}-{sha256_hash_of_text[:8]}
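
A minimal sketch of the ID scheme above, for illustration only. The function name make_document_id is hypothetical, and it assumes the hash is the first 8 hex characters of the SHA-256 digest of the UTF-8 encoded chunk text; the PR's actual helper may differ.

```python
import hashlib


def make_document_id(library: str, page: str, text: str) -> str:
    """Build a deterministic ID of the form {library}-{page}-{hash8}.

    Hypothetical helper: assumes the 8-character suffix is a prefix of the
    hex SHA-256 digest of the chunk text.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{library}-{page}-{digest[:8]}"


# Same inputs always give the same ID; any change to the text changes the suffix.
unchanged = make_document_id("transformers", "quicktour", "Install the library.")
edited = make_document_id("transformers", "quicktour", "Install the library first.")
print(unchanged == edited)  # False
```

Because the ID depends only on the library, page, and text, it can be recomputed from the source docs without calling the embedding model.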

In incremental mode:

  1. Fetches existing document IDs from the hf-doc-build/doc-builder-embeddings-tracker dataset
  2. Reconstructs IDs for all current chunks from the source docs (fast, no embedding needed)
  3. Compares the two sets to find new/changed chunks (different hash = different content)
  4. Generates embeddings only for the new chunks
  5. Uploads directly to the main Meilisearch index (upsert behavior)
  6. Updates the tracker dataset with the new IDs
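
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: incremental_update is a hypothetical name, the embedding model is passed in as a plain callable, and an in-memory dict stands in for the Meilisearch index and its upsert behavior.

```python
from __future__ import annotations

from typing import Callable


def incremental_update(
    tracked_ids: set[str],             # step 1: IDs already in the tracker
    current_chunks: dict[str, str],    # step 2: reconstructed ID -> chunk text
    embed: Callable[[str], list[float]],
    index: dict[str, dict],            # stand-in for the search index
) -> set[str]:
    """Embed and upsert only chunks whose ID is not tracked yet."""
    # Step 3: a chunk is new or changed if its ID is unknown to the tracker.
    new_ids = current_chunks.keys() - tracked_ids
    for doc_id in sorted(new_ids):
        text = current_chunks[doc_id]
        # Steps 4-5: embed the chunk and upsert it, keyed by document ID.
        index[doc_id] = {"id": doc_id, "text": text, "embedding": embed(text)}
    # Step 6: the tracker now also knows the new IDs.
    return tracked_ids | new_ids


calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)  # record which chunks reached the "model"
    return [float(len(text))]


index: dict[str, dict] = {}
tracked = incremental_update(
    {"lib-page-aaaaaaaa"},
    {"lib-page-aaaaaaaa": "old chunk", "lib-page-bbbbbbbb": "new chunk"},
    fake_embed,
    index,
)
print(calls)  # ['new chunk']
```

Only the untracked chunk reaches the embedding callable, which is where the cost saving comes from.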

Files

New files:

  • migrations/init_embeddings_tracker.py - Initializes tracker dataset by reconstructing IDs from hf-doc-build/doc-build (no Meilisearch needed)
  • src/doc_builder/embeddings_tracker.py - Helper functions for tracking document IDs

Modified files:

  • src/doc_builder/commands/embeddings.py - Added --incremental and --hf_token flags to both commands
  • src/doc_builder/build_embeddings.py - Updated add_gradio_docs to support incremental mode
  • .github/workflows/populate_search_engine.yml - Default to incremental mode, with full_rebuild input option

Usage

Initialize tracker (one-time setup - already done):

uv run python migrations/init_embeddings_tracker.py --hf_token <token>

Incremental update (default in workflow):

uv run doc-builder populate-search-engine --incremental
uv run doc-builder add-gradio-docs --incremental

Full rebuild (when needed):

uv run doc-builder populate-search-engine
uv run doc-builder add-gradio-docs

Or trigger workflow with full_rebuild: true.

Setup required

  • HF_TOKEN secret needs to be added to the repo for updating the tracker dataset

Test plan

  • Run migration script to create initial tracker dataset (48,401 document IDs)
  • Run workflow in incremental mode - verify it skips existing docs
  • Modify a doc and run again - verify only the changed doc is processed
  • Run with full_rebuild=true - verify full rebuild still works

Currently the search engine population workflow recreates all embeddings
every run, which is expensive. This adds an incremental mode that:

- Tracks document IDs in a HuggingFace dataset (huggingface/doc-builder-embeddings-tracker)
- Checks existing IDs before processing and skips already-embedded docs
- Uploads directly to main index in incremental mode (no swap needed)
- Falls back to full rebuild when triggered manually with full_rebuild=true

Files added:
- migrations/create_embeddings_dataset.py: One-time migration to init tracker dataset
- src/doc_builder/embeddings_tracker.py: Helper functions for tracking

Files modified:
- src/doc_builder/commands/embeddings.py: Added --incremental and --hf_token flags
- .github/workflows/populate_search_engine.yml: Default to incremental mode

The add-gradio-docs command now also supports the --incremental flag to only
process new/changed Gradio documentation, consistent with the main
populate-search-engine command.

Changed the dataset creation path from 'huggingface/doc-builder-embeddings-tracker' to 'hf-doc-build/doc-builder-embeddings-tracker' for consistency with the new repository structure.

mishig25 force-pushed the incremental-embeddings branch from d373cfa to eda4f6d on January 27, 2026 12:37

When a page is updated, the old version's document ID (with old content hash)
remains in Meilisearch. This change:

- Adds find_stale_ids() to identify old IDs that should be removed
- Adds delete_documents_by_ids() to remove specific documents from Meilisearch
- Updates incremental mode to delete stale entries before updating tracker
- Tracker dataset now also removes stale IDs when updating
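
A rough sketch of the stale-ID cleanup described above. The names stale_ids_sketch and apply_cleanup are hypothetical and are not the signatures of find_stale_ids() or delete_documents_by_ids() in this PR. It assumes the set of current IDs covers everything the tracker knows about; if only some libraries are rebuilt in a run, the comparison would need to be restricted to those libraries first.

```python
from __future__ import annotations


def stale_ids_sketch(tracked_ids: set[str], current_ids: set[str]) -> set[str]:
    """IDs the tracker still holds but that no current chunk produces."""
    return tracked_ids - current_ids


def apply_cleanup(
    tracked_ids: set[str],
    current_ids: set[str],
    index: dict[str, dict],  # stand-in for the search index
) -> set[str]:
    """Delete stale documents from the index and return the updated tracker."""
    stale = stale_ids_sketch(tracked_ids, current_ids)
    for doc_id in stale:
        index.pop(doc_id, None)  # delete by ID, ignoring already-missing docs
    # Drop stale IDs from the tracker and add the current ones.
    return (tracked_ids - stale) | current_ids


# A page was edited: its old hash suffix disappears and a new one appears.
index = {"lib-page-11111111": {"id": "lib-page-11111111"}}
tracked = apply_cleanup({"lib-page-11111111"}, {"lib-page-22222222"}, index)
print(sorted(tracked))  # ['lib-page-22222222']
```

Without this step, an edited page would leave both its old and new chunks searchable, since the two versions have different IDs.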

mishig25 force-pushed the incremental-embeddings branch from 1338a52 to fa960ac on January 27, 2026 12:43

mishig25 force-pushed the incremental-embeddings branch from c20d10b to 83baf82 on January 27, 2026 13:13