Add incremental embeddings mode to reduce costs #737
Currently the search engine population workflow recreates all embeddings every run, which is expensive. This adds an incremental mode that:

- Tracks document IDs in a HuggingFace dataset (`huggingface/doc-builder-embeddings-tracker`)
- Checks existing IDs before processing and skips already-embedded docs
- Uploads directly to the main index in incremental mode (no swap needed)
- Falls back to a full rebuild when triggered manually with `full_rebuild=true`

Files added:

- `migrations/create_embeddings_dataset.py`: One-time migration to init the tracker dataset
- `src/doc_builder/embeddings_tracker.py`: Helper functions for tracking

Files modified:

- `src/doc_builder/commands/embeddings.py`: Added `--incremental` and `--hf_token` flags
- `.github/workflows/populate_search_engine.yml`: Default to incremental mode
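The skip step above can be sketched as a small filter. This is a minimal illustration, not the PR's actual code: `select_new_docs` is a hypothetical helper name, and it assumes each document dict already carries its deterministic `id`:

```python
def select_new_docs(docs, tracked_ids):
    """Return only documents whose ID is not already in the tracker.

    `docs` are dicts carrying a precomputed deterministic "id";
    `tracked_ids` is the ID list read from the tracker dataset.
    """
    tracked = set(tracked_ids)  # set for O(1) membership checks
    return [doc for doc in docs if doc["id"] not in tracked]
```

In a full rebuild, the tracker would simply be treated as empty, so every document is re-embedded.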
The `add-gradio-docs` command now also supports an `--incremental` flag to only process new/changed Gradio documentation, consistent with the main `populate-search-engine` command.
Changed the dataset creation path from `huggingface/doc-builder-embeddings-tracker` to `hf-doc-build/doc-builder-embeddings-tracker` for consistency with the new repository structure.
When a page is updated, the old version's document ID (with the old content hash) remains in Meilisearch. This change:

- Adds `find_stale_ids()` to identify old IDs that should be removed
- Adds `delete_documents_by_ids()` to remove specific documents from Meilisearch
- Updates incremental mode to delete stale entries before updating the tracker
- The tracker dataset now also removes stale IDs when updating
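A sketch of what `find_stale_ids()` might look like, assuming IDs follow the `{library}-{page}-{hash[:8]}` scheme, so a stale ID is one whose library/page prefix reappears with a different content hash (the real implementation may differ):

```python
def find_stale_ids(tracked_ids, new_ids):
    """Find tracked IDs whose page now carries a different content hash.

    IDs look like "{library}-{page}-{hash8}": stripping the trailing
    hash segment yields a stable key for the page itself, even when
    the page name contains hyphens.
    """
    def page_key(doc_id):
        return doc_id.rsplit("-", 1)[0]  # drop only the hash segment

    updated_pages = {page_key(doc_id) for doc_id in new_ids}
    return [
        doc_id for doc_id in tracked_ids
        if page_key(doc_id) in updated_pages and doc_id not in new_ids
    ]
```

The returned IDs can then be passed to a deletion helper such as the PR's `delete_documents_by_ids()` and removed from the tracker dataset.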
Summary
Adds an incremental embeddings mode to the `populate-search-engine` and `add-gradio-docs` commands.

How it works
Document IDs are generated deterministically:
`{library}-{page}-{sha256_hash_of_text[:8]}`

In incremental mode:

- Existing IDs are checked against the `hf-doc-build/doc-builder-embeddings-tracker` dataset, and already-embedded docs are skipped
- Updates are uploaded directly to the main index (no swap needed)
- Stale IDs for updated pages are deleted from Meilisearch and removed from the tracker

Files
New files:
- `migrations/init_embeddings_tracker.py` - Initializes the tracker dataset by reconstructing IDs from `hf-doc-build/doc-build` (no Meilisearch needed)
- `src/doc_builder/embeddings_tracker.py` - Helper functions for tracking document IDs

Modified files:
- `src/doc_builder/commands/embeddings.py` - Added `--incremental` and `--hf_token` flags to both commands
- `src/doc_builder/build_embeddings.py` - Updated `add_gradio_docs` to support incremental mode
- `.github/workflows/populate_search_engine.yml` - Default to incremental mode, with a `full_rebuild` input option

Usage
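The deterministic ID scheme described under "How it works" can be sketched as follows; `make_doc_id` is a hypothetical helper name, not necessarily the one used in this PR:

```python
import hashlib


def make_doc_id(library: str, page: str, text: str) -> str:
    """Build an ID of the form {library}-{page}-{sha256_hash_of_text[:8]}.

    Identical page content always maps to the same ID, so a re-run can
    recognize already-embedded documents; changed content produces a new
    hash suffix, which is how stale IDs for updated pages are detected.
    """
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{library}-{page}-{text_hash}"
```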
Initialize tracker (one-time setup - already done):
Incremental update (default in workflow):
Full rebuild (when needed):
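The three usage steps above might look as follows. The command names and flags come from this PR's description, but the exact CLI entry point is an assumption, so check the repo before copying:

```shell
# One-time tracker initialization (already done for this repo);
# the migration script path is listed under "New files" above:
python migrations/init_embeddings_tracker.py

# Incremental update (the workflow default); --incremental and
# --hf_token come from this PR, the "doc-builder" entry point is assumed:
doc-builder populate-search-engine --incremental --hf_token "$HF_TOKEN"

# Full rebuild when needed (omit --incremental):
doc-builder populate-search-engine
```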
Or trigger the workflow with `full_rebuild: true`.

Setup required
An `HF_TOKEN` secret needs to be added to the repo for updating the tracker dataset.

Test plan
- Run with `full_rebuild=true` - verify a full rebuild still works