Add incremental embeddings mode to reduce costs #737
Currently the search engine population workflow recreates all embeddings every run, which is expensive. This adds an incremental mode that:

- Tracks document IDs in a HuggingFace dataset (`huggingface/doc-builder-embeddings-tracker`)
- Checks existing IDs before processing and skips already-embedded docs
- Uploads directly to the main index in incremental mode (no swap needed)
- Falls back to a full rebuild when triggered manually with `full_rebuild=true`

Files added:

- `migrations/create_embeddings_dataset.py`: One-time migration to init the tracker dataset
- `src/doc_builder/embeddings_tracker.py`: Helper functions for tracking

Files modified:

- `src/doc_builder/commands/embeddings.py`: Added `--incremental` and `--hf_token` flags
- `.github/workflows/populate_search_engine.yml`: Default to incremental mode
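The skip step above can be sketched as a small filter. This is a minimal illustration, not the PR's actual code: `select_new_docs` is a hypothetical helper name, and it assumes each document dict already carries its deterministic `id`:

```python
def select_new_docs(docs, tracked_ids):
    """Return only documents whose ID is not already in the tracker.

    `docs` are dicts carrying a precomputed deterministic "id";
    `tracked_ids` is the ID list read from the tracker dataset.
    """
    tracked = set(tracked_ids)  # set for O(1) membership checks
    return [doc for doc in docs if doc["id"] not in tracked]
```

In a full rebuild, the tracker would simply be treated as empty, so every document is re-embedded.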
The `add-gradio-docs` command now also supports an `--incremental` flag to only process new/changed Gradio documentation, consistent with the main `populate-search-engine` command.
Changed the dataset creation path from `huggingface/doc-builder-embeddings-tracker` to `hf-doc-build/doc-builder-embeddings-tracker` for consistency with the new repository structure.
When a page is updated, the old version's document ID (with the old content hash) remains in Meilisearch. This change:

- Adds `find_stale_ids()` to identify old IDs that should be removed
- Adds `delete_documents_by_ids()` to remove specific documents from Meilisearch
- Updates incremental mode to delete stale entries before updating the tracker
- The tracker dataset now also removes stale IDs when updating
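A sketch of what `find_stale_ids()` might look like, assuming IDs follow the `{library}-{page}-{hash[:8]}` scheme, so a stale ID is one whose library/page prefix reappears with a different content hash (the real implementation may differ):

```python
def find_stale_ids(tracked_ids, new_ids):
    """Find tracked IDs whose page now carries a different content hash.

    IDs look like "{library}-{page}-{hash8}": stripping the trailing
    hash segment yields a stable key for the page itself, even when
    the page name contains hyphens.
    """
    def page_key(doc_id):
        return doc_id.rsplit("-", 1)[0]  # drop only the hash segment

    updated_pages = {page_key(doc_id) for doc_id in new_ids}
    return [
        doc_id for doc_id in tracked_ids
        if page_key(doc_id) in updated_pages and doc_id not in new_ids
    ]
```

The returned IDs can then be passed to a deletion helper such as the PR's `delete_documents_by_ids()` and removed from the tracker dataset.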
Summary
Adds an incremental embeddings mode to the `populate-search-engine` and `add-gradio-docs` commands.

How it works
Document IDs are generated deterministically:
`{library}-{page}-{sha256_hash_of_text[:8]}`

In incremental mode:

- Existing IDs are checked against the `hf-doc-build/doc-builder-embeddings-tracker` dataset, and already-embedded docs are skipped
- Updates are uploaded directly to the main index (no swap needed)
- Stale IDs for updated pages are deleted from Meilisearch and removed from the tracker

Files
New files:
- `migrations/init_embeddings_tracker.py` - Initializes the tracker dataset by reconstructing IDs from `hf-doc-build/doc-build` (no Meilisearch needed)
- `src/doc_builder/embeddings_tracker.py` - Helper functions for tracking document IDs

Modified files:
- `src/doc_builder/commands/embeddings.py` - Added `--incremental` and `--hf_token` flags to both commands
- `src/doc_builder/build_embeddings.py` - Updated `add_gradio_docs` to support incremental mode
- `.github/workflows/populate_search_engine.yml` - Default to incremental mode, with a `full_rebuild` input option

Usage
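The deterministic ID scheme described under "How it works" can be sketched as follows; `make_doc_id` is a hypothetical helper name, not necessarily the one used in this PR:

```python
import hashlib


def make_doc_id(library: str, page: str, text: str) -> str:
    """Build an ID of the form {library}-{page}-{sha256_hash_of_text[:8]}.

    Identical page content always maps to the same ID, so a re-run can
    recognize already-embedded documents; changed content produces a new
    hash suffix, which is how stale IDs for updated pages are detected.
    """
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{library}-{page}-{text_hash}"
```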
Initialize tracker (one-time setup - already done):
Incremental update (default in workflow):
Full rebuild (when needed):
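The three usage steps above might look as follows. The command names and flags come from this PR's description, but the exact CLI entry point is an assumption, so check the repo before copying:

```shell
# One-time tracker initialization (already done for this repo);
# the migration script path is listed under "New files" above:
python migrations/init_embeddings_tracker.py

# Incremental update (the workflow default); --incremental and
# --hf_token come from this PR, the "doc-builder" entry point is assumed:
doc-builder populate-search-engine --incremental --hf_token "$HF_TOKEN"

# Full rebuild when needed (omit --incremental):
doc-builder populate-search-engine
```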
Or trigger the workflow with `full_rebuild: true`.

Setup required
An `HF_TOKEN` secret needs to be added to the repo for updating the tracker dataset.

Test plan
- Run with `full_rebuild=true` - verify a full rebuild still works