Improve ChromaDB loader with sentence-transformers backend #67
Open
freunda wants to merge 4 commits into
Adds a new chroma_loader module that improves on the legacy govt_data_loader with modern embedding capabilities while maintaining backward compatibility.

Key improvements:
- sentence-transformers backend for cleaner API and better batching performance
- Configurable max_length (1024 vs hardcoded 512) for better passage embeddings
- Configurable batch_size with auto-tuning support
- Generic load_or_build_chroma() supporting MT-RAG corpora and HuggingFace datasets
- Flexible filter_ids parameter for document filtering
- Eager loading architecture for clear upfront waiting time

Backward compatibility:
- load_or_build_govt_chroma() wrapper preserves exact API
- load_only_tutorial_docs parameter maintained for T4/CPU-friendly subset
- device parameter preserved for explicit CPU/GPU control
- TUTORIAL_DOC_IDS constant maintained (177 docs)
- No max_docs limit (unlike PR #58 which hardcoded 2000)

Changes:
- Add src/granite_switch/tutorials/chroma_loader.py with new implementation
- Update pyproject.toml with sentence-transformers>=3.0.0 and datasets>=2.0.0
- Update rag_101.ipynb and rag_flow.ipynb to import from new module
- Add deprecation warning to legacy govt_data_loader.py

Based on improvements from PR #58 with adjustments for project requirements.
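For reviewers, the deprecation warning on the legacy module could look roughly like the sketch below. It uses only the standard-library warnings module. The function name load_or_build_govt_chroma_legacy, its body, and the message wording are assumptions for illustration, not the code in this PR.

```python
import warnings


def load_or_build_govt_chroma_legacy(*args, **kwargs):
    """Hypothetical stand-in for the legacy loader entry point."""
    warnings.warn(
        "govt_data_loader is deprecated since 0.2.0; "
        "import load_or_build_govt_chroma from chroma_loader instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this shim
    )
    # A real shim would delegate to the new implementation here; the
    # delegation target and its signature are not shown in this excerpt,
    # so this sketch simply returns None.
    return None
```

Using stacklevel=2 makes the warning point at the notebook or script that still imports the old module, which is usually the line that needs to change.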
antonpibm
reviewed
May 24, 2026
@@ -1,12 +1,30 @@
"""Load or build the ChromaDB corpus for the govt RAG tutorial.

.. deprecated:: 0.2.0
Collaborator
Should we really keep this file?
added 3 commits
May 24, 2026 20:25
Fixes TypeError when batch_size is None by omitting the parameter to let sentence-transformers use auto-tuning.
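A minimal sketch of the fix described in this commit, assuming the loader forwards keyword arguments to the model's encode() method: when batch_size is None the key is left out entirely, so the library falls back to its own default instead of receiving None. The helper names build_encode_kwargs and embed_texts are hypothetical and are not taken from chroma_loader.py.

```python
from typing import Any, Dict, Optional, Sequence


def build_encode_kwargs(batch_size: Optional[int] = None) -> Dict[str, Any]:
    """Return keyword arguments for encode(), omitting batch_size when unset."""
    kwargs: Dict[str, Any] = {}
    if batch_size is not None:
        # Forward the value only when the caller actually chose one.
        kwargs["batch_size"] = batch_size
    return kwargs


def embed_texts(model: Any, texts: Sequence[str], batch_size: Optional[int] = None):
    """Encode texts with any object exposing encode(sentences, **kwargs)."""
    return model.encode(list(texts), **build_encode_kwargs(batch_size))
```

Building the kwargs in one place keeps the None check out of every call site and makes the behaviour easy to test with a stub model.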
Set max_seq_length on the model itself instead of passing it to encode(). Newer versions of sentence-transformers don't accept max_seq_length as an encode() parameter, which caused a ValueError when it was passed in kwargs. The value is now set as a model attribute after instantiation.
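The approach in this commit can be sketched as follows: the sequence limit is assigned to the model's max_seq_length attribute once, right after the model is created, and encode() is then called without it. The helper name configure_max_length and the 1024 default are assumptions for illustration.

```python
from typing import Any


def configure_max_length(model: Any, max_length: int = 1024) -> Any:
    """Set the truncation length on the model rather than per encode() call."""
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    # The limit lives on the model object, so later encode() calls
    # need no extra keyword argument.
    model.max_seq_length = max_length
    return model


# Intended use with the real library (not executed here; model_name is a placeholder):
#   from sentence_transformers import SentenceTransformer
#   model = configure_max_length(SentenceTransformer(model_name), 1024)
#   embeddings = model.encode(passages)
```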