Improve ChromaDB loader with sentence-transformers backend by freunda · Pull Request #67 · generative-computing/granite-switch

freunda · 2026-05-24T15:16:50Z

Adds a new chroma_loader module that improves on the legacy govt_data_loader with modern embedding capabilities while maintaining backward compatibility.

Key improvements:

- sentence-transformers backend for cleaner API and better batching performance
- Configurable max_length (1024 vs hardcoded 512) for better passage embeddings
- Configurable batch_size with auto-tuning support
- Generic load_or_build_chroma() supporting MT-RAG corpora and HuggingFace datasets
- Flexible filter_ids parameter for document filtering
- Eager loading architecture for clear upfront waiting time

Backward compatibility:

- load_or_build_govt_chroma() wrapper preserves exact API
- load_only_tutorial_docs parameter maintained for T4/CPU-friendly subset
- device parameter preserved for explicit CPU/GPU control
- TUTORIAL_DOC_IDS constant maintained (177 docs)
- No max_docs limit (unlike PR #58, "refactor: improve RAG tutorial notebooks and generalize corpus loader", which hardcoded 2000)
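The wrapper relationship described above could look roughly like the following sketch. It is an illustration only: the real signatures in chroma_loader.py are not shown in this PR text, so the `docs` argument, the placeholder IDs, and the in-memory stand-in for `load_or_build_chroma()` are all assumptions.

```python
from typing import Any, Dict, Iterable, List, Optional

# Placeholder values; the real TUTORIAL_DOC_IDS constant holds 177 document IDs.
TUTORIAL_DOC_IDS = frozenset({"doc-1", "doc-2"})


def load_or_build_chroma(
    docs: Iterable[Dict[str, Any]],
    filter_ids: Optional[Iterable[str]] = None,
    device: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the generic loader.

    It only selects the documents that would be indexed; the real function
    would also embed them and build the ChromaDB collection. `device` is
    accepted but unused in this sketch.
    """
    allowed = None if filter_ids is None else set(filter_ids)
    return [d for d in docs if allowed is None or d["id"] in allowed]


def load_or_build_govt_chroma(
    docs: Iterable[Dict[str, Any]],
    load_only_tutorial_docs: bool = False,
    device: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Legacy-style wrapper: maps the old boolean flag onto filter_ids."""
    filter_ids = TUTORIAL_DOC_IDS if load_only_tutorial_docs else None
    return load_or_build_chroma(docs, filter_ids=filter_ids, device=device)
```

Expressing the tutorial subset as a `filter_ids` value keeps the old flag working while leaving the generic loader free of govt-specific logic.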

Changes:

- Add src/granite_switch/tutorials/chroma_loader.py with new implementation
- Update pyproject.toml with sentence-transformers>=3.0.0 and datasets>=2.0.0
- Update rag_101.ipynb and rag_flow.ipynb to import from new module
- Add deprecation warning to legacy govt_data_loader.py
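A deprecation warning of the kind listed above is usually emitted with the standard `warnings` module. The sketch below is an assumption about how that could be done, not the code in this PR; the helper name `warn_legacy_loader` and the message wording are hypothetical, and the import path is inferred from the file path in the change list.

```python
import warnings


def warn_legacy_loader() -> None:
    """Emit a DeprecationWarning pointing callers at the new module."""
    warnings.warn(
        "govt_data_loader is deprecated; import from "
        "granite_switch.tutorials.chroma_loader instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this helper
    )
```

Python hides `DeprecationWarning` by default outside `__main__` and test runners, so notebook users may only see it if warnings are enabled explicitly.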

Based on improvements from PR #58 with adjustments for project requirements.

antonpibm · 2026-05-24T15:21:27Z

@@ -1,12 +1,30 @@
 """Load or build the ChromaDB corpus for the govt RAG tutorial.

+.. deprecated:: 0.2.0


Should we really keep this file?

Set max_seq_length on the model itself instead of passing it to encode(). Newer versions of sentence-transformers do not accept max_seq_length as an encode() parameter, so passing it in kwargs raised a ValueError. The value is now set as a model attribute after instantiation.
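A minimal sketch of that fix, assuming a model object that exposes a writable `max_seq_length` attribute and an `encode()` method, as `SentenceTransformer` does. The helper name `embed_with_max_length` is hypothetical and not taken from the PR.

```python
from typing import Any, Sequence


def embed_with_max_length(model: Any, texts: Sequence[str], max_length: int = 1024):
    """Encode texts after configuring the truncation length on the model."""
    # Set the limit on the model; do not forward it through encode() kwargs.
    model.max_seq_length = max_length
    return model.encode(list(texts))


# Usage with the real library (the model name is a placeholder, not from the PR):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("<embedding-model-name>", device="cpu")
# vectors = embed_with_max_length(model, ["some passage"])
```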

freunda requested review from antonpibm and yairallouche as code owners May 24, 2026 15:16

freunda requested a review from oferbillerilibmcom May 24, 2026 15:17

antonpibm reviewed May 24, 2026

Alon Freund added 3 commits May 24, 2026 20:25

Remove deprecated govt_data_loader.py

85d0dc2

Fix batch_size handling for sentence-transformers

0d6ad08

Fixes TypeError when batch_size is None by omitting the parameter to let sentence-transformers use auto-tuning.
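The approach in this commit message can be sketched as follows: build the keyword arguments conditionally so that `batch_size` is only forwarded when the caller supplied one. The helper name `encode_batched` is hypothetical; what the library does when the argument is omitted is left to the library.

```python
from typing import Any, Optional, Sequence


def encode_batched(model: Any, texts: Sequence[str], batch_size: Optional[int] = None):
    """Call model.encode(), omitting batch_size entirely when it is None."""
    kwargs = {}
    if batch_size is not None:
        # Forward only an explicit value; passing None is what caused the TypeError.
        kwargs["batch_size"] = batch_size
    return model.encode(list(texts), **kwargs)
```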

freunda wants to merge 4 commits into main from feature/chroma-loader-improvements

2 participants
