Improve ChromaDB loader with sentence-transformers backend #67

Open
freunda wants to merge 4 commits into main from feature/chroma-loader-improvements

@freunda freunda commented May 24, 2026

Adds a new chroma_loader module that improves on the legacy govt_data_loader
with modern embedding capabilities while maintaining backward compatibility.

Key improvements:
- sentence-transformers backend for cleaner API and better batching performance
- Configurable max_length (1024 vs hardcoded 512) for better passage embeddings
- Configurable batch_size with auto-tuning support
- Generic load_or_build_chroma() supporting MT-RAG corpora and HuggingFace datasets
- Flexible filter_ids parameter for document filtering
- Eager loading architecture, so the waiting time is incurred up front and is clearly visible
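
As a rough illustration of the filter_ids idea in the list above, the sketch below keeps only the documents whose ID appears in a given collection and treats None as "no filtering". The function name `filter_documents` and the `"doc_id"` field are assumptions made for this example; they are not taken from chroma_loader.py.

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_documents(
    docs: Iterable[Dict[str, Any]],
    filter_ids: Optional[Iterable[str]] = None,
    id_key: str = "doc_id",  # hypothetical field name
) -> List[Dict[str, Any]]:
    """Return the documents whose ID is in filter_ids; None keeps every document."""
    if filter_ids is None:
        return list(docs)
    wanted = set(filter_ids)  # set membership keeps the scan linear in len(docs)
    return [doc for doc in docs if doc[id_key] in wanted]


corpus = [
    {"doc_id": "a", "text": "first passage"},
    {"doc_id": "b", "text": "second passage"},
    {"doc_id": "c", "text": "third passage"},
]
subset = filter_documents(corpus, filter_ids=["a", "c"])
```

Filtering before embedding, rather than after, means that only the selected passages are ever encoded, which is what makes a small subset cheap on CPU.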

Backward compatibility:
- load_or_build_govt_chroma() wrapper preserves exact API
- load_only_tutorial_docs parameter maintained for T4/CPU-friendly subset
- device parameter preserved for explicit CPU/GPU control
- TUTORIAL_DOC_IDS constant maintained (177 docs)
- No max_docs limit (unlike PR #58, which hardcoded 2000)
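
One way such a compatibility wrapper could be structured is sketched below. The signatures, the default values, the `"govt"` corpus label, and the placeholder IDs are all assumptions for illustration. The generic loader here is a stand-in that only records what it was asked to do; it does not build a ChromaDB collection.

```python
from typing import Any, Dict, Iterable, Optional

# Placeholder IDs for illustration only; the real constant lists 177 documents.
TUTORIAL_DOC_IDS = frozenset({"doc-001", "doc-002"})


def load_or_build_chroma(
    corpus: str,
    filter_ids: Optional[Iterable[str]] = None,
    device: Optional[str] = None,
) -> Dict[str, Any]:
    """Stand-in for the generic loader: records the request instead of building an index."""
    return {
        "corpus": corpus,
        "filter_ids": None if filter_ids is None else frozenset(filter_ids),
        "device": device,
    }


def load_or_build_govt_chroma(
    load_only_tutorial_docs: bool = False,  # default value is an assumption
    device: Optional[str] = None,
) -> Dict[str, Any]:
    """Legacy-style entry point that forwards to the generic loader."""
    filter_ids = TUTORIAL_DOC_IDS if load_only_tutorial_docs else None
    # No max_docs cap is applied: without a filter, the whole corpus is requested.
    return load_or_build_chroma("govt", filter_ids=filter_ids, device=device)
```

Expressing the tutorial subset as a filter_ids value keeps the legacy flag as a thin layer over the generic parameter, so the two entry points cannot drift apart.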

Changes:
- Add src/granite_switch/tutorials/chroma_loader.py with new implementation
- Update pyproject.toml with sentence-transformers>=3.0.0 and datasets>=2.0.0
- Update rag_101.ipynb and rag_flow.ipynb to import from new module
- Add deprecation warning to legacy govt_data_loader.py
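
For the deprecation warning on the legacy module, a minimal sketch using the standard `warnings` module might look like the following. The helper name and the message text are hypothetical, and the PR may emit the warning differently (for example, at import time).

```python
import warnings


def warn_legacy_loader() -> None:
    """Emit a DeprecationWarning that points callers at the new module."""
    warnings.warn(
        "govt_data_loader is deprecated; import from chroma_loader instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this helper
    )


# Capture the warning to show what a caller would observe.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_legacy_loader()
```

Python hides DeprecationWarning by default outside of `__main__` and test runners, so notebook users may only see it if the filter is relaxed.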

Based on improvements from PR #58 with adjustments for project requirements.
@@ -1,12 +1,30 @@
"""Load or build the ChromaDB corpus for the govt RAG tutorial.

.. deprecated:: 0.2.0

Should we really keep this file?

Alon Freund added 3 commits May 24, 2026 20:25
Fixes the TypeError raised when batch_size is None by omitting the parameter,
which lets sentence-transformers use auto-tuning.

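
The fix described in this commit message can be sketched as building the keyword arguments conditionally, so that `batch_size` is only forwarded when the caller actually set it. The helper name `build_encode_kwargs` is hypothetical and is not taken from the PR.

```python
from typing import Any, Dict, Optional


def build_encode_kwargs(batch_size: Optional[int] = None, **extra: Any) -> Dict[str, Any]:
    """Collect keyword arguments for encode(), leaving out batch_size when it is None."""
    kwargs: Dict[str, Any] = dict(extra)
    if batch_size is not None:
        kwargs["batch_size"] = batch_size
    return kwargs


# Intended use, assuming `model` is a SentenceTransformer instance:
#     embeddings = model.encode(texts, **build_encode_kwargs(batch_size))
default_kwargs = build_encode_kwargs(None, show_progress_bar=False)
explicit_kwargs = build_encode_kwargs(64, show_progress_bar=False)
```

Passing `batch_size=None` explicitly is different from not passing it at all: only the second case lets the library apply its own default.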
Set max_seq_length on the model itself instead of passing it to encode().
Newer versions of sentence-transformers don't accept max_seq_length as
an encode() parameter. This fixes the ValueError that occurred when
calling encode() with max_seq_length in kwargs.

The max_seq_length is now set as a model attribute after instantiation,
which is the correct approach for configuring maximum sequence length
in sentence-transformers.
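
A minimal sketch of the approach described above: the limit is configured once on the model object, and `encode()` is then called without any length argument. `StandInModel` is a hypothetical placeholder so that the snippet runs without downloading a model. With the real library, the object would be a `SentenceTransformer` instance, which exposes a settable `max_seq_length` attribute.

```python
from typing import Any, Optional


class StandInModel:
    """Hypothetical placeholder for a SentenceTransformer instance."""

    def __init__(self) -> None:
        self.max_seq_length = 512  # illustrative starting value


def configure_max_length(model: Any, max_length: Optional[int]) -> Any:
    """Set the maximum sequence length on the model itself, if one was requested."""
    if max_length is not None:
        model.max_seq_length = max_length
    return model


model = configure_max_length(StandInModel(), 1024)
# Later calls would then be plain: model.encode(texts), with no max_seq_length kwarg.
```

Because the setting lives on the model, every subsequent `encode()` call picks it up without each call site having to repeat it.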