refactor: improve RAG tutorial notebooks and generalize corpus loader by hansolosan · Pull Request #58 · generative-computing/granite-switch

hansolosan · 2026-05-22T21:03:29Z

Summary

Switch to uv for notebook package installation (%pip install -q uv → !uv pip install); bump requires-python to >=3.11
Update embedding configuration: max_length 1024→2048, add max_docs=2000 to cap corpus size, remove hard-coded device="cpu" to auto-detect GPU
Rename and generalize corpus loader: govt_data_loader.py → chroma_loader.py with a new load_or_build_chroma() that supports MT-RAG benchmark corpora by name, local JSONL files, and HuggingFace datasets; the existing load_or_build_govt_chroma() wrapper is kept for backward compatibility
Add FIQA corpus example in rag_full_pipeline.ipynb showing how to switch to a finance domain corpus with a single call
Add dependencies: datasets>=2.0.0 and sentence-transformers>=3.0.0 to the tutorials extra

Test plan

Open rag_full_pipeline.ipynb in Colab (T4 GPU), run install cell — uv pip install completes without error
Confirm chroma_loader import succeeds and govt_data_loader is gone
Run the govt corpus load cell — load_or_build_govt_chroma downloads, embeds 2000 docs, and persists to ./govt_chroma
Run the FIQA example cell — load_or_build_chroma(corpus_name="fiqa", ...) downloads and embeds successfully
Open rag_101.ipynb in Colab, confirm install cell and corpus load cell both work
git log --oneline shows exactly 3 commits on top of main
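
Two of the checks above (the import check and the commit count) can be scripted. The following is a minimal sketch, not part of the PR: the helper names are hypothetical, and it assumes the loader modules are importable as top-level modules from the current working directory.

```python
import importlib.util
import subprocess


def module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found on the import path."""
    return importlib.util.find_spec(name) is not None


def count_oneline_commits(log_output: str) -> int:
    """Count commits in the text printed by `git log --oneline` (one per line)."""
    return sum(1 for line in log_output.splitlines() if line.strip())


def commits_on_top_of(base: str = "main") -> int:
    """Run `git log --oneline <base>..HEAD` in the current repo and count the lines."""
    result = subprocess.run(
        ["git", "log", "--oneline", f"{base}..HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_oneline_commits(result.stdout)
```

On the PR branch one would expect `module_available("chroma_loader")` to be true, `module_available("govt_data_loader")` to be false, and `commits_on_top_of("main")` to return 3, provided the directory containing the loader is on `sys.path`.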

Install uv first via %pip, then use it for the main package install. Adds a commented-out alternative source-install line for development. Bumps requires-python from >=3.10 to >=3.11 to align with Python 3.11+ features used by the tutorials.
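
The two-step install can also be expressed in plain Python outside a notebook. This is a sketch under assumptions: the package spec `granite-switch[tutorials]` is a guess based on the repository name and the tutorials extra, not something the PR states, and depending on the environment uv may need an explicit target such as its `--system` flag.

```python
import subprocess
import sys

# Hypothetical package spec; the notebooks may install a different target.
PACKAGE_SPEC = "granite-switch[tutorials]"


def meets_requires_python(version_info=sys.version_info, minimum=(3, 11)) -> bool:
    """Check the interpreter against the bumped requires-python floor (>=3.11)."""
    return tuple(version_info[:2]) >= minimum


def install_commands(package_spec: str = PACKAGE_SPEC) -> list:
    """Return the two commands in order: bootstrap uv with pip, then install with uv."""
    return [
        [sys.executable, "-m", "pip", "install", "-q", "uv"],
        ["uv", "pip", "install", package_spec],
    ]


def run_install(package_spec: str = PACKAGE_SPEC) -> None:
    """Run both commands, stopping at the first failure."""
    if not meets_requires_python():
        raise RuntimeError("Python 3.11 or newer is required")
    for command in install_commands(package_spec):
        subprocess.run(command, check=True)
```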

Set max_length=2048 (up from 1024) to use more context per document, and max_docs=2000 to index a larger portion of the corpus. Remove device="cpu" to enable automatic GPU detection (CUDA when available). Remove load_only_tutorial_docs=True to embed the full configured document count instead of a curated subset.
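
The effect of these three settings can be illustrated in isolation. The sketch below is not the notebook code: the helper names are hypothetical, whitespace-split tokens stand in for the embedding model's tokenizer, and the device check assumes PyTorch is the backend (falling back to CPU when it is not installed).

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def pick_device() -> str:
    """Prefer CUDA when PyTorch reports a usable GPU; otherwise use the CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def cap_docs(docs: Iterable[str], max_docs: Optional[int] = 2000) -> Iterator[str]:
    """Yield at most `max_docs` documents; None means no cap."""
    return islice(docs, max_docs)


def truncate_tokens(tokens: List[str], max_length: int = 2048) -> List[str]:
    """Keep only the first `max_length` tokens of a document."""
    return tokens[:max_length]


if __name__ == "__main__":
    corpus = (f"document {i} " + "word " * 3000 for i in range(5000))
    kept = [truncate_tokens(doc.split()) for doc in cap_docs(corpus)]
    print(pick_device(), len(kept), len(kept[0]))
```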

Rename govt_data_loader.py to chroma_loader.py and expose a generic load_or_build_chroma() function that supports MT-RAG benchmark corpora (by name), local JSONL files, and HuggingFace datasets. The govt-specific load_or_build_govt_chroma() wrapper is kept for backward compatibility. Add a FIQA corpus example cell in rag_full_pipeline.ipynb showing how to switch to a finance domain corpus with a single load_or_build_chroma() call. Add datasets>=2.0.0 and sentence-transformers>=3.0.0 to the tutorials optional dependency group for HF dataset loading and embedding support.
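
The shape of a generic loader with a backward-compatible wrapper might look roughly like the following. This is a hedged sketch using only the standard library, not the actual chroma_loader.py: every function name, parameter name, and the "govt" corpus key are assumptions, and the caller-supplied `build` callback stands in for the real download, embed, and persist work.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Callable, List, Optional, Tuple


def resolve_source(corpus_name: Optional[str] = None,
                   jsonl_path: Optional[str] = None,
                   hf_dataset: Optional[str] = None) -> Tuple[str, str]:
    """Return (kind, value) for exactly one of the three source types."""
    candidates = (("benchmark", corpus_name),
                  ("jsonl", jsonl_path),
                  ("huggingface", hf_dataset))
    given = [(kind, value) for kind, value in candidates if value is not None]
    if len(given) != 1:
        raise ValueError("pass exactly one of corpus_name, jsonl_path, hf_dataset")
    return given[0]


def read_jsonl(path: str, max_docs: Optional[int] = None) -> List[dict]:
    """Read up to `max_docs` JSON objects, one per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        rows = (json.loads(line) for line in fh if line.strip())
        return list(islice(rows, max_docs))


def load_or_build_sketch(persist_dir: str,
                         build: Callable[[Path, Tuple[str, str]], None],
                         **source) -> str:
    """Reuse a non-empty persist directory; otherwise build it once."""
    resolved = resolve_source(**source)
    target = Path(persist_dir)
    if target.is_dir() and any(target.iterdir()):
        return "loaded"
    target.mkdir(parents=True, exist_ok=True)
    build(target, resolved)
    return "built"


def load_or_build_govt_sketch(persist_dir: str, build) -> str:
    """Compatibility wrapper: pins the corpus and delegates to the generic path."""
    return load_or_build_sketch(persist_dir, build, corpus_name="govt")
```

Under these assumptions, switching to a finance corpus is a single call such as `load_or_build_sketch("./fiqa_store", build, corpus_name="fiqa")`, while existing callers of the govt wrapper keep working unchanged.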

hansolosan · 2026-05-22T22:47:30Z

@lastras fyi.

hansolosan added 3 commits May 22, 2026 17:00

refactor: switch package manager to uv

ef9703d


hansolosan requested review from antonpibm, freunda and yairallouche as code owners May 22, 2026 21:03

freunda mentioned this pull request May 24, 2026

Improve ChromaDB loader with sentence-transformers backend #67

Open

hansolosan wants to merge 3 commits into generative-computing:main from primeqa:pr/rag-notebook-improvements
