
refactor: improve RAG tutorial notebooks and generalize corpus loader #58

Open
hansolosan wants to merge 3 commits into generative-computing:main from primeqa:pr/rag-notebook-improvements

@hansolosan

Summary

  • Switch to uv for notebook package installation (%pip install -q uv, then !uv pip install); bump requires-python to >=3.11
  • Update embedding configuration: max_length 1024→2048, add max_docs=2000 to cap corpus size, remove hard-coded device="cpu" to auto-detect GPU
  • Rename and generalize corpus loader: govt_data_loader.py → chroma_loader.py, with a new load_or_build_chroma() that supports MT-RAG benchmark corpora by name, local JSONL files, and HuggingFace datasets; the existing load_or_build_govt_chroma() wrapper is kept for backward compatibility
  • Add FIQA corpus example in rag_full_pipeline.ipynb showing how to switch to a finance domain corpus with a single call
  • Add dependencies: datasets>=2.0.0 and sentence-transformers>=3.0.0 to the tutorials extra
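The embedding configuration change above (no hard-coded device, a max_docs cap) can be pictured with a minimal sketch. The helper names `pick_device` and `cap_docs` are hypothetical and are not taken from the PR; the sketch also assumes the embedder runs on PyTorch, which the PR does not state.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Assumption: the embedding model is PyTorch-based. If torch is not
    installed, fall back to "cpu" instead of failing.
    """
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def cap_docs(docs: Iterable[dict], max_docs: Optional[int] = 2000) -> Iterator[dict]:
    """Yield at most max_docs documents; max_docs=None disables the cap."""
    return islice(docs, max_docs)


if __name__ == "__main__":
    corpus = ({"id": i, "text": f"doc {i}"} for i in range(5000))
    print(pick_device(), len(list(cap_docs(corpus))))
```

Using `islice` keeps the cap lazy, so a large corpus is never fully materialized before truncation.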

Test plan

  • Open rag_full_pipeline.ipynb in Colab (T4 GPU), run install cell — uv pip install completes without error
  • Confirm chroma_loader import succeeds and govt_data_loader is gone
  • Run the govt corpus load cell — load_or_build_govt_chroma downloads, embeds 2000 docs, and persists to ./govt_chroma
  • Run the FIQA example cell — load_or_build_chroma(corpus_name="fiqa", ...) downloads and embeds successfully
  • Open rag_101.ipynb in Colab, confirm install cell and corpus load cell both work
  • git log --oneline shows exactly 3 commits on top of main
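The import check in the test plan could be scripted roughly as follows. This is a sketch that assumes both loaders are importable as top-level modules named `chroma_loader` and `govt_data_loader`; the actual import path inside the repository may differ.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def check_loader_rename(new: str = "chroma_loader", old: str = "govt_data_loader") -> dict:
    """Report whether the new module is present and the old one is absent.

    Module names are assumptions taken from the file names in the PR.
    """
    return {
        "new_present": module_available(new),
        "old_absent": not module_available(old),
    }


if __name__ == "__main__":
    print(check_loader_rename())
```

`find_spec` only locates the module and does not execute it, so the check stays cheap even when the loader pulls in heavy dependencies.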

Install uv first via %pip, then use it for the main package install. Adds
a commented-out alternative source-install line for development. Bumps
requires-python from >=3.10 to >=3.11 to align with Python 3.11+ features
used by the tutorials.

Set max_length=2048 (up from 1024) and max_docs=2000 to use more context and
index a larger portion of the corpus. Remove device="cpu" to enable automatic
GPU detection (CUDA when available). Remove load_only_tutorial_docs=True to
embed the full configured document count instead of a curated subset.

Rename govt_data_loader.py to chroma_loader.py and expose a generic
load_or_build_chroma() function that supports MT-RAG benchmark corpora
(by name), local JSONL files, and HuggingFace datasets. The govt-specific
load_or_build_govt_chroma() wrapper is kept for backward compatibility.
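One way to picture the generic loader and its backward-compatible wrapper is the sketch below. It is illustrative only: the parameter names `jsonl_path` and `hf_dataset`, the `KNOWN_CORPORA` set, and the `_sketch` functions are hypothetical. The PR itself only shows `corpus_name="fiqa"` and the `./govt_chroma` directory. No embedding or Chroma call is made here; the functions just return a description of what would be loaded.

```python
from typing import Optional, Tuple

# Hypothetical registry: the PR mentions the govt and fiqa corpora only.
KNOWN_CORPORA = {"govt", "fiqa"}


def resolve_source(
    corpus_name: Optional[str] = None,
    jsonl_path: Optional[str] = None,
    hf_dataset: Optional[str] = None,
) -> Tuple[str, str]:
    """Decide which of the three source kinds was requested."""
    candidates = (
        ("benchmark", corpus_name),
        ("jsonl", jsonl_path),
        ("huggingface", hf_dataset),
    )
    given = [(kind, value) for kind, value in candidates if value is not None]
    if len(given) != 1:
        raise ValueError("pass exactly one of corpus_name, jsonl_path, hf_dataset")
    kind, value = given[0]
    if kind == "benchmark" and value not in KNOWN_CORPORA:
        raise ValueError(f"unknown benchmark corpus: {value!r}")
    return kind, str(value)


def load_or_build_chroma_sketch(persist_dir: str, **source: Optional[str]) -> dict:
    """Stand-in for the generic loader: returns a plan instead of a collection."""
    kind, value = resolve_source(**source)
    return {"persist_dir": persist_dir, "kind": kind, "source": value}


def load_or_build_govt_chroma_sketch(persist_dir: str = "./govt_chroma") -> dict:
    """Thin wrapper that keeps the old govt-specific entry point working."""
    return load_or_build_chroma_sketch(persist_dir, corpus_name="govt")


if __name__ == "__main__":
    print(load_or_build_govt_chroma_sketch())
    print(load_or_build_chroma_sketch("./fiqa_chroma", corpus_name="fiqa"))
```

Keeping the wrapper as a one-line delegation means existing notebook cells continue to work while all new behavior lives in a single function.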

Add a FIQA corpus example cell in rag_full_pipeline.ipynb showing how to
switch to a finance domain corpus with a single load_or_build_chroma() call.

Add datasets>=2.0.0 and sentence-transformers>=3.0.0 to the tutorials
optional dependency group for HF dataset loading and embedding support.
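A quick way to confirm that an environment satisfies the two new minimum versions is sketched below. The version comparison is deliberately simplified (first three numeric components only) and is an assumption of this sketch; `packaging.version` would be more robust for pre-release or local version strings.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimums taken from the PR description.
MINIMUMS = {"datasets": "2.0.0", "sentence-transformers": "3.0.0"}


def version_tuple(text: str) -> Tuple[int, ...]:
    """Parse up to three leading numeric components, padding with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", text)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Simplified >= comparison on (major, minor, patch)."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_minimums(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None when it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
        else:
            report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    print(check_minimums())
```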
@hansolosan
Author

@lastras fyi.
