refactor: improve RAG tutorial notebooks and generalize corpus loader #58
Open
hansolosan wants to merge 3 commits into
Install uv first via %pip, then use it for the main package install. Add a commented-out alternative source-install line for development. Bump requires-python from >=3.10 to >=3.11 to align with Python 3.11+ features used by the tutorials.
Set max_length=2048 (up from 1024) and max_docs=2000 to use more context and index a larger portion of the corpus. Remove device="cpu" to enable automatic GPU detection (CUDA when available). Remove load_only_tutorial_docs=True to embed the full configured document count instead of a curated subset.
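For reviewers less familiar with the pattern, automatic device selection usually amounts to a check like the one below. This is a minimal sketch, not the notebook's code: `pick_device` is a hypothetical helper, and it assumes PyTorch is the backend, falling back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu".

    Hypothetical helper for illustration only; the notebooks in this PR
    simply stop passing device="cpu" and rely on the library default.
    """
    try:
        import torch  # assumed backend; optional dependency
    except ImportError:
        # Without PyTorch there is nothing to probe, so stay on CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"embedding on: {device}")
```

On a Colab T4 runtime this kind of check would be expected to select the GPU, while a CPU-only runtime keeps working without any code change.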
Rename govt_data_loader.py to chroma_loader.py and expose a generic load_or_build_chroma() function that supports MT-RAG benchmark corpora (by name), local JSONL files, and HuggingFace datasets. The govt-specific load_or_build_govt_chroma() wrapper is kept for backward compatibility. Add a FIQA corpus example cell in rag_full_pipeline.ipynb showing how to switch to a finance domain corpus with a single load_or_build_chroma() call. Add datasets>=2.0.0 and sentence-transformers>=3.0.0 to the tutorials optional dependency group for HF dataset loading and embedding support.
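The shape described above (one generic entry point, several corpus sources, a thin wrapper kept for old callers) can be sketched as follows. Everything here is illustrative rather than the code in `chroma_loader.py`: `iter_jsonl_docs`, `describe_source`, and `describe_govt_source` are hypothetical names, `jsonl_path` and `hf_dataset` are assumed parameter names, and `"govt"` as a corpus name is an assumption. Only `corpus_name` appears in this PR.

```python
import json
from pathlib import Path
from typing import Iterator, Optional


def iter_jsonl_docs(path: str, max_docs: Optional[int] = None) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, stopping after max_docs."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if max_docs is not None and count >= max_docs:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on them
            yield json.loads(line)
            count += 1


def describe_source(
    corpus_name: Optional[str] = None,
    jsonl_path: Optional[str] = None,   # assumed parameter name
    hf_dataset: Optional[str] = None,   # assumed parameter name
) -> str:
    """Resolve exactly one corpus source; reject zero or several."""
    candidates = {
        "benchmark": corpus_name,
        "jsonl": jsonl_path,
        "huggingface": hf_dataset,
    }
    given = {kind: value for kind, value in candidates.items() if value is not None}
    if len(given) != 1:
        raise ValueError("specify exactly one corpus source")
    kind, value = next(iter(given.items()))
    return f"{kind}:{value}"


def describe_govt_source() -> str:
    """Backward-compatible wrapper: delegates to the generic function."""
    return describe_source(corpus_name="govt")  # "govt" is an assumed name
```

Keeping the old function as a one-line delegate means existing notebook cells continue to run while new cells can switch corpora by changing a single argument.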
Author: @lastras fyi.
Summary
- Install with uv (`%pip install -q uv` → `!uv pip install`); bump `requires-python` to `>=3.11`
- Raise `max_length` 1024 → 2048, add `max_docs=2000` to cap corpus size, remove hard-coded `device="cpu"` to auto-detect GPU
- Rename `govt_data_loader.py` → `chroma_loader.py` with a new `load_or_build_chroma()` that supports MT-RAG benchmark corpora by name, local JSONL files, and HuggingFace datasets; the existing `load_or_build_govt_chroma()` wrapper is kept for backward compatibility
- Add a FIQA corpus example cell in `rag_full_pipeline.ipynb` showing how to switch to a finance domain corpus with a single call
- Add `datasets>=2.0.0` and `sentence-transformers>=3.0.0` to the `tutorials` extra

Test plan
- Open `rag_full_pipeline.ipynb` in Colab (T4 GPU), run install cell: `uv pip install` completes without error
- `chroma_loader` import succeeds and `govt_data_loader` is gone
- `load_or_build_govt_chroma` downloads, embeds 2000 docs, and persists to `./govt_chroma`
- `load_or_build_chroma(corpus_name="fiqa", ...)` downloads and embeds successfully
- Open `rag_101.ipynb` in Colab, confirm install cell and corpus load cell both work
- `git log --oneline` shows exactly 3 commits on top of `main`
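The rename check in the test plan could be automated with a small smoke test along these lines. It is a sketch only: `module_available` and `check_loader_rename` are hypothetical helpers, and it assumes both loaders are importable as top-level modules, which may not match the repository's actual package layout.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


def check_loader_rename(
    new: str = "chroma_loader",
    old: str = "govt_data_loader",
) -> bool:
    """True when the new module is importable and the old one is gone."""
    return module_available(new) and not module_available(old)
```

Running `check_loader_rename()` in a notebook cell after the install step would give a single boolean for that test-plan item without importing either module.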