This repo provides an end-to-end, reproducible pipeline to align an open Hugging Face LLM for finance/risk-facing assistant behavior on a small budget:
- SFT (instruction tuning) with QLoRA (LoRA on 4-bit base)
- Preference dataset generation (chosen/rejected) with a transparent rubric
- DPO (RLHF-style preference optimization) with TRL
- Lightweight eval + reporting
- Optional FastAPI serving
- Conventional AWS architecture: EC2 (GPU) + S3 (artifacts) + IAM (least privilege)
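The DPO step above optimizes a simple closed-form preference loss. A minimal sketch of the per-pair objective in pure Python (illustrative only; the actual training uses TRL's `DPOTrainer`, and the log-probabilities and `beta` below are made-up numbers):

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    margin = beta * (policy_logratio - ref_logratio)
    # -log(sigmoid(x)) written stably as log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# When the policy prefers the chosen response more strongly than the
# reference model does, the margin is positive and the loss falls below log(2).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
```

At equal log-ratios the loss sits exactly at log(2); training pushes it down by widening the policy's chosen-vs-rejected gap relative to the frozen reference.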
Setup:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
pip install -e .
```

Note: bitsandbytes is installed only on Linux (the GPU target). On non-Linux machines, run smoke tests with `load_in_4bit: false`.
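A small guard can pick the 4-bit flag automatically instead of editing the config by hand. A sketch using only the standard library (the `load_in_4bit` key follows the note above; the `overrides` dict is a hypothetical name for wherever you apply config overrides):

```python
import importlib.util
import platform

def use_4bit() -> bool:
    """4-bit loading needs bitsandbytes, which this repo installs only on Linux."""
    return platform.system() == "Linux" and importlib.util.find_spec("bitsandbytes") is not None

# Hypothetical config override: enable 4-bit only where it can actually run.
overrides = {"load_in_4bit": use_4bit()}
```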
Point the Hugging Face caches at a large volume:

```bash
export HF_HOME=/mnt/ebs/hf
export TRANSFORMERS_CACHE=/mnt/ebs/hf/transformers
export HF_DATASETS_CACHE=/mnt/ebs/hf/datasets
mkdir -p "$HF_HOME"
```

If you use a gated model:

```bash
export HF_TOKEN=...
```

Run the full pipeline:

```bash
bash scripts/train_sft.sh
bash scripts/make_prefs.sh
bash scripts/train_dpo.sh
bash scripts/run_eval.sh
```

Sync artifacts to S3:

```bash
export S3_URI="s3://YOUR_BUCKET/finrlhf"
bash scripts/sync_s3.sh
```

Smoke test: runs a minimal end-to-end pipeline (prepare_sft -> sft -> make_preferences -> dpo) for validation.

```bash
bash scripts/smoke_test.sh
```

Outputs:
- `outputs/sft/` (SFT adapter)
- `outputs/prefs/` (preference JSONL)
- `outputs/dpo/` (DPO adapter)
- `reports/results.json` (eval summary)
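The preference JSONL holds one pair per line. A minimal reader sketch, assuming the TRL-style `prompt`/`chosen`/`rejected` field names (check your generated file for the exact schema; the sample record below is invented):

```python
import json
from pathlib import Path

def load_preferences(path):
    """Read preference pairs from a JSONL file: one JSON object per non-empty line."""
    pairs = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            pairs.append(json.loads(line))
    return pairs

# Illustrative record in the assumed chosen/rejected format:
sample = {"prompt": "Define VaR.", "chosen": "Value at Risk is ...", "rejected": "idk"}
```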
Train SFT:

```bash
python -m finrlhf.data.prepare_sft --config configs/sft_qwen25_7b.yaml
python -m finrlhf.train.sft --config configs/sft_qwen25_7b.yaml
```

Generate preference pairs:

```bash
python -m finrlhf.data.make_preferences --config configs/prefs_qwen25_7b.yaml
```

Train DPO:

```bash
python -m finrlhf.train.dpo --config configs/dpo_qwen25_7b.yaml
```

Eval:

```bash
python -m finrlhf.eval.run_eval --config configs/eval.yaml
```

Serve:

```bash
python -m finrlhf.serve.app --config configs/serve.yaml
```

Cost-saving tips:
- QLoRA (4-bit) + small batch + grad accumulation
- seq_len <= 1024
- save adapters, not merged full weights
- prune checkpoints
- store only key artifacts in S3
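The "save adapters, not merged full weights" tip is easy to quantify. A back-of-the-envelope sketch in pure Python (the rank, hidden size, layer count, and target modules are illustrative assumptions, not this repo's exact config):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two low-rank factors per target matrix: A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)

# Illustrative: rank-16 adapters on q/k/v/o projections (4096 x 4096) across 32 layers.
hidden = 4096
adapter_params = 32 * 4 * lora_param_count(hidden, hidden, 16)
adapter_mb = adapter_params * 2 / 1e6   # fp16 bytes -> MB (tens of MB)
full_gb = 7e9 * 2 / 1e9                 # merged fp16 7B checkpoint -> GB (~14 GB)
```

Under these assumptions the adapter is a few hundred times smaller than a merged checkpoint, which is what makes storing every run's artifacts in S3 affordable.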
See docs/architecture.md.