I borrowed reasoning from a 744B and a 1T model and taught it to a 4B model that runs on a laptop.
2,083 problems. Two teachers (GLM-5, Kimi K2.5). Qwen3.5-4B as student. SFT then GRPO. One controlled experiment.
Dev Log | GLM-5 Dataset | Kimi Dataset
Frontier models reason differently than small models — they think out loud, backtrack, check their work. That reasoning process is the valuable thing. If you can capture it and train a small model to imitate it, the small model punches way above its weight.
But does it matter which frontier model you learn from? GLM-5 writes verbose, detailed reasoning chains (median 433 tokens). Kimi K2.5 is more concise and elegant (median 325 tokens). Same problems, same student — does the teacher's style matter?
I fed 2,083 reasoning problems to both models via Ollama cloud, captured every thought, filtered through 8 quality gates, trained three separate Qwen3.5-4B students, and compared them against models up to 7x their size.
Every step — including the mistakes — is in the Dev Log.
A key question in distillation is: does the teacher model matter? To test this, we picked two frontier reasoning models with very different styles:
|  | GLM-5 | Kimi K2.5 |
|---|---|---|
| Parameters | 744B (40B active) | ~1T (32B active) |
| AIME 2026 | 92.7% | Strong |
| Reasoning style | Verbose, thorough (median 433 tokens) | Concise, elegant (median 325 tokens) |
| Filter keep rate | 83.7% (more accurate, but rambles) | 86.5% (more errors, but cleaner traces) |
| License | MIT | MIT-compatible |
| Ollama tag | glm-5:cloud | kimi-k2.5:cloud |
Both are free to use via Ollama cloud tags — the model runs on Ollama's servers, and you just send API calls. Because the licenses are MIT or MIT-compatible, the outputs can legally be used as training data (we verified this).
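Calling a cloud teacher is a single API call. A minimal sketch with the ollama Python client (the prompt is illustrative; the real request, parsing, and retry logic is in scripts/generate_traces.py):

```python
# Sketch: request one reasoning trace from a cloud-hosted teacher.
import ollama

resp = ollama.chat(
    model="glm-5:cloud",  # or "kimi-k2.5:cloud"
    messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
)
text = resp["message"]["content"]

# Reasoning models return their chain of thought inside <think>...</think>
# (newer clients may also expose it as a separate "thinking" field);
# everything after the closing tag is the final answer.
thinking, _, answer = text.partition("</think>")
print(thinking.removeprefix("<think>").strip())
print(answer.strip())
```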
The setup: Same 2,083 problems → both teachers → filter → train separate Qwen3.5-4B students → compare. Plus a third student trained on both trace sets combined. This tells us whether verbose reasoning (GLM-5) or concise reasoning (Kimi) transfers better to a small 4B model.
Evaluated on 4 clean benchmarks (zero overlap with training data) + 5 hand-written trick questions. All benchmarks verified for contamination before scoring.
Full benchmark results coming — using lm-evaluation-harness (the industry standard used by HuggingFace Open LLM Leaderboard) on Colab Pro H100. All 10 models evaluated with identical methodology.
| Benchmark | Shots | Method | Task |
|---|---|---|---|
| GSM8K | 8-shot CoT | Generative | gsm8k_cot |
| MATH | 4-shot | Generative (\boxed{}) | minerva_math |
| ARC-Challenge | 25-shot | Log-likelihood | arc_challenge |
| GPQA Diamond | 0-shot | Log-likelihood | gpqa_diamond |
| MMLU-Pro | 5-shot | Log-likelihood | mmlu_pro |
Preliminary GSM8K (zero-shot, Tinker eval): Distillation showed a +35 point lift (37% base → 72% Kimi distilled) and the distilled 4B outperformed a raw Qwen3-8B. Concise Kimi traces transferred better than verbose GLM-5 traces. Full results with proper lm-eval methodology pending.
Why lm-eval? We initially wrote custom extraction code and got garbage: 0% on MATH (extraction only checked \boxed{}), 100% on ARC (regex too generous). ARC/GPQA/MMLU-Pro use log-likelihood scoring, not generation — our approach was fundamentally wrong. Details in the Dev Log.
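For reference, a run through the harness's Python entry point looks roughly like this (a sketch: the checkpoint path is a placeholder, task names are taken from the table above, and the per-task shot counts are configured in the actual runs):

```python
# Sketch: score a merged checkpoint with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./qwen3.5-4b-kimi-distilled,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k_cot", "minerva_math", "arc_challenge", "gpqa_diamond", "mmlu_pro"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```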
We also caught a contamination issue mid-project — 94% overlap on our first eval attempt. Details in the Dev Log.
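A cheap guard is to measure overlap between normalized training problems and eval questions before scoring anything. A sketch (the normalization and field handling are illustrative, not the repo's exact check):

```python
# Sketch: estimate how many eval questions also appear in the training data.
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation/whitespace so near-identical strings still match.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def overlap_rate(train_problems: list[str], eval_questions: list[str]) -> float:
    train_set = {normalize(p) for p in train_problems}
    hits = sum(1 for q in eval_questions if normalize(q) in train_set)
    return hits / max(len(eval_questions), 1)

# Anything far above ~0% means the benchmark is measuring memorization, not reasoning.
```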
| Step | What |
|---|---|
| 1. Collect | GSM8K, MATH, ARC, HumanEval — 2,083 problems |
| 2. Generate | Both teachers solve all problems with full <think> traces via Ollama cloud (4 parallel workers each) |
| 3. Filter | 8 quality gates: correctness, length bounds, coherence, repetition, structured reasoning |
| 4. Format | Chat format with <think>/<answer> tags, stratified train/val split |
| 5. SFT | LoRA fine-tune 3 models on Tinker: GLM-5 traces, Kimi traces, combined |
| 6. Benchmark | Eval on 4 clean benchmarks + 5 trick questions, 8 models compared |
| 7. GRPO | Reinforcement learning on all 3 SFT models — reward correct answers |
| 8. Final eval | Full comparison on Colab Pro (fast local GPU inference) |
| 9. Export | Merge LoRA → push to HuggingFace → GGUF for Ollama |
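The reward for step 7 is rule-based: a completion earns credit only if its final answer matches the reference. A minimal sketch, assuming completions keep the <answer> tags from the SFT format (the actual reward shaping may differ):

```python
# Sketch: binary GRPO reward based on the extracted final answer.
import re

def extract_answer(completion: str) -> str:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def reward(completion: str, expected: str) -> float:
    return 1.0 if extract_answer(completion) == expected.strip() else 0.0
```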
4,166 total traces generated (2,083 per teacher) using Ollama cloud with 4 parallel workers per model. ~7 hours each, running concurrently.
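Conceptually, the generation scripts just fan the problem list out across a small worker pool. A sketch (error handling and checkpointing omitted):

```python
# Sketch: generate traces with 4 concurrent workers against one cloud teacher.
from concurrent.futures import ThreadPoolExecutor
import ollama

def solve(problem: str) -> str:
    resp = ollama.chat(
        model="kimi-k2.5:cloud",
        messages=[{"role": "user", "content": problem}],
    )
    return resp["message"]["content"]

def generate_all(problems: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(solve, problems))
```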
| Gate | What it checks | GLM-5 dropped | Kimi dropped |
|---|---|---|---|
| 1. Non-empty | Thinking + response exist | 2 | 0 |
| 2. Language quality | No encoding artifacts | 0 | 0 |
| 3. Length bounds | 50-4000 thinking tokens | 101 | 5 |
| 4. Correctness | Answer matches expected | 160 | 235 |
| 5. Repetition | No degenerate loops | 0 | 0 |
| 6. Coherence | Thinking references problem | 22 | 25 |
| 7. Self-contradiction | Max 2 self-corrections | 0 | 0 |
| 8. Structured reasoning | Step indicators present | 51 | 0 |
- GLM-5: 1,744/2,083 kept (83.7%) — more accurate but very verbose
- Kimi: 1,802/2,083 kept (86.5%) — more errors but cleaner traces
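A few of the gates, sketched as simple predicates (the real thresholds and answer normalization live in scripts/filter_traces.py; these helpers are illustrative):

```python
# Sketch: three of the eight quality gates.
def passes_length(thinking: str, lo: int = 50, hi: int = 4000) -> bool:
    # Gate 3: thinking must fall within 50-4000 tokens (whitespace split as a proxy).
    return lo <= len(thinking.split()) <= hi

def passes_correctness(response: str, expected: str) -> bool:
    # Gate 4: the final answer must match the reference (normalization simplified here).
    return expected.strip().lower() in response.lower()

def passes_structure(thinking: str) -> bool:
    # Gate 8: the trace should show explicit step-by-step structure.
    indicators = ("step", "first", "then", "therefore", "so ")
    return any(tok in thinking.lower() for tok in indicators)
```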
| Dataset | Train | Val |
|---|---|---|
| GLM-5 traces | 1,572 | 172 |
| Kimi traces | 1,624 | 178 |
| Combined | 3,196 | 350 |
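For reference, each record in the SFT sets is a chat example with the teacher's reasoning and final answer kept inside tags, roughly like this (field names are illustrative; see scripts/format_for_sft.py and the dataset cards for the exact schema):

```python
# Sketch of one formatted training record (content abridged).
example = {
    "messages": [
        {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
        {
            "role": "assistant",
            "content": "<think>Subtract 5: 3x = 15, so x = 5.</think><answer>5</answer>",
        },
    ]
}
```

The full pipeline that produces these records can be reproduced with the commands below.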
```bash
git clone https://github.com/brianmeyer/distillreasoning.git
cd distillreasoning
python3 -m venv venv && source venv/bin/activate
pip install ollama datasets huggingface_hub tinker tinker-cookbook
# 1. Download 2,083 problems
python scripts/download_problems.py
# 2. Generate reasoning traces (runs overnight, 4 parallel workers each)
python scripts/generate_traces.py # GLM-5
python scripts/generate_traces_kimi.py # Kimi K2.5
# 3. Filter + format
python scripts/filter_traces.py glm5
python scripts/filter_traces.py kimi
python scripts/format_for_sft.py glm5
python scripts/format_for_sft.py kimi
# 4. Upload to HuggingFace
HF_TOKEN=xxx python scripts/upload_dataset.py
# 5. Train on Tinker (3 models in parallel)
TINKER_API_KEY=xxx python scripts/train_tinker.py
# 6. GRPO + eval + merge + publish (Colab Pro H100)
# Open notebooks/distill_pipeline.ipynb
```

Or just use the pre-built datasets:

```python
from datasets import load_dataset
# GLM-5 traces
ds = load_dataset("bmeyer2025/glm5-reasoning-traces-sft")
# Kimi traces
ds = load_dataset("bmeyer2025/kimi-reasoning-traces-sft")distillreasoning/
├── scripts/
│ ├── download_problems.py # Pull GSM8K, MATH, ARC, HumanEval from HF
│ ├── generate_traces.py # GLM-5 → reasoning traces (4 parallel workers)
│ ├── generate_traces_kimi.py # Kimi K2.5 → reasoning traces
│ ├── filter_traces.py # 8-gate quality filter (accepts: glm5|kimi)
│ ├── format_for_sft.py # Stratified chat format (accepts: glm5|kimi)
│ ├── upload_dataset.py # Push 4 datasets to HuggingFace
│ └── train_tinker.py # LoRA SFT on Tinker (3 models)
├── notebooks/
│ └── distill_pipeline.ipynb # Colab Pro H100: GRPO + lm-eval + merge + publish
├── cards/
│ ├── dataset_card.md # HuggingFace dataset card
│ └── model_card.md # HuggingFace model card
├── images/ # Generated hero images (Gemini)
├── DEVLOG.md # Full build log — every step and mistake
└── README.md
```
| Dataset | Teacher | Format | Link |
|---|---|---|---|
| glm5-reasoning-traces | GLM-5 | Raw (problem, thinking, response) | HuggingFace |
| glm5-reasoning-traces-sft | GLM-5 | SFT-ready (train/val) | HuggingFace |
| kimi-reasoning-traces | Kimi K2.5 | Raw (problem, thinking, response) | HuggingFace |
| kimi-reasoning-traces-sft | Kimi K2.5 | SFT-ready (train/val) | HuggingFace |
- Concise teachers beat verbose ones for small students. Kimi's 325-token median traces produced a better 4B student than GLM-5's 433-token traces. A 4B model can't absorb thousands of tokens of verbose reasoning per problem — it overwhelms the model's capacity.
- Distillation can beat 2x model size. Our 4B distilled model (72.6% GSM8K) outperforms a raw Qwen3-8B (63.0%). Targeted training on reasoning traces is more effective than raw scale.
- Benchmark contamination is easy to miss. Our first eval showed 75-80% accuracy — actually 94% data overlap with training. Always verify eval data is clean before celebrating.
- The teacher's reasoning style transfers, not just answers. The distilled model reasons step-by-step, sets up variables, writes equations, and verifies answers — patterns learned from the teacher traces. The reasoning skill transferred, not just the format.
- More data isn't always better. Combined traces (3,196) scored 71.3% vs Kimi-only (1,624) at 72.6%. The verbose GLM-5 traces in the combined set may have added noise rather than signal.
(More findings after GRPO and final eval)
MIT
