I borrowed reasoning from a 744B and a 1T model and taught it to a 4B model that runs on a laptop.
2,083 problems. Two teachers (GLM-5, Kimi K2.5). Qwen3.5-4B as student. SFT then GRPO. One controlled experiment.
Dev Log | GLM-5 Dataset | Kimi Dataset
Frontier models reason differently than small models — they think out loud, backtrack, check their work. That reasoning process is the valuable thing. If you can capture it and train a small model to imitate it, the small model punches way above its weight.
But does it matter which frontier model you learn from? GLM-5 writes verbose, detailed reasoning chains (median 433 tokens). Kimi K2.5 is more concise and elegant (median 325 tokens). Same problems, same student — does the teacher's style matter?
I fed 2,083 reasoning problems to both models via Ollama cloud, captured every thought, filtered through 8 quality gates, trained three separate Qwen3.5-4B students, and compared them against models up to 7x their size.
Every step — including the mistakes — is in the Dev Log.
A key question in distillation is: does the teacher model matter? To test this, we picked two frontier reasoning models with very different styles:
|  | GLM-5 | Kimi K2.5 |
|---|---|---|
| Parameters | 744B (40B active) | ~1T (32B active) |
| AIME 2026 | 92.7% | Strong |
| Reasoning style | Verbose, thorough (median 433 tokens) | Concise, elegant (median 325 tokens) |
| Filter keep rate | 83.7% (more accurate, but rambles) | 86.5% (more errors, but cleaner traces) |
| License | MIT | MIT-compatible |
| Ollama tag | glm-5:cloud | kimi-k2.5:cloud |
Both are free to use via Ollama cloud tags — the model runs on Ollama's servers, and you just send API calls. Because the licenses are MIT or MIT-compatible, the outputs can legally be used as training data (we verified this).
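Calling a cloud teacher is a single API call. A minimal sketch with the ollama Python client (the prompt is illustrative; the real request, parsing, and retry logic is in scripts/generate_traces.py):

```python
# Sketch: request one reasoning trace from a cloud-hosted teacher.
import ollama

resp = ollama.chat(
    model="glm-5:cloud",  # or "kimi-k2.5:cloud"
    messages=[{"role": "user", "content": "If 3x + 5 = 20, what is x?"}],
)
text = resp["message"]["content"]

# Reasoning models return their chain of thought inside <think>...</think>
# (newer clients may also expose it as a separate "thinking" field);
# everything after the closing tag is the final answer.
thinking, _, answer = text.partition("</think>")
print(thinking.removeprefix("<think>").strip())
print(answer.strip())
```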
The setup: Same 2,083 problems → both teachers → filter → train separate Qwen3.5-4B students → compare. Plus a third student trained on both trace sets combined. This tells us whether verbose reasoning (GLM-5) or concise reasoning (Kimi) transfers better to a small 4B model.
Evaluated on 4 clean benchmarks (zero overlap with training data) + 5 hand-written trick questions. All benchmarks verified for contamination before scoring.
Full benchmark results coming — using lm-evaluation-harness (the industry standard used by HuggingFace Open LLM Leaderboard) on Colab Pro H100. All 10 models evaluated with identical methodology.
| Benchmark | Shots | Method | Task |
|---|---|---|---|
| GSM8K | 8-shot CoT | Generative | gsm8k_cot |
| MATH | 4-shot | Generative (\boxed{}) | minerva_math |
| ARC-Challenge | 25-shot | Log-likelihood | arc_challenge |
| GPQA Diamond | 0-shot | Log-likelihood | gpqa_diamond |
| MMLU-Pro | 5-shot | Log-likelihood | mmlu_pro |
Preliminary GSM8K (zero-shot, Tinker eval): Distillation showed a +35 point lift (37% base → 72% Kimi distilled) and the distilled 4B outperformed a raw Qwen3-8B. Concise Kimi traces transferred better than verbose GLM-5 traces. Full results with proper lm-eval methodology pending.
Why lm-eval? We initially wrote custom extraction code and got garbage: 0% on MATH (extraction only checked \boxed{}), 100% on ARC (regex too generous). ARC/GPQA/MMLU-Pro use log-likelihood scoring, not generation — our approach was fundamentally wrong. Details in the Dev Log.
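For reference, a run through the harness's Python entry point looks roughly like this (a sketch: the checkpoint path is a placeholder, task names are taken from the table above, and the per-task shot counts are configured in the actual runs):

```python
# Sketch: score a merged checkpoint with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./qwen3.5-4b-kimi-distilled,dtype=bfloat16",  # placeholder path
    tasks=["gsm8k_cot", "minerva_math", "arc_challenge", "gpqa_diamond", "mmlu_pro"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```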
We also caught a contamination issue mid-project — 94% overlap on our first eval attempt. Details in the Dev Log.
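A cheap guard is to measure overlap between normalized training problems and eval questions before scoring anything. A sketch (the normalization and field handling are illustrative, not the repo's exact check):

```python
# Sketch: estimate how many eval questions also appear in the training data.
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation/whitespace so near-identical strings still match.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def overlap_rate(train_problems: list[str], eval_questions: list[str]) -> float:
    train_set = {normalize(p) for p in train_problems}
    hits = sum(1 for q in eval_questions if normalize(q) in train_set)
    return hits / max(len(eval_questions), 1)

# Anything far above ~0% means the benchmark is measuring memorization, not reasoning.
```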
| Step | What |
|---|---|
| 1. Collect | GSM8K, MATH, ARC, HumanEval — 2,083 problems |
| 2. Generate | Both teachers solve all problems with full <think> traces via Ollama cloud (4 parallel workers each) |
| 3. Filter | 8 quality gates: correctness, length bounds, coherence, repetition, structured reasoning |
| 4. Format | Chat format with <think>/<answer> tags, stratified train/val split |
| 5. SFT | LoRA fine-tune 3 models on Tinker: GLM-5 traces, Kimi traces, combined |
| 6. Benchmark | Eval on 4 clean benchmarks + 5 trick questions, 8 models compared |
| 7. GRPO | Reinforcement learning on all 3 SFT models — reward correct answers |
| 8. Final eval | Full comparison on Colab Pro (fast local GPU inference) |
| 9. Export | Merge LoRA → push to HuggingFace → GGUF for Ollama |
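The reward for step 7 is rule-based: a completion earns credit only if its final answer matches the reference. A minimal sketch, assuming completions keep the <answer> tags from the SFT format (the actual reward shaping may differ):

```python
# Sketch: binary GRPO reward based on the extracted final answer.
import re

def extract_answer(completion: str) -> str:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""

def reward(completion: str, expected: str) -> float:
    return 1.0 if extract_answer(completion) == expected.strip() else 0.0
```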
4,166 total traces generated (2,083 per teacher) using Ollama cloud with 4 parallel workers per model. ~7 hours each, running concurrently.
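Conceptually, the generation scripts just fan the problem list out across a small worker pool. A sketch (error handling and checkpointing omitted):

```python
# Sketch: generate traces with 4 concurrent workers against one cloud teacher.
from concurrent.futures import ThreadPoolExecutor
import ollama

def solve(problem: str) -> str:
    resp = ollama.chat(
        model="kimi-k2.5:cloud",
        messages=[{"role": "user", "content": problem}],
    )
    return resp["message"]["content"]

def generate_all(problems: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(solve, problems))
```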
| Gate | What it checks | GLM-5 dropped | Kimi dropped |
|---|---|---|---|
| 1. Non-empty | Thinking + response exist | 2 | 0 |
| 2. Language quality | No encoding artifacts | 0 | 0 |
| 3. Length bounds | 50-4000 thinking tokens | 101 | 5 |
| 4. Correctness | Answer matches expected | 160 | 235 |
| 5. Repetition | No degenerate loops | 0 | 0 |
| 6. Coherence | Thinking references problem | 22 | 25 |
| 7. Self-contradiction | Max 2 self-corrections | 0 | 0 |
| 8. Structured reasoning | Step indicators present | 51 | 0 |
- GLM-5: 1,744/2,083 kept (83.7%) — more accurate but very verbose
- Kimi: 1,802/2,083 kept (86.5%) — more errors but cleaner traces
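A few of the gates, sketched as simple predicates (the real thresholds and answer normalization live in scripts/filter_traces.py; these helpers are illustrative):

```python
# Sketch: three of the eight quality gates.
def passes_length(thinking: str, lo: int = 50, hi: int = 4000) -> bool:
    # Gate 3: thinking must fall within 50-4000 tokens (whitespace split as a proxy).
    return lo <= len(thinking.split()) <= hi

def passes_correctness(response: str, expected: str) -> bool:
    # Gate 4: the final answer must match the reference (normalization simplified here).
    return expected.strip().lower() in response.lower()

def passes_structure(thinking: str) -> bool:
    # Gate 8: the trace should show explicit step-by-step structure.
    indicators = ("step", "first", "then", "therefore", "so ")
    return any(tok in thinking.lower() for tok in indicators)
```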
| Dataset | Train | Val |
|---|---|---|
| GLM-5 traces | 1,572 | 172 |
| Kimi traces | 1,624 | 178 |
| Combined | 3,196 | 350 |
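For reference, each record in the SFT sets is a chat example with the teacher's reasoning and final answer kept inside tags, roughly like this (field names are illustrative; see scripts/format_for_sft.py and the dataset cards for the exact schema):

```python
# Sketch of one formatted training record (content abridged).
example = {
    "messages": [
        {"role": "user", "content": "If 3x + 5 = 20, what is x?"},
        {
            "role": "assistant",
            "content": "<think>Subtract 5: 3x = 15, so x = 5.</think><answer>5</answer>",
        },
    ]
}
```

The full pipeline that produces these records can be reproduced with the commands below.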
```bash
git clone https://github.com/brianmeyer/distillreasoning.git
cd distillreasoning
python3 -m venv venv && source venv/bin/activate
pip install ollama datasets huggingface_hub tinker tinker-cookbook
# 1. Download 2,083 problems
python scripts/download_problems.py
# 2. Generate reasoning traces (runs overnight, 4 parallel workers each)
python scripts/generate_traces.py # GLM-5
python scripts/generate_traces_kimi.py # Kimi K2.5
# 3. Filter + format
python scripts/filter_traces.py glm5
python scripts/filter_traces.py kimi
python scripts/format_for_sft.py glm5
python scripts/format_for_sft.py kimi
# 4. Upload to HuggingFace
HF_TOKEN=xxx python scripts/upload_dataset.py
# 5. Train on Tinker (3 models in parallel)
TINKER_API_KEY=xxx python scripts/train_tinker.py
# 6. GRPO + eval + merge + publish (Colab Pro H100)
# Open notebooks/distill_pipeline.ipynb
```

Or just use the pre-built datasets:

```python
from datasets import load_dataset
# GLM-5 traces
ds = load_dataset("bmeyer2025/glm5-reasoning-traces-sft")
# Kimi traces
ds = load_dataset("bmeyer2025/kimi-reasoning-traces-sft")distillreasoning/
├── scripts/
│ ├── download_problems.py # Pull GSM8K, MATH, ARC, HumanEval from HF
│ ├── generate_traces.py # GLM-5 → reasoning traces (4 parallel workers)
│ ├── generate_traces_kimi.py # Kimi K2.5 → reasoning traces
│ ├── filter_traces.py # 8-gate quality filter (accepts: glm5|kimi)
│ ├── format_for_sft.py # Stratified chat format (accepts: glm5|kimi)
│ ├── upload_dataset.py # Push 4 datasets to HuggingFace
│ └── train_tinker.py # LoRA SFT on Tinker (3 models)
├── notebooks/
│ └── distill_pipeline.ipynb # Colab Pro H100: GRPO + lm-eval + merge + publish
├── cards/
│ ├── dataset_card.md # HuggingFace dataset card
│ └── model_card.md # HuggingFace model card
├── images/ # Generated hero images (Gemini)
├── DEVLOG.md # Full build log — every step and mistake
└── README.md
```
| Dataset | Teacher | Format | Link |
|---|---|---|---|
| glm5-reasoning-traces | GLM-5 | Raw (problem, thinking, response) | HuggingFace |
| glm5-reasoning-traces-sft | GLM-5 | SFT-ready (train/val) | HuggingFace |
| kimi-reasoning-traces | Kimi K2.5 | Raw (problem, thinking, response) | HuggingFace |
| kimi-reasoning-traces-sft | Kimi K2.5 | SFT-ready (train/val) | HuggingFace |
- Concise teachers beat verbose ones for small students. Kimi's 325-token median traces produced a better 4B student than GLM-5's 433-token traces. A 4B model can't absorb thousands of tokens of verbose reasoning per problem — it overwhelms the model's capacity.
- Distillation can beat 2x model size. Our 4B distilled model (72.6% GSM8K) outperforms a raw Qwen3-8B (63.0%). Targeted training on reasoning traces is more effective than raw scale.
- Benchmark contamination is easy to miss. Our first eval showed 75-80% accuracy — actually 94% data overlap with training. Always verify eval data is clean before celebrating.
- The teacher's reasoning style transfers, not just answers. The distilled model reasons step-by-step, sets up variables, writes equations, and verifies answers — patterns learned from the teacher traces. The reasoning skill transferred, not just the format.
- More data isn't always better. Combined traces (3,196) scored 71.3% vs Kimi-only (1,624) at 72.6%. The verbose GLM-5 traces in the combined set may have added noise rather than signal.
(More findings after GRPO and final eval)
MIT
