Benchmark local LLMs by what actually matters.
BenchLoop is a local-first CLI + web app for benchmarking LLMs running on your own hardware or cloud providers. It scores models across seven repeatable suites — quality, speed, reliability, agentic tool use, coding, instruction following — and gives you receipts: per-task outputs, latency, token counts, machine info, scores.
No accounts, no telemetry. Local models need no API keys; cloud providers use standard OpenAI-compatible auth. Your model, your machine (or your provider), your numbers.
$ benchloop run --model qwen3:8b --suites speed,toolcall,agent
... 8 tasks, 4 tools, 6 turns avg, 74.6 tok/s ...
Overall 73.4 ████████░░
Quality 73.6 ████████░░
Speed 78.9 █████████░
Agent 96.9 █████████▌
Published runs live at https://bench-loop.com/leaderboard. Every completed local benchmark auto-publishes there.
Hosted LLM leaderboards answer "which model wins on a server farm someone else paid for?" BenchLoop answers "which model + harness + hardware combination actually works for me right now?" — the question you have when picking a local stack.
It is repeatable on purpose: every run persists to disk, the task set is frozen, the scorer is deterministic. If you say "qwen3:8b scored 89 on my 4090", anyone can install BenchLoop and verify it.
pipx install benchloop-cli
benchloop --versionThe PyPI distribution is named
benchloop-cli(the barebenchloopname was taken by an unrelated dataset library). The installed commands are stillbenchloopandbench-loop.
pip install benchloop-cligit clone https://github.com/outsourc-e/bench-loop
cd bench-loop
pip install -e .Make sure you have a local LLM endpoint running. Anything OpenAI-compatible or Ollama-flavored works:
- Ollama at
http://localhost:11434(default) - LM Studio at
http://localhost:1234(--provider openai_compat) - MLX / Osaurus at
http://localhost:8000(--provider openai_compat) - vLLM, Jan, llama-server, etc.
Then:
benchloop run \
--model qwen3:8b \
--endpoint http://localhost:11434 \
--provider ollamaThis runs every default suite, scores them, prints a console report, and persists the full run to ~/.bench-loop/runs/.
benchloop run --model qwen3:8b --suites speed,agentSame model, four ways to talk to it:
benchloop run --model qwen3:8b --harness raw # native tool calling
benchloop run --model qwen3:8b --harness hermes # <tool_call>{...}</tool_call>
benchloop run --model qwen3:8b --harness qwen # <function_call>{...}</function_call>
benchloop run --model qwen3:8b --harness pi # <think>...</think> + Hermes tagsbenchloop run \
--model qwen3:8b \
--endpoint http://localhost:11435 \
--hardware "NVIDIA RTX 4090 24GB" \
--gpu "NVIDIA RTX 4090" \
--gpu-memory-gb 24Works with any OpenAI-compatible endpoint — DashScope, OpenRouter, Together, OpenAI, vLLM with auth, sglang, etc.
# Via environment variable
export OPENAI_API_KEY="sk-..."
benchloop run \
--model qwen3.7-max \
--provider openai_compat \
--endpoint https://dashscope-intl.aliyuncs.com/compatible-mode \
--remote
# Or inline
benchloop run \
--model gpt-4o \
--provider openai_compat \
--endpoint https://api.openai.com/v1 \
--api-key sk-... \
--remoteThe --remote flag (auto-detected for non-localhost endpoints) switches to cloud-aware scoring:
- Speed uses streaming TTFT (time-to-first-token) + effective content tok/s
- Overall = 0.50·quality + 0.25·speed + 0.25·reliability (vs local's 0.55/0.20/0.25)
- Reasoning models: content tok/s excludes internal thinking tokens
Required for vLLM, sglang, and most cloud providers. Two ways to provide it:
# 1. Environment variable (recommended)
export OPENAI_API_KEY="your-key-here"
benchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000
# 2. CLI flag
benchloop run --model your-model --provider openai_compat --endpoint http://your-server:8000 --api-key your-key-hereThe CLI flag takes precedence over the env var. For Ollama and local providers without auth, neither is needed.
v0.2.0+ ships the full FastAPI + React dashboard inside the wheel. After pipx install benchloop-cli:
benchloop dashboard
# → open http://127.0.0.1:8877Need it to survive browser/terminal churn? Print a service template instead of keeping the dashboard tied to one shell:
benchloop dashboard --service-template launchd
benchloop dashboard --service-template systemd
benchloop dashboard --service-template windows-taskThis serves the Models, Benchmark, Leaderboard, Compare, and Chat tabs on a single port, with auto-discovered local providers (Ollama, LM Studio, MLX/Osaurus, vLLM, Jan).
For hot-reload development against a clone of bench-loop-web:
benchloop dashboard --dev| Suite | What it scores |
|---|---|
speed |
Latency, throughput, TTFT, generation tok/s across short/medium/long contexts |
toolcall |
Structured tool-call correctness across realistic tasks (weather, stocks, email, search) |
coding |
Executable Python tasks verified in a sandboxed subprocess (10s timeout) |
dataextract |
JSON / structured extraction from messy natural language |
instructfollow |
Constraint following, formatting, exactness |
reasonmath |
Small reasoning + math tasks with deterministic checks |
agent |
Multi-turn agentic tool use. BenchLoop drives a real loop: model emits a tool call, BenchLoop executes it locally, feeds the result back, model iterates until done. Scores correctness, efficiency, no-hallucination, required-tool coverage. |
Local: Overall = 0.55 · quality + 0.20 · speed + 0.25 · reliability
Cloud: Overall = 0.50 · quality + 0.25 · speed + 0.25 · reliability (with streaming speed data)
Overall = 0.65 · quality + 0.35 · reliability (no speed data)
- Quality = mean of non-speed suite scores (size-fair).
- Speed (local) =
12.54 · log2(tok/s) + 0.9, clamped to 0–100. - Speed (cloud) = 0.60 · TTFT_score + 0.40 · tok/s_score, where TTFT uses exponential decay (200ms→100, 2000ms→40) and tok/s uses a log curve calibrated for 20-150 tok/s.
- Reliability = pass rate across all tasks.
- Agent =
correct_final + efficient + no_hallucinated_tools + all_required_called, 25 pts each, averaged across tasks.
A FastAPI backend + React frontend bundle ships alongside the CLI for visualizing runs:
benchloop dashboard # starts the local web app on :5180Tabs: Models, Benchmark, Leaderboard, Compare runs, Chat, agent trace viewer.
Every completed benchmark auto-publishes to https://bench-loop.com/leaderboard via https://api.bench-loop.com/submit. Runs are deduped by (machine_id, run_id) so the same run from the same machine won't be double-counted.
Opt out:
export BENCHLOOP_NO_SUBMIT=1You can still manually export a snapshot for sharing / archiving:
benchloop export --output my-runs.jsonbench-loop/ ← this repo, the CLI + suites + scorers
bench_loop/
cli.py ← `benchloop` entrypoint
suites/ ← speed, toolcall, coding, agent, ...
harness.py ← raw / hermes / qwen / pi adapters
providers/ ← ollama, openai_compat
runner/orchestrator.py ← drives suites + harnesses
tasks/ ← frozen task YAML fixtures
bench-loop-web/ ← the web app (separate repo)
api/ ← FastAPI wrapper around bench_loop
ui/ ← local dashboard
site/ ← public bench-loop.com static site
BenchLoop is v0.2 beta. The benchmark surface, scoring, web app, agent loop, four harnesses, and cloud provider support all work end-to-end. Stuff still on the roadmap:
Streaming TTFT for OpenAI-compatible providers✅ (v0.2.3+ with--remote)- Bigger task fixtures (each suite is intentionally small and frozen for v1)
- Hosted submission flow for community runs
- Cloud-specific leaderboard on bench-loop.com (filter by local vs remote)
- More provider adapters (TGI, Bedrock, etc. if there's demand)
MIT. See LICENSE.
