A high-performance benchmarking tool for LLM inference endpoints, targeting 50k+ QPS. Part of MLCommons.
Requirements: Python 3.12+
```bash
git clone https://github.com/mlcommons/endpoints.git
cd endpoints
python3.12 -m venv venv && source venv/bin/activate
pip install .
```

```bash
# Test endpoint connectivity
inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B
```
```bash
# Run offline benchmark (max throughput)
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl
```
```bash
# Run online benchmark (sustained QPS)
inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100
```

```bash
# Start local echo server and run a benchmark against it
python -m inference_endpoint.testing.echo_server --port 8765 &
inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model test-model \
  --dataset tests/datasets/dummy_1k.jsonl
pkill -f echo_server
```

See the Local Testing Guide for more details.
```
Dataset Manager ──> Load Generator ──> Endpoint Client ──> External Endpoint
                          |
                          v
            Metrics Collector (EventRecorder + MetricsReporter)
```
| Component | Purpose |
|---|---|
| Load Generator | Central orchestrator: `BenchmarkSession` owns the lifecycle, `Scheduler` controls timing |
| Endpoint Client | Multi-process HTTP workers communicating via ZMQ IPC |
| Dataset Manager | Loads JSONL, HuggingFace, CSV, JSON, and Parquet datasets |
| Metrics | SQLite-backed event recording and aggregation (QPS, latency, TTFT, TPOT) |
| Config | Pydantic-based YAML schema; CLI auto-generated via `cyclopts` |
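The per-request metrics named in the table above are derived from request timestamps. A minimal sketch of how TTFT, TPOT, and QPS can be computed from recorded events (the field names here are illustrative, not the tool's actual SQLite schema):

```python
from dataclasses import dataclass


@dataclass
class RequestEvent:
    # Hypothetical per-request record; field names are illustrative only.
    start: float          # request send time (seconds)
    first_token: float    # first token arrival time (seconds)
    end: float            # last token arrival time (seconds)
    output_tokens: int    # number of generated tokens


def ttft(e: RequestEvent) -> float:
    """Time to first token."""
    return e.first_token - e.start


def tpot(e: RequestEvent) -> float:
    """Time per output token, excluding the first token."""
    return (e.end - e.first_token) / max(e.output_tokens - 1, 1)


def qps(events: list[RequestEvent]) -> float:
    """Completed requests per second over the measured window."""
    window = max(e.end for e in events) - min(e.start for e in events)
    return len(events) / window
```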
- Offline (`max_throughput`): burst all queries at once for peak-throughput measurement
- Online (`poisson`): fixed QPS with Poisson arrival distribution for latency profiling
- Concurrency: fixed concurrent request count
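A Poisson arrival process at a fixed QPS amounts to sampling exponentially distributed inter-arrival gaps. A small sketch of the idea (illustrative only, not the tool's actual scheduler):

```python
import random


def poisson_arrival_times(target_qps: float, n: int, seed: int = 0) -> list[float]:
    """Sample n arrival timestamps with Poisson-process spacing.

    Inter-arrival gaps are exponentially distributed with mean 1/target_qps,
    so the long-run arrival rate converges to target_qps.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(target_qps)
        times.append(t)
    return times
```

Because gaps are memoryless, requests naturally cluster and thin out, which stresses an endpoint's tail latency more realistically than evenly spaced arrivals.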
The hot path is optimized for minimal overhead:
- Multi-process workers with ZMQ IPC (not threads)
- `uvloop` + `eager_task_factory` for async performance
- `msgspec` for zero-copy serialization on the data path
- Custom HTTP connection pooling with the `httptools` parser
- CPU affinity support for performance tuning
Run accuracy evaluation with Pass@1 scoring using pre-defined benchmarks:
- GPQA (default: GPQA Diamond)
- AIME (default: AIME 2025)
- LiveCodeBench (default: lite, release_v6) — requires additional setup
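Pass@1 over n sampled completions is conventionally computed with the unbiased pass@k estimator of Chen et al. (2021); for k = 1 it reduces to the fraction of correct samples. A sketch (the tool's exact scoring code may differ):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations (c correct) passes.
    """
    if n - c < k:
        # Every possible draw of k samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```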
| Guide | Description |
|---|---|
| CLI Quick Reference | Command-line interface guide |
| CLI Design | CLI architecture and design decisions |
| Local Testing | Test with the echo server |
| Client Performance Tuning | Endpoint client optimization |
| Performance Architecture | Performance architecture deep dive |
| Development Guide | Development setup and workflow |
| CONTRIBUTING.md | How to contribute |
We welcome contributions from the community. See CONTRIBUTING.md for:
- Development setup and prerequisites
- Code style (ruff, mypy, conventional commits)
- Testing requirements (>90% coverage, pytest markers)
- Pull request process and review expectations
Issues are tracked on our project board. Look for issues labeled `good first issue` or `help wanted` to get started.
This project draws inspiration from:
- MLCommons Inference — MLPerf Inference benchmark suite
- AIPerf — AI model performance profiling
- SGLang GenAI-Bench — Token-level performance evaluation
- vLLM Benchmarks — Performance benchmarking for vLLM
Apache License 2.0 — see LICENSE for details.