
MLPerf Inference Endpoint Benchmarking System


A high-performance benchmarking tool for LLM inference endpoints, targeting 50k+ QPS. Part of MLCommons.

Quick Start

Requirements: Python 3.12+ (3.12 recommended)

git clone https://github.com/mlcommons/endpoints.git
cd endpoints
python3.12 -m venv venv && source venv/bin/activate
pip install .
# Test endpoint connectivity
inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput)
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS)
inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

Local Testing

# Start local echo server and run a benchmark against it
python -m inference_endpoint.testing.echo_server --port 8765 &
inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model test-model \
  --dataset tests/datasets/dummy_1k.jsonl
pkill -f echo_server
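For intuition, here is a minimal stand-in for an echo endpoint, built only on the standard library. It is an illustrative sketch, not the project's `inference_endpoint.testing.echo_server`: it accepts a POST and returns the request body wrapped in JSON, which is all a benchmark client needs to exercise its request path locally.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Echo the request body back as JSON -- a stand-in for a real endpoint."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps({"echo": body.decode("utf-8", "replace")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass


def start_echo_server(port: int = 0) -> HTTPServer:
    """Serve on a daemon thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Binding port 0 avoids collisions when several test runs share a machine; the real echo server takes an explicit `--port` as shown above.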

See Local Testing Guide for more details.

Architecture

Dataset Manager ──> Load Generator ──> Endpoint Client ──> External Endpoint
                         |
                    Metrics Collector (EventRecorder + MetricsReporter)
Component purposes:

  • Load Generator: Central orchestrator; BenchmarkSession owns the lifecycle, Scheduler controls timing
  • Endpoint Client: Multi-process HTTP workers communicating via ZMQ IPC
  • Dataset Manager: Loads JSONL, HuggingFace, CSV, JSON, and Parquet datasets
  • Metrics: SQLite-backed event recording and aggregation (QPS, latency, TTFT, TPOT)
  • Config: Pydantic-based YAML schema; CLI auto-generated via cyclopts
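To make the latency metrics concrete, here is a sketch of how TTFT and TPOT can be derived from per-request timestamps. The `RequestRecord` fields are hypothetical names for illustration, not the project's actual event schema:

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request timestamps (seconds) and token count."""
    sent_at: float          # request dispatched
    first_token_at: float   # first streamed token received
    done_at: float          # final token received
    output_tokens: int


def ttft(r: RequestRecord) -> float:
    """Time To First Token: delay from dispatch to the first streamed token."""
    return r.first_token_at - r.sent_at


def tpot(r: RequestRecord) -> float:
    """Time Per Output Token: mean gap between tokens after the first."""
    if r.output_tokens < 2:
        return 0.0
    return (r.done_at - r.first_token_at) / (r.output_tokens - 1)
```

For example, a request sent at t=0 whose first token arrives at 0.25 s and whose 21st and final token arrives at 2.25 s has a TTFT of 0.25 s and a TPOT of 0.1 s.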

Benchmark Modes

  • Offline (max_throughput): Burst all queries at once for peak throughput measurement
  • Online (poisson): Fixed QPS with Poisson arrival distribution for latency profiling
  • Concurrency: Fixed concurrent request count

Performance Design

The hot path is optimized for minimal overhead:

  • Multi-process workers with ZMQ IPC (not threads)
  • uvloop + eager_task_factory for async performance
  • msgspec for zero-copy serialization on the data path
  • Custom HTTP connection pooling with httptools parser
  • CPU affinity support for performance tuning
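As a small illustration of the eager-task point, Python 3.12 ships `asyncio.eager_task_factory`, which lets a coroutine that completes without suspending run synchronously inside `create_task`, skipping an event-loop hop. This sketch uses the stock event loop rather than uvloop and falls back to the default factory on older interpreters:

```python
import asyncio


async def fetch(i: int) -> int:
    # Completes without awaiting anything: with the eager task factory it
    # finishes synchronously inside create_task.
    return i * 2


async def main() -> list[int]:
    loop = asyncio.get_running_loop()
    # eager_task_factory is new in Python 3.12; on older interpreters keep
    # the default (lazy) factory -- results are identical, only timing differs.
    factory = getattr(asyncio, "eager_task_factory", None)
    if factory is not None:
        loop.set_task_factory(factory)
    tasks = [asyncio.create_task(fetch(i)) for i in range(5)]
    return list(await asyncio.gather(*tasks))
```

On a hot path issuing tens of thousands of tasks per second, avoiding the scheduling round trip for already-ready work is a measurable saving.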

Accuracy Evaluation

Run accuracy evaluation with Pass@1 scoring using predefined benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6) — requires additional setup
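For reference, Pass@1 over n attempts is simply the fraction answered correctly; it is the k=1 case of the standard unbiased pass@k estimator (the probability that at least one of k samples drawn from n attempts is correct). A minimal sketch, not necessarily the project's scoring code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given c correct out of n attempts.

    1 - C(n-c, k) / C(n, k): one minus the probability that all k
    sampled attempts are incorrect.
    """
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill a failing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to c/n, e.g. 2 correct out of 5 attempts gives Pass@1 = 0.4.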

Documentation

  • CLI Quick Reference: Command-line interface guide
  • CLI Design: CLI architecture and design decisions
  • Local Testing: Testing with the echo server
  • Client Performance Tuning: Endpoint client optimization
  • Performance Architecture: Performance architecture deep dive
  • Development Guide: Development setup and workflow
  • CONTRIBUTING.md: How to contribute

Contributing

We welcome contributions from the community. See CONTRIBUTING.md for:

  • Development setup and prerequisites
  • Code style (ruff, mypy, conventional commits)
  • Testing requirements (>90% coverage, pytest markers)
  • Pull request process and review expectations

Issues are tracked on our project board. Look for issues labeled "good first issue" or "help wanted" to get started.

Acknowledgements

This project draws inspiration from:

License

Apache License 2.0 — see LICENSE for details.
