
MLPerf Inference Endpoint Benchmarking System


A high-performance benchmarking tool for LLM inference endpoints, targeting 50k+ QPS. Part of MLCommons.

Quick Start

Requirements: Python 3.12+ (3.12 recommended)

git clone https://github.com/mlcommons/endpoints.git
cd endpoints
python3.12 -m venv venv && source venv/bin/activate
pip install .
# Test endpoint connectivity
inference-endpoint probe \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B

# Run offline benchmark (max throughput)
inference-endpoint benchmark offline \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl

# Run online benchmark (sustained QPS)
inference-endpoint benchmark online \
  --endpoints http://your-endpoint:8000 \
  --model Qwen/Qwen3-8B \
  --dataset tests/datasets/dummy_1k.jsonl \
  --load-pattern poisson \
  --target-qps 100

Local Testing

# Start local echo server and run a benchmark against it
python -m inference_endpoint.testing.echo_server --port 8765 &
inference-endpoint benchmark offline \
  --endpoints http://localhost:8765 \
  --model test-model \
  --dataset tests/datasets/dummy_1k.jsonl
pkill -f echo_server
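For intuition, here is a minimal stand-in for an echo endpoint, built only on the standard library. It is an illustrative sketch, not the project's `inference_endpoint.testing.echo_server`: it accepts a POST and returns the request body wrapped in JSON, which is all a benchmark client needs to exercise its request path locally.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Echo the request body back as JSON -- a stand-in for a real endpoint."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        payload = json.dumps({"echo": body.decode("utf-8", "replace")}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # silence per-request logging
        pass


def start_echo_server(port: int = 0) -> HTTPServer:
    """Serve on a daemon thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Binding port 0 avoids collisions when several test runs share a machine; the real echo server takes an explicit `--port` as shown above.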

See Local Testing Guide for more details.

Architecture

Dataset Manager ──> Load Generator ──> Endpoint Client ──> External Endpoint
                         |
                    Metrics Collector (EventRecorder + MetricsReporter)
Component purposes:

  • Load Generator: Central orchestrator; BenchmarkSession owns the lifecycle, Scheduler controls timing
  • Endpoint Client: Multi-process HTTP workers communicating via ZMQ IPC
  • Dataset Manager: Loads JSONL, HuggingFace, CSV, JSON, and Parquet datasets
  • Metrics: SQLite-backed event recording and aggregation (QPS, latency, TTFT, TPOT)
  • Config: Pydantic-based YAML schema; CLI auto-generated via cyclopts
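To make the latency metrics concrete, here is a sketch of how TTFT and TPOT can be derived from per-request timestamps. The `RequestRecord` fields are hypothetical names for illustration, not the project's actual event schema:

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request timestamps (seconds) and token count."""
    sent_at: float          # request dispatched
    first_token_at: float   # first streamed token received
    done_at: float          # final token received
    output_tokens: int


def ttft(r: RequestRecord) -> float:
    """Time To First Token: delay from dispatch to the first streamed token."""
    return r.first_token_at - r.sent_at


def tpot(r: RequestRecord) -> float:
    """Time Per Output Token: mean gap between tokens after the first."""
    if r.output_tokens < 2:
        return 0.0
    return (r.done_at - r.first_token_at) / (r.output_tokens - 1)
```

For example, a request sent at t=0 whose first token arrives at 0.25 s and whose 21st and final token arrives at 2.25 s has a TTFT of 0.25 s and a TPOT of 0.1 s.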

Benchmark Modes

  • Offline (max_throughput): Burst all queries at once for peak throughput measurement
  • Online (poisson): Fixed QPS with Poisson arrival distribution for latency profiling
  • Concurrency: Fixed concurrent request count

Performance Design

The hot path is optimized for minimal overhead:

  • Multi-process workers with ZMQ IPC (not threads)
  • uvloop + eager_task_factory for async performance
  • msgspec for zero-copy serialization on the data path
  • Custom HTTP connection pooling with httptools parser
  • CPU affinity support for performance tuning
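As a small illustration of the eager-task point, Python 3.12 ships `asyncio.eager_task_factory`, which lets a coroutine that completes without suspending run synchronously inside `create_task`, skipping an event-loop hop. This sketch uses the stock event loop rather than uvloop and falls back to the default factory on older interpreters:

```python
import asyncio


async def fetch(i: int) -> int:
    # Completes without awaiting anything: with the eager task factory it
    # finishes synchronously inside create_task.
    return i * 2


async def main() -> list[int]:
    loop = asyncio.get_running_loop()
    # eager_task_factory is new in Python 3.12; on older interpreters keep
    # the default (lazy) factory -- results are identical, only timing differs.
    factory = getattr(asyncio, "eager_task_factory", None)
    if factory is not None:
        loop.set_task_factory(factory)
    tasks = [asyncio.create_task(fetch(i)) for i in range(5)]
    return list(await asyncio.gather(*tasks))
```

On a hot path issuing tens of thousands of tasks per second, avoiding the scheduling round trip for already-ready work is a measurable saving.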

Accuracy Evaluation

Run accuracy evaluation with Pass@1 scoring using predefined benchmarks:

  • GPQA (default: GPQA Diamond)
  • AIME (default: AIME 2025)
  • LiveCodeBench (default: lite, release_v6) — requires additional setup
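For reference, Pass@1 over n attempts is simply the fraction answered correctly; it is the k=1 case of the standard unbiased pass@k estimator (the probability that at least one of k samples drawn from n attempts is correct). A minimal sketch, not necessarily the project's scoring code:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given c correct out of n attempts.

    1 - C(n-c, k) / C(n, k): one minus the probability that all k
    sampled attempts are incorrect.
    """
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill a failing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to c/n, e.g. 2 correct out of 5 attempts gives Pass@1 = 0.4.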

Documentation

  • CLI Quick Reference: Command-line interface guide
  • CLI Design: CLI architecture and design decisions
  • Local Testing: Testing with the echo server
  • Client Performance Tuning: Endpoint client optimization
  • Performance Architecture: Performance architecture deep dive
  • Development Guide: Development setup and workflow
  • CONTRIBUTING.md: How to contribute

Contributing

We welcome contributions from the community. See CONTRIBUTING.md for:

  • Development setup and prerequisites
  • Code style (ruff, mypy, conventional commits)
  • Testing requirements (>90% coverage, pytest markers)
  • Pull request process and review expectations

Issues are tracked on our project board. Look for issues labeled "good first issue" or "help wanted" to get started.

Acknowledgements

This project draws inspiration from:

License

Apache License 2.0 — see LICENSE for details.
