
Commit 1ac36c2

add docs/metrics.md, now metrics synced with doc

1 parent 000279a · commit 1ac36c2

36 files changed: +3,301 additions, −4,826 deletions

Makefile

Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ generate:
bench:
@echo "🆕 Starting La Perf benchmark"
@uv run python main.py
- @echo "✨ Done! Run 'make' to update results in README.md"
+ @echo "✨ Done! Run 'make generate' to update results in README.md"

# Run pre-commit hooks on all files
format:

@@ -22,7 +22,7 @@ format:
# Run linting only (ruff)
lint:
@echo "🔍 Running ruff linter..."
- @uv run ruff check src/ main.py
+ @uvx ruff check src/ main.py

# Clean Python cache files
clean:

README.md

Lines changed: 35 additions & 41 deletions

@@ -33,27 +33,28 @@ It’s designed for **AI/ML engineers** who prefer to run workloads locally —
## Overview
### Tasks
La Perf is a collection of reproducible tests and community-submitted results for:
- - #### 🧩 **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
+ - #### **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
  sts models:
  - [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)
  - [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base)
- - #### 💬 **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
+ - #### **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
  llm models:
  - **LM Studio**: [gpt-oss-20b](https://lmstudio.ai/models/openai/gpt-oss-20b)
    - *macOS*: `mlx-community/gpt-oss-20b-MXFP4-Q8` (MLX MXFP4-Q8)
    - *Other platforms*: `lmstudio-community/gpt-oss-20b-GGUF` (GGUF)
  - **Ollama**: [gpt-oss-20b](https://ollama.com/library/gpt-oss:20b)

- - #### 👁️ **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
+ - #### **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
  vlm models:
  - **LM Studio**: [Qwen3-VL-8B-Instruct](https://lmstudio.ai/models/qwen/qwen3-vl-8b)
-   - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 8-bit)
+   - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 4-bit)
    - *Other platforms*: `lmstudio-community/Qwen3-VL-8B-Instruct-GGUF-Q4_K_M` (Q4_K_M)
  - **Ollama**: [qwen3-vl:8b](https://ollama.com/library/qwen3-vl:8b)
    - **all platforms**: `qwen3-vl:8b` (Q4_K_M)
- - #### 🎨 **Diffusion image generation** — 📋 Planned
- - #### 🗣️ **Speech to Text** — 📋 Planned (whisper)
- - #### 🔬 **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, Catboost)
+ - #### **Diffusion image generation** — 📋 Planned
+ - #### **Speech to Text** — 📋 Planned (whisper)
+ - #### **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, Catboost)

**Note for Mac users**: If possible, prefer LM Studio with the `mlx` backend, which gives 10-20% more performance than `gguf`. If you also run Ollama (by default the benchmark runs both LM Studio and Ollama), you'll see the difference between the `mlx` and `gguf` formats.
@@ -86,13 +87,12 @@ NoBS was built to understand how different devices — from everyday laptops and
## Benchmark Results

- > **Last Updated**: 2025-11-05
+ > **Last Updated**: 2025-11-07

### 🏆 Overall Ranking

| Rank | Device | Platform | CPU | RAM | GPU | VRAM | Embeddings, sts (s) | LLM, lms (s) | LLM, ollama (s) | VLM, lms (s) | VLM, ollama (s) | Total Time (s) |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
- | 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 52.92 | 1.02 | 15.99 | 10.57 | 33.69 | **114.19** |
- | 🥈 2 | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 🐧 Linux | Intel(R) Core(TM) Ultra 9 185H (16) | 23 GB | NVIDIA GeForce RTX 4060 Laptop GPU | 8 GB | 19.99 | 7.60 | 30.22 | 25.58 | 127.01 | **210.40** |
+ | 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 53.76 | 1.28 | 4.64 | 11.24 | 33.09 | **104.01** |

*sts - sentence transformers*
@@ -106,21 +106,19 @@ NoBS was built to understand how different devices — from everyday laptops and
| Device | CPU Usage (p50/p95) | RAM Used (p50/p95) | GPU Usage (p50/p95) | GPU Temp (p50/p95) | Battery (start/end/Δ) | GPU Power (p50/p95) | CPU Power (p50/p95) |
|------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 27.1% / 29.9% | 10.6GB / 13.4GB | 12.0% / 35.0% | 65.0°C / 66.0°C | 72.0% / 100.0% / -28.0% | 18.1W / 41.9W | 18.1W / 41.9W |
- | Mac16,6 | 4.6% / 9.4% | 20.9GB / 22.4GB | 97.0% / 100.0% | N/A | 65% / 8% / +57.0% | 11.7W / 36.0W | 1.4W / 2.8W |
+ | Mac16,6 | 4.0% / 12.0% | 22.3GB / 23.9GB | 97.0% / 100.0% | N/A | 85% / 85% / +0.0% | 11.7W / 32.3W | 1.1W / 2.2W |

*p50 = median, p95 = 95th percentile*

### Embeddings

- #### Text Embeddings (100 IMDB samples)
+ #### Text Embeddings (3000 IMDB samples)

- | Device | Model | Rows/sec | Time (s) | Embedding Dim | Batch Size |
+ | Device | Model | RPS (mean ± std) | Time (s) (mean ± std) | Embedding Dim | Batch Size |
|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | nomic-ai/modernbert-embed-base | 150.06 ± 0.39 | 19.99 ± 0.05 | 768 | 32 |
- | Mac16,6 | nomic-ai/modernbert-embed-base | 56.69 ± 0.29 | 52.92 ± 0.27 | 768 | 32 |
+ | Mac16,6 | nomic-ai/modernbert-embed-base | 55.81 ± 0.75 | 53.76 ± 0.72 | 768 | 32 |

![Embeddings Performance Profile](results/plots/embeddings_performance.png)
@@ -129,22 +127,20 @@ NoBS was built to understand how different devices — from everyday laptops and
### LLMs

- #### LLM Inference (3 prompts from awesome-chatgpt-prompts)
+ #### LLM Inference (10 prompts from awesome-chatgpt-prompts)

**LM STUDIO**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | openai/gpt-oss-20b | 13.10 ± 0.94 | 3.64 ± 0.51 | 1.67 ± 0.09 | 7.60 ± 1.19 | 1728 | 3978 |
- | Mac16,6 | openai/gpt-oss-20b | 70.83 ± 1.61 | 0.75 ± 0.01 | 0.23 ± 0.00 | 1.02 ± 0.02 | 1728 | 3968 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | openai/gpt-oss-20b | 56.53 ± 1.65 | 77.21 ± 1.99 | 0.92 ± 0.02 | 1.23 ± 0.03 | 0.24 ± 0.00 | 17.09 ± 0.57 | 1.28 ± 0.04 | 18.28 ± 0.60 | 1728 | 3906 |

**OLLAMA**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | gpt-oss:20b | 13.11 ± 0.35 | 21.03 ± 0.97 | 2.47 ± 0.12 | 30.22 ± 1.68 | 1728 | 10036 |
- | Mac16,6 | gpt-oss:20b | 64.21 ± 0.20 | 8.83 ± 0.05 | 0.32 ± 0.00 | 15.99 ± 0.05 | 1728 | 12159 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | gpt-oss:20b | 61.03 ± 4.29 | 63.50 ± 6.07 | 4.18 ± 0.31 | 56.83 ± 0.82 | 0.46 ± 0.04 | 25.17 ± 0.33 | 4.64 ± 0.35 | 79.54 ± 0.91 | 1728 | 12939 |

![LLM TTFT vs Input Tokens](results/plots/llm_ttft_vs_input_tokens.png)
@@ -155,34 +151,32 @@ NoBS was built to understand how different devices — from everyday laptops and
*Generation time growth relative to output length. Lower values reflect faster completions.*

- ![LLM TTFT Performance](results/plots/llm_ttft.png)
+ ![LLM E2E Latency Performance](results/plots/llm_latency.png)

- *Time To First Token (TTFT) - Lower is better. Measures response latency.*
+ *End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*

![LLM Throughput Performance](results/plots/llm_tps.png)

- *Token Generation per second (TG) - Higher is better. Measures token generation.*
+ *Token Generation per second (TPS) - Higher is better. Measures token generation speed.*

### VLMs

- #### VLM Inference (3 questions from Hallucination_COCO)
+ #### VLM Inference (10 questions from Hallucination_COCO)

**LM STUDIO**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen/qwen3-vl-8b | 20.20 ± 0.06 | 0.79 ± 0.06 | 24.75 ± 0.07 | 25.58 ± 0.10 | 290 | 5128 |
- | Mac16,6 | qwen/qwen3-vl-8b | 54.27 ± 1.66 | 1.55 ± 0.06 | 9.04 ± 0.43 | 10.57 ± 0.45 | 310 | 6043 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | qwen/qwen3-vl-8b | 51.47 ± 1.30 | 53.62 ± 1.82 | 1.58 ± 0.01 | 1.77 ± 0.07 | 9.62 ± 0.48 | 13.42 ± 0.37 | 11.24 ± 0.48 | 15.06 ± 0.30 | 310 | 5966 |

**OLLAMA**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen3-vl:8b | 12.00 ± 0.19 | 64.86 ± 4.15 | 66.52 ± 0.54 | 127.01 ± 3.20 | 1814 | 14636 |
- | Mac16,6 | qwen3-vl:8b | 46.47 ± 0.52 | 16.86 ± 0.21 | 17.17 ± 0.17 | 33.69 ± 0.54 | 1814 | 15516 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | qwen3-vl:8b | 47.78 ± 4.93 | 49.61 ± 6.79 | 15.29 ± 1.24 | 27.64 ± 0.60 | 16.28 ± 0.91 | 19.59 ± 1.52 | 33.09 ± 3.44 | 44.33 ± 0.41 | 1814 | 15490 |

![VLM TTFT vs Input Tokens](results/plots/vlm_ttft_vs_input_tokens.png)
@@ -193,14 +187,14 @@ NoBS was built to understand how different devices — from everyday laptops and
*Generation time vs output token count for multimodal responses. Lower values are faster.*

- ![VLM TTFT Performance](results/plots/vlm_ttft.png)
+ ![VLM E2E Latency Performance](results/plots/vlm_latency.png)

- *Time To First Token (TTFT) - Lower is better. Measures response latency.*
+ *End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*

![VLM Throughput Performance](results/plots/vlm_tps.png)

- *Token Generation per second (TG) - Higher is better. Measures token generation.*
+ *Token Generation per second (TPS) - Higher is better. Measures token generation speed.*

---

docs/metrics.md

Lines changed: 139 additions & 0 deletions (new file)

# Metrics

This section describes how La Perf calculates and evaluates metrics across different benchmark tasks.

## Embeddings

### Overview
Embedding benchmarks use the `sentence-transformers` library for encoding operations.

| Metric | Description | Unit |
|--------|-------------|------|
| **E2E Latency** | Total time to encode the full dataset | seconds |
| **RPS** | Rows Per Second (throughput) | rows/s |

### Measurement Methodology
The total encoding latency is measured around the `.encode()` call, which internally handles batching.
Each run includes device synchronization before and after encoding to ensure accurate timing.

**Implementation details:**
- Uses `torch.cuda.synchronize()` for NVIDIA GPUs
- Uses `torch.mps.synchronize()` for Apple Silicon GPUs
- Ensures complete device-side execution before measurement
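For illustration, a minimal sketch of a synchronized timing wrapper along these lines (the helper names and the `device` argument are illustrative, not La Perf's actual implementation):

```python
import time

import torch
from sentence_transformers import SentenceTransformer


def sync(device: str) -> None:
    # Wait for all queued GPU work to finish so the timer captures
    # real device-side execution, not just kernel launches.
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


def timed_encode(model: SentenceTransformer, texts: list[str], device: str, batch_size: int = 32):
    sync(device)                                        # drain pending work before starting the clock
    t0 = time.perf_counter()
    embs = model.encode(texts, batch_size=batch_size)   # .encode() handles batching internally
    sync(device)                                        # ensure encoding has fully finished on the device
    elapsed = time.perf_counter() - t0
    return embs, elapsed, len(texts) / elapsed          # embeddings, E2E latency (s), RPS
```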
### Cross-Run Statistics
For multiple benchmark runs, a simple mean and standard deviation are calculated:
```python
mean(run1, run2, run3) ± std(run1, run2, run3)

# Example with RPS:
# final_mean_rps = mean([run1_rps, run2_rps, run3_rps])
# final_std_rps = std([run1_rps, run2_rps, run3_rps])
# In the results table you see: final_mean_rps ± final_std_rps
```

**Note:** Embeddings use direct mean/std across runs, not percentile-based statistics.
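Made concrete with NumPy (a sketch; the run values below are placeholders, not measured results):

```python
import numpy as np

# RPS measured in three independent benchmark runs (placeholder values).
run_rps = np.array([55.1, 56.4, 55.9])

final_mean_rps = run_rps.mean()
final_std_rps = run_rps.std()
print(f"{final_mean_rps:.2f} ± {final_std_rps:.2f}")  # the value shown in the results table
```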
---

## LLMs & VLMs

### Overview

| Metric | Description | Unit |
|--------|-------------|------|
| **TTFT** | Time To First Token — prompt processing latency | seconds |
| **TG** | Token Generation — time spent generating output | seconds |
| **TPS** | Tokens Per Second — generation throughput | tokens/s |
| **E2E Latency** | End-to-end request latency | seconds |

### Measurement Methodology

#### Streaming & Token Counting

La Perf uses streaming APIs (Ollama, LM Studio via the OpenAI SDK) to measure both latency and throughput.

**Critical distinction:** API chunks ≠ tokens

The server sends responses in chunks, but each chunk may contain multiple tokens. Token counts are obtained from server-side usage statistics.

#### Per-Request Measurements

For each prompt in the benchmark:

| Timestamp | Description |
|-----------|-------------|
| `t0_stream` | Request start time |
| `first_token_ts` | First chunk received (≈ first token) |
| `t1_stream` | Response complete |

| Token Count | Source |
|-------------|--------|
| `input_tokens` | From server usage stats |
| `output_tokens` | From server usage stats |

#### Metric Calculations

| Metric | Formula | Notes |
|--------|---------|-------|
| **E2E Latency** | `t1_stream - t0_stream` | Total request time |
| **TTFT** | `first_token_ts - t0_stream` | Prompt processing time |
| **TG** | `t1_stream - first_token_ts` | Generation phase time |
| **TPS** | `output_tokens / E2E Latency` | Client-side throughput metric |
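As a sketch of how these timestamps and token counts can be collected from a streaming, OpenAI-compatible endpoint (the base URL, model name, and availability of `stream_options={"include_usage": True}` are assumptions here, not La Perf's exact client code):

```python
import time

from openai import OpenAI

# LM Studio's local server exposes an OpenAI-compatible API; URL and key are placeholders.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")


def measure_request(prompt: str, model: str = "openai/gpt-oss-20b") -> dict:
    t0_stream = time.perf_counter()
    first_token_ts = None
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},   # ask the server to append usage statistics
    )
    for chunk in stream:
        if first_token_ts is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_ts = time.perf_counter()  # first content chunk ≈ first token
        if chunk.usage is not None:
            usage = chunk.usage                   # server-side token counts (final chunk)
    t1_stream = time.perf_counter()

    e2e = t1_stream - t0_stream
    return {
        "ttft": first_token_ts - t0_stream,
        "tg": t1_stream - first_token_ts,
        "e2e_latency": e2e,
        "tps": usage.completion_tokens / e2e,     # TPS = output_tokens / E2E Latency
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
    }
```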
#### Why TPS = output_tokens / E2E Latency?

**Incorrect approach:**
```python
TPS = output_tokens / TG  # ❌ WRONG
# Example: 38 tokens / 0.0007s = 52,285.714 tokens/sec
```
> Fifty-two thousand tokens per second? Goodbye H100, my local PC just destroyed you!

Yeah, no. This calculation is hilariously wrong.

This vastly overestimates performance because `TG` measures only the time between the first and last chunk, not the actual token generation time.

**Correct approach:**
```python
TPS = output_tokens / E2E Latency  # ✅ CORRECT
# Example: 38 tokens / 0.6668s = 56.988 tokens/sec
```

This reflects real-world throughput from the client's perspective.

**Limitation:** For very short outputs (1-2 chunks), `TG` may not accurately represent generation time. Server-side metrics would be more precise but are not currently collected.
---

### Per-Metric Percentiles
For each metric across all requests, La Perf computes:

| Percentile | Description |
|------------|-------------|
| **P25** | 25th percentile |
| **P50** | Median |
| **P75** | 75th percentile |
| **P95** | 95th percentile |
### Cross-Run Statistics
For multiple benchmark runs, statistics are calculated from the percentile values across runs:
```python
mean(run1_percentile, run2_percentile, run3_percentile) ± std(run1_percentile, run2_percentile, run3_percentile)

# Example with P50 TPS:
# final_p50_tps = mean([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# final_p50_tps_std = std([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# In the results table you see: final_p50_tps ± final_p50_tps_std
```

**Note:** LLM/VLM benchmarks compute percentiles per run first, then aggregate across runs. This differs from Embeddings, which use direct mean/std.

These aggregated values appear in the results tables.
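Putting the two steps together, a sketch with NumPy (the per-request TPS arrays are placeholder data, not measured results):

```python
import numpy as np

# Per-request TPS values from three independent runs (placeholder data).
runs = [
    np.array([55.2, 57.1, 54.8, 58.0, 56.3]),
    np.array([54.9, 56.5, 55.7, 57.2, 56.0]),
    np.array([55.8, 57.4, 55.1, 56.9, 56.6]),
]

# Step 1: percentiles within each run.
p50_per_run = [np.percentile(r, 50) for r in runs]
p95_per_run = [np.percentile(r, 95) for r in runs]

# Step 2: aggregate the per-run percentiles across runs.
final_p50_tps = np.mean(p50_per_run)
final_p50_tps_std = np.std(p50_per_run)
print(f"TPS P50: {final_p50_tps:.2f} ± {final_p50_tps_std:.2f}")  # as reported in the tables
```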
---
### Notes
- All timing values are wall-clock times measured via `time.perf_counter()`.
- Benchmarks are repeated at least 3 times to compute the mean and standard deviation.
- All metrics are device-synchronized and exclude warmup runs.

main.py

Lines changed: 0 additions & 4 deletions

@@ -1,7 +1,3 @@
- import os
-
- os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
from src.system_info.device_info import get_device_info
from src.system_info.power_metrics import PowerMonitor
from src.cli import display_device_info, display_final_summary, select_benchmarks
results/plots/llm_latency.png · results/plots/llm_tps.png · results/plots/llm_ttft.png

Binary plot files changed (image previews not shown).
