@@ -33,27 +33,28 @@ It’s designed for **AI/ML engineers** who prefer to run workloads locally —
 ## Overview
 ### Tasks
 La Perf is a collection of reproducible tests and community-submitted results for:
-- #### 🧩 **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
+- #### **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
   sts models:
   - [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)
   - [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base)
-- #### 💬 **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
+- #### **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
   llm models:
   - **LM Studio**: [gpt-oss-20b](https://lmstudio.ai/models/openai/gpt-oss-20b)
     - *macOS*: `mlx-community/gpt-oss-20b-MXFP4-Q8` (MLX MXFP4-Q8)
     - *Other platforms*: `lmstudio-community/gpt-oss-20b-GGUF` (GGUF)
   - **Ollama**: [gpt-oss-20b](https://ollama.com/library/gpt-oss:20b)
--
-- #### 👁️ **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
+
+
+- #### **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
   vlm models:
   - **LM Studio**: [Qwen3-VL-8B-Instruct](https://lmstudio.ai/models/qwen/qwen3-vl-8b)
-    - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 8-bit)
+    - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 4-bit)
     - *Other platforms*: `lmstudio-community/Qwen3-VL-8B-Instruct-GGUF-Q4_K_M` (Q4_K_M)
   - **Ollama**: [qwen3-vl:8b](https://ollama.com/library/qwen3-vl:8b)
     - **all platforms**: `qwen3-vl:8b` (Q4_K_M)
-- #### 🎨 **Diffusion image generation** — 📋 Planned
-- #### 🗣️ **Speach to Text** - 📋 Planned (whisper)
-- #### 🔬 **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, Catboost)
+- #### **Diffusion image generation** — 📋 Planned
+- #### **Speech to Text** — 📋 Planned (whisper)
+- #### **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, CatBoost)

 **Note for Mac users**: If possible, prefer LM Studio with the `mlx` backend, which gives 10-20% more performance than `gguf`. If you also run Ollama (by default the benchmark runs both LM Studio and Ollama), you'll see the difference between the `mlx` and `gguf` formats.

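For reference, a minimal sketch of how the embeddings task above can be timed end to end with sentence-transformers (the model, dataset, sample count, and batch size come from this README; the harness itself is an assumption, not La Perf's actual code):

```python
# Minimal sketch of the embeddings benchmark: encode 3000 IMDB reviews and
# report rows/sec. Illustrative only; not La Perf's actual harness.
import time

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

texts = load_dataset("stanfordnlp/imdb", split="train[:3000]")["text"]
model = SentenceTransformer("nomic-ai/modernbert-embed-base")

start = time.perf_counter()
embeddings = model.encode(texts, batch_size=32, show_progress_bar=False)
elapsed = time.perf_counter() - start

print(f"dim={embeddings.shape[1]}  rows/sec={len(texts) / elapsed:.2f}  time={elapsed:.2f}s")
```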
@@ -86,13 +87,12 @@ NoBS was built to understand how different devices — from everyday laptops and

 ## Benchmark Results

-> **Last Updated**: 2025-11-05
+> **Last Updated**: 2025-11-07
 ### 🏆 Overall Ranking

 | Rank | Device | Platform | CPU | RAM | GPU | VRAM | Embeddings, sts (s) | LLM, lms (s) | LLM, ollama (s) | VLM, lms (s) | VLM, ollama (s) | Total Time (s) |
 |------|------|------|------|------|------|------|------|------|------|------|------|------|
-| 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 52.92 | 1.02 | 15.99 | 10.57 | 33.69 | **114.19** |
-| 🥈 2 | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 🐧 Linux | Intel(R) Core(TM) Ultra 9 185H (16) | 23 GB | NVIDIA GeForce RTX 4060 Laptop GPU | 8 GB | 19.99 | 7.60 | 30.22 | 25.58 | 127.01 | **210.40** |
+| 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 53.76 | 1.28 | 4.64 | 11.24 | 33.09 | **104.01** |

 *sts - sentence transformers*

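As a quick consistency check on the ranking row above, Total Time appears to be the sum of the five per-task times, which in turn match the P50 latencies reported in the sections below: 53.76 + 1.28 + 4.64 + 11.24 + 33.09 = 104.01 s.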
@@ -106,21 +106,19 @@ NoBS was built to understand how different devices — from everyday laptops and

 | Device | CPU Usage (p50/p95) | RAM Used (p50/p95) | GPU Usage (p50/p95) | GPU Temp (p50/p95) | Battery (start/end/Δ) | GPU Power (p50/p95) | CPU Power (p50/p95) |
 |------|------|------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 27.1% / 29.9% | 10.6GB / 13.4GB | 12.0% / 35.0% | 65.0°C / 66.0°C | 72.0% / 100.0% / -28.0% | 18.1W / 41.9W | 18.1W / 41.9W |
-| Mac16,6 | 4.6% / 9.4% | 20.9GB / 22.4GB | 97.0% / 100.0% | N/A | 65% / 8% / +57.0% | 11.7W / 36.0W | 1.4W / 2.8W |
+| Mac16,6 | 4.0% / 12.0% | 22.3GB / 23.9GB | 97.0% / 100.0% | N/A | 85% / 85% / +0.0% | 11.7W / 32.3W | 1.1W / 2.2W |

 *p50 = median, p95 = 95th percentile*


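The p50/p95 columns above are percentiles over telemetry sampled while the tasks run. A minimal sketch of that idea, assuming psutil-based polling (La Perf's actual monitor and its GPU/temperature/power probes are platform-specific and not shown here):

```python
# Minimal sketch: poll CPU and RAM once per second during a run and report
# p50/p95, as in the resource-monitoring table. psutil is an assumption here;
# GPU utilisation, temperature, and power need platform-specific probes.
import numpy as np
import psutil

cpu_samples, ram_samples = [], []
for _ in range(60):                                       # ~60 s of 1 Hz sampling
    cpu_samples.append(psutil.cpu_percent(interval=1.0))
    ram_samples.append(psutil.virtual_memory().used / 1e9)  # GB

for name, samples in (("CPU %", cpu_samples), ("RAM GB", ram_samples)):
    p50, p95 = np.percentile(samples, [50, 95])
    print(f"{name}: p50={p50:.1f}  p95={p95:.1f}")
```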

 ### Embeddings

-#### Text Embeddings (100 IMDB samples)
+#### Text Embeddings (3000 IMDB samples)

-| Device | Model | Rows/sec | Time (s) | Embedding Dim | Batch Size |
+| Device | Model | RPS (mean ± std) | Time (s) (mean ± std) | Embedding Dim | Batch Size |
 |------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | nomic-ai/modernbert-embed-base | 150.06 ± 0.39 | 19.99 ± 0.05 | 768 | 32 |
-| Mac16,6 | nomic-ai/modernbert-embed-base | 56.69 ± 0.29 | 52.92 ± 0.27 | 768 | 32 |
+| Mac16,6 | nomic-ai/modernbert-embed-base | 55.81 ± 0.75 | 53.76 ± 0.72 | 768 | 32 |

 ![Embeddings Performance Profile](results/plots/embeddings_performance.png)

@@ -129,22 +127,20 @@ NoBS was built to understand how different devices — from everyday laptops and

 ### LLMs

-#### LLM Inference (3 prompts from awesome-chatgpt-prompts)
+#### LLM Inference (10 prompts from awesome-chatgpt-prompts)


 **LM STUDIO**

-| Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
-|------|------|------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | openai/gpt-oss-20b | 13.10 ± 0.94 | 3.64 ± 0.51 | 1.67 ± 0.09 | 7.60 ± 1.19 | 1728 | 3978 |
-| Mac16,6 | openai/gpt-oss-20b | 70.83 ± 1.61 | 0.75 ± 0.01 | 0.23 ± 0.00 | 1.02 ± 0.02 | 1728 | 3968 |
+| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+|------|------|------|------|------|------|------|------|------|------|------|------|
+| Mac16,6 | openai/gpt-oss-20b | 56.53 ± 1.65 | 77.21 ± 1.99 | 0.92 ± 0.02 | 1.23 ± 0.03 | 0.24 ± 0.00 | 17.09 ± 0.57 | 1.28 ± 0.04 | 18.28 ± 0.60 | 1728 | 3906 |

 **OLLAMA**

-| Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
-|------|------|------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | gpt-oss:20b | 13.11 ± 0.35 | 21.03 ± 0.97 | 2.47 ± 0.12 | 30.22 ± 1.68 | 1728 | 10036 |
-| Mac16,6 | gpt-oss:20b | 64.21 ± 0.20 | 8.83 ± 0.05 | 0.32 ± 0.00 | 15.99 ± 0.05 | 1728 | 12159 |
+| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+|------|------|------|------|------|------|------|------|------|------|------|------|
+| Mac16,6 | gpt-oss:20b | 61.03 ± 4.29 | 63.50 ± 6.07 | 4.18 ± 0.31 | 56.83 ± 0.82 | 0.46 ± 0.04 | 25.17 ± 0.33 | 4.64 ± 0.35 | 79.54 ± 0.91 | 1728 | 12939 |

 ![LLM TTFT vs Input Tokens](results/plots/llm_ttft_vs_input_tokens.png)

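TTFT, generation time, and end-to-end latency can be measured from a streamed request against a local OpenAI-compatible server (LM Studio's default endpoint is `http://localhost:1234/v1`). A hedged sketch, not La Perf's actual measurement code, with output tokens approximated by streamed chunks:

```python
# Hedged sketch: measure TTFT, generation time, and tokens/sec from a streamed
# chat completion against a local OpenAI-compatible server (LM Studio's default
# is http://localhost:1234/v1). Streamed chunks are only a rough proxy for
# output tokens; La Perf's exact metric definitions may differ.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

start = time.perf_counter()
ttft = None
out_chunks = 0
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain p50 vs p95 latency in two sentences."}],
    stream=True,
)
for chunk in stream:
    if ttft is None:
        ttft = time.perf_counter() - start          # time to first token
    if chunk.choices and chunk.choices[0].delta.content:
        out_chunks += 1

latency = time.perf_counter() - start               # end-to-end latency
tg = latency - ttft                                 # generation time after first token
print(f"TTFT={ttft:.2f}s  TG={tg:.2f}s  latency={latency:.2f}s  ~TPS={out_chunks / max(tg, 1e-6):.1f}")
```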
@@ -155,34 +151,32 @@ NoBS was built to understand how different devices — from everyday laptops and

 *Generation time growth relative to output length. Lower values reflect faster completions.*

-![LLM TTFT Performance](results/plots/llm_ttft.png)
+![LLM E2E Latency Performance](results/plots/llm_latency.png)

-*Time To First Token (TTFT) - Lower is better. Measures response latency.*
+*End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*


 ![LLM Throughput Performance](results/plots/llm_tps.png)

-*Token Generation per second (TG) - Higher is better. Measures token generation.*
+*Tokens per second (TPS) - Higher is better. Measures token generation speed.*


 ### VLMs

-#### VLM Inference (3 questions from Hallucination_COCO)
+#### VLM Inference (10 questions from Hallucination_COCO)


 **LM STUDIO**

-| Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
-|------|------|------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen/qwen3-vl-8b | 20.20 ± 0.06 | 0.79 ± 0.06 | 24.75 ± 0.07 | 25.58 ± 0.10 | 290 | 5128 |
-| Mac16,6 | qwen/qwen3-vl-8b | 54.27 ± 1.66 | 1.55 ± 0.06 | 9.04 ± 0.43 | 10.57 ± 0.45 | 310 | 6043 |
+| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+|------|------|------|------|------|------|------|------|------|------|------|------|
+| Mac16,6 | qwen/qwen3-vl-8b | 51.47 ± 1.30 | 53.62 ± 1.82 | 1.58 ± 0.01 | 1.77 ± 0.07 | 9.62 ± 0.48 | 13.42 ± 0.37 | 11.24 ± 0.48 | 15.06 ± 0.30 | 310 | 5966 |

 **OLLAMA**

-| Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
-|------|------|------|------|------|------|------|------|
-| ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen3-vl:8b | 12.00 ± 0.19 | 64.86 ± 4.15 | 66.52 ± 0.54 | 127.01 ± 3.20 | 1814 | 14636 |
-| Mac16,6 | qwen3-vl:8b | 46.47 ± 0.52 | 16.86 ± 0.21 | 17.17 ± 0.17 | 33.69 ± 0.54 | 1814 | 15516 |
+| Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+|------|------|------|------|------|------|------|------|------|------|------|------|
+| Mac16,6 | qwen3-vl:8b | 47.78 ± 4.93 | 49.61 ± 6.79 | 15.29 ± 1.24 | 27.64 ± 0.60 | 16.28 ± 0.91 | 19.59 ± 1.52 | 33.09 ± 3.44 | 44.33 ± 0.41 | 1814 | 15490 |

 ![VLM TTFT vs Input Tokens](results/plots/vlm_ttft_vs_input_tokens.png)

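The VLM runs differ from the LLM runs mainly in that each request carries an image alongside the question. A hedged sketch of a single timed VLM request against LM Studio's OpenAI-compatible endpoint (the file name and question are placeholders, not items from Hallucination_COCO):

```python
# Hedged sketch: time one VLM request by sending an image as a base64 data URL
# to LM Studio's OpenAI-compatible endpoint. The image path and question are
# placeholders; this is not La Perf's actual code.
import base64
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

with open("sample_image.jpg", "rb") as f:              # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
response = client.chat.completions.create(
    model="qwen/qwen3-vl-8b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a dog in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
elapsed = time.perf_counter() - start
answer = response.choices[0].message.content or ""
print(f"latency={elapsed:.2f}s  answer={answer[:80]!r}")
```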
@@ -193,14 +187,14 @@ NoBS was built to understand how different devices — from everyday laptops and

 *Generation time vs output token count for multimodal responses. Lower values are faster.*

-![VLM TTFT Performance](results/plots/vlm_ttft.png)
+![VLM E2E Latency Performance](results/plots/vlm_latency.png)

-*Time To First Token (TTFT) - Lower is better. Measures response latency.*
+*End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*


 ![VLM Throughput Performance](results/plots/vlm_tps.png)

-*Token Generation per second (TG) - Higher is better. Measures token generation.*
+*Tokens per second (TPS) - Higher is better. Measures token generation speed.*


 ---