
Commit 1ac36c2

add docs/metrics.md, now metrics synced with doc

1 parent 000279a · commit 1ac36c2

36 files changed: +3,301 additions, −4,826 deletions

Makefile

Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ generate:
bench:
@echo "🆕 Starting La Perf benchmark"
@uv run python main.py
- @echo "✨ Done! Run 'make' to update results in README.md"
+ @echo "✨ Done! Run 'make generate' to update results in README.md"

# Run pre-commit hooks on all files
format:

@@ -22,7 +22,7 @@ format:
# Run linting only (ruff)
lint:
@echo "🔍 Running ruff linter..."
- @uv run ruff check src/ main.py
+ @uvx ruff check src/ main.py

# Clean Python cache files
clean:

README.md

Lines changed: 35 additions & 41 deletions

@@ -33,27 +33,28 @@ It’s designed for **AI/ML engineers** who prefer to run workloads locally —
## Overview
### Tasks
La Perf is a collection of reproducible tests and community-submitted results for:
- - #### 🧩 **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
+ - #### **Embeddings** — ✅ Ready (sentence-transformers, [IMDB dataset](https://huggingface.co/datasets/stanfordnlp/imdb))
  sts models:
  - [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)
  - [modernbert-embed-base](https://huggingface.co/nomic-ai/modernbert-embed-base)
- - #### 💬 **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
+ - #### **LLM inference** — ✅ Ready (LM Studio and Ollama, [Awesome Prompts dataset](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts))
  llm models:
  - **LM Studio**: [gpt-oss-20b](https://lmstudio.ai/models/openai/gpt-oss-20b)
    - *macOS*: `mlx-community/gpt-oss-20b-MXFP4-Q8` (MLX MXFP4-Q8)
    - *Other platforms*: `lmstudio-community/gpt-oss-20b-GGUF` (GGUF)
  - **Ollama**: [gpt-oss-20b](https://ollama.com/library/gpt-oss:20b)

- - #### 👁️ **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
+ - #### **VLM inference** — ✅ Ready (LM Studio and Ollama, [Hallucination_COCO dataset](https://huggingface.co/datasets/DogNeverSleep/Hallucination_COCO))
  vlm models:
  - **LM Studio**: [Qwen3-VL-8B-Instruct](https://lmstudio.ai/models/qwen/qwen3-vl-8b)
-   - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 8-bit)
+   - *macOS*: `lmstudio-community/Qwen3-VL-8B-Instruct-MLX-4bit` (MLX 4-bit)
    - *Other platforms*: `lmstudio-community/Qwen3-VL-8B-Instruct-GGUF-Q4_K_M` (Q4_K_M)
  - **Ollama**: [qwen3-vl:8b](https://ollama.com/library/qwen3-vl:8b)
    - **all platforms**: `qwen3-vl:8b` (Q4_K_M)
- - #### 🎨 **Diffusion image generation** — 📋 Planned
- - #### 🗣️ **Speech to Text** — 📋 Planned (whisper)
- - #### 🔬 **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, Catboost)
+ - #### **Diffusion image generation** — 📋 Planned
+ - #### **Speech to Text** — 📋 Planned (whisper)
+ - #### **Classic ML** — 📋 Planned (scikit-learn, XGBoost, LightGBM, Catboost)

**Note for Mac users**: If possible, prefer LM Studio with the `mlx` backend, which gives 10-20% more performance than `gguf`. If you also run Ollama (by default the benchmark runs both LM Studio and Ollama), you'll see the difference between the `mlx` and `gguf` formats.
@@ -86,13 +87,12 @@ NoBS was built to understand how different devices — from everyday laptops and
## Benchmark Results

- > **Last Updated**: 2025-11-05
+ > **Last Updated**: 2025-11-07

### 🏆 Overall Ranking

| Rank | Device | Platform | CPU | RAM | GPU | VRAM | Embeddings, sts (s) | LLM, lms (s) | LLM, ollama (s) | VLM, lms (s) | VLM, ollama (s) | Total Time (s) |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
- | 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 52.92 | 1.02 | 15.99 | 10.57 | 33.69 | **114.19** |
- | 🥈 2 | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 🐧 Linux | Intel(R) Core(TM) Ultra 9 185H (16) | 23 GB | NVIDIA GeForce RTX 4060 Laptop GPU | 8 GB | 19.99 | 7.60 | 30.22 | 25.58 | 127.01 | **210.40** |
+ | 🥇 1 | Mac16,6 | 🍏 macOS | Apple M4 Max (14) | 36 GB | Apple M4 Max (32 cores) | shared with system RAM | 53.76 | 1.28 | 4.64 | 11.24 | 33.09 | **104.01** |

*sts - sentence transformers*
@@ -106,21 +106,19 @@ NoBS was built to understand how different devices — from everyday laptops and
| Device | CPU Usage (p50/p95) | RAM Used (p50/p95) | GPU Usage (p50/p95) | GPU Temp (p50/p95) | Battery (start/end/Δ) | GPU Power (p50/p95) | CPU Power (p50/p95) |
|------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | 27.1% / 29.9% | 10.6GB / 13.4GB | 12.0% / 35.0% | 65.0°C / 66.0°C | 72.0% / 100.0% / -28.0% | 18.1W / 41.9W | 18.1W / 41.9W |
- | Mac16,6 | 4.6% / 9.4% | 20.9GB / 22.4GB | 97.0% / 100.0% | N/A | 65% / 8% / +57.0% | 11.7W / 36.0W | 1.4W / 2.8W |
+ | Mac16,6 | 4.0% / 12.0% | 22.3GB / 23.9GB | 97.0% / 100.0% | N/A | 85% / 85% / +0.0% | 11.7W / 32.3W | 1.1W / 2.2W |

*p50 = median, p95 = 95th percentile*

### Embeddings

- #### Text Embeddings (100 IMDB samples)
+ #### Text Embeddings (3000 IMDB samples)

- | Device | Model | Rows/sec | Time (s) | Embedding Dim | Batch Size |
+ | Device | Model | RPS (mean ± std) | Time (s) (mean ± std) | Embedding Dim | Batch Size |
|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | nomic-ai/modernbert-embed-base | 150.06 ± 0.39 | 19.99 ± 0.05 | 768 | 32 |
- | Mac16,6 | nomic-ai/modernbert-embed-base | 56.69 ± 0.29 | 52.92 ± 0.27 | 768 | 32 |
+ | Mac16,6 | nomic-ai/modernbert-embed-base | 55.81 ± 0.75 | 53.76 ± 0.72 | 768 | 32 |

![Embeddings Performance Profile](results/plots/embeddings_performance.png)
@@ -129,22 +127,20 @@ NoBS was built to understand how different devices — from everyday laptops and
### LLMs

- #### LLM Inference (3 prompts from awesome-chatgpt-prompts)
+ #### LLM Inference (10 prompts from awesome-chatgpt-prompts)

**LM STUDIO**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | openai/gpt-oss-20b | 13.10 ± 0.94 | 3.64 ± 0.51 | 1.67 ± 0.09 | 7.60 ± 1.19 | 1728 | 3978 |
- | Mac16,6 | openai/gpt-oss-20b | 70.83 ± 1.61 | 0.75 ± 0.01 | 0.23 ± 0.00 | 1.02 ± 0.02 | 1728 | 3968 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | openai/gpt-oss-20b | 56.53 ± 1.65 | 77.21 ± 1.99 | 0.92 ± 0.02 | 1.23 ± 0.03 | 0.24 ± 0.00 | 17.09 ± 0.57 | 1.28 ± 0.04 | 18.28 ± 0.60 | 1728 | 3906 |

**OLLAMA**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | gpt-oss:20b | 13.11 ± 0.35 | 21.03 ± 0.97 | 2.47 ± 0.12 | 30.22 ± 1.68 | 1728 | 10036 |
- | Mac16,6 | gpt-oss:20b | 64.21 ± 0.20 | 8.83 ± 0.05 | 0.32 ± 0.00 | 15.99 ± 0.05 | 1728 | 12159 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | gpt-oss:20b | 61.03 ± 4.29 | 63.50 ± 6.07 | 4.18 ± 0.31 | 56.83 ± 0.82 | 0.46 ± 0.04 | 25.17 ± 0.33 | 4.64 ± 0.35 | 79.54 ± 0.91 | 1728 | 12939 |

![LLM TTFT vs Input Tokens](results/plots/llm_ttft_vs_input_tokens.png)
@@ -155,34 +151,32 @@ NoBS was built to understand how different devices — from everyday laptops and
*Generation time growth relative to output length. Lower values reflect faster completions.*

- ![LLM TTFT Performance](results/plots/llm_ttft.png)
+ ![LLM E2E Latency Performance](results/plots/llm_latency.png)

- *Time To First Token (TTFT) - Lower is better. Measures response latency.*
+ *End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*

![LLM Throughput Performance](results/plots/llm_tps.png)

- *Token Generation per second (TG) - Higher is better. Measures token generation.*
+ *Token Generation per second (TPS) - Higher is better. Measures token generation speed.*

### VLMs

- #### VLM Inference (3 questions from Hallucination_COCO)
+ #### VLM Inference (10 questions from Hallucination_COCO)

**LM STUDIO**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen/qwen3-vl-8b | 20.20 ± 0.06 | 0.79 ± 0.06 | 24.75 ± 0.07 | 25.58 ± 0.10 | 290 | 5128 |
- | Mac16,6 | qwen/qwen3-vl-8b | 54.27 ± 1.66 | 1.55 ± 0.06 | 9.04 ± 0.43 | 10.57 ± 0.45 | 310 | 6043 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | qwen/qwen3-vl-8b | 51.47 ± 1.30 | 53.62 ± 1.82 | 1.58 ± 0.01 | 1.77 ± 0.07 | 9.62 ± 0.48 | 13.42 ± 0.37 | 11.24 ± 0.48 | 15.06 ± 0.30 | 310 | 5966 |

**OLLAMA**

- | Device | Model | E2E TPS | TTFT (s) | TG (s) | E2E Latency (s) | Input Tokens | Output Tokens |
- |------|------|------|------|------|------|------|------|
- | ASUSTeK COMPUTER INC. ASUS Vivobook Pro 15 N6506MV_N6506MV 1.0 | qwen3-vl:8b | 12.00 ± 0.19 | 64.86 ± 4.15 | 66.52 ± 0.54 | 127.01 ± 3.20 | 1814 | 14636 |
- | Mac16,6 | qwen3-vl:8b | 46.47 ± 0.52 | 16.86 ± 0.21 | 17.17 ± 0.17 | 33.69 ± 0.54 | 1814 | 15516 |
+ | Device | Model | TPS P50 | TPS P95 | TTFT P50 (s) | TTFT P95 (s) | TG P50 (s) | TG P95 (s) | Latency P50 (s) | Latency P95 (s) | Input Tokens | Output Tokens |
+ |------|------|------|------|------|------|------|------|------|------|------|------|
+ | Mac16,6 | qwen3-vl:8b | 47.78 ± 4.93 | 49.61 ± 6.79 | 15.29 ± 1.24 | 27.64 ± 0.60 | 16.28 ± 0.91 | 19.59 ± 1.52 | 33.09 ± 3.44 | 44.33 ± 0.41 | 1814 | 15490 |

![VLM TTFT vs Input Tokens](results/plots/vlm_ttft_vs_input_tokens.png)
@@ -193,14 +187,14 @@ NoBS was built to understand how different devices — from everyday laptops and
*Generation time vs output token count for multimodal responses. Lower values are faster.*

- ![VLM TTFT Performance](results/plots/vlm_ttft.png)
+ ![VLM E2E Latency Performance](results/plots/vlm_latency.png)

- *Time To First Token (TTFT) - Lower is better. Measures response latency.*
+ *End-to-End Latency P50 - Lower is better. Measures full request-to-response time.*

![VLM Throughput Performance](results/plots/vlm_tps.png)

- *Token Generation per second (TG) - Higher is better. Measures token generation.*
+ *Token Generation per second (TPS) - Higher is better. Measures token generation speed.*

---

docs/metrics.md

Lines changed: 139 additions & 0 deletions (new file)

# Metrics

This section describes how La Perf calculates and evaluates metrics across different benchmark tasks.

## Embeddings

### Overview
Embedding benchmarks use the `sentence-transformers` library for encoding operations.

| Metric | Description | Unit |
|--------|-------------|------|
| **E2E Latency** | Total time to encode the full dataset | seconds |
| **RPS** | Rows Per Second (throughput) | rows/s |

### Measurement Methodology
The total encoding latency is measured around the `.encode()` call, which internally handles batching.
Each run includes device synchronization before and after encoding to ensure accurate timing.

**Implementation details:**
- Uses `torch.cuda.synchronize()` for NVIDIA GPUs
- Uses `torch.mps.synchronize()` for Apple Silicon GPUs
- Ensures complete device-side execution before measurement
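For illustration, a minimal sketch of a synchronized timing wrapper along these lines (the helper names and the `device` argument are illustrative, not La Perf's actual implementation):

```python
import time

import torch
from sentence_transformers import SentenceTransformer


def sync(device: str) -> None:
    # Wait for all queued GPU work to finish so the timer captures
    # real device-side execution, not just kernel launches.
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()


def timed_encode(model: SentenceTransformer, texts: list[str], device: str, batch_size: int = 32):
    sync(device)                                        # drain pending work before starting the clock
    t0 = time.perf_counter()
    embs = model.encode(texts, batch_size=batch_size)   # .encode() handles batching internally
    sync(device)                                        # ensure encoding has fully finished on the device
    elapsed = time.perf_counter() - t0
    return embs, elapsed, len(texts) / elapsed          # embeddings, E2E latency (s), RPS
```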
### Cross-Run Statistics
For multiple benchmark runs, a simple mean and standard deviation are calculated:
```python
mean(run1, run2, run3) ± std(run1, run2, run3)

# Example with RPS:
# final_mean_rps = mean([run1_rps, run2_rps, run3_rps])
# final_std_rps = std([run1_rps, run2_rps, run3_rps])
# In the results table you see: final_mean_rps ± final_std_rps
```

**Note:** Embeddings use direct mean/std across runs, not percentile-based statistics.
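Made concrete with NumPy (a sketch; the run values below are placeholders, not measured results):

```python
import numpy as np

# RPS measured in three independent benchmark runs (placeholder values).
run_rps = np.array([55.1, 56.4, 55.9])

final_mean_rps = run_rps.mean()
final_std_rps = run_rps.std()
print(f"{final_mean_rps:.2f} ± {final_std_rps:.2f}")  # the value shown in the results table
```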
---

## LLMs & VLMs

### Overview

| Metric | Description | Unit |
|--------|-------------|------|
| **TTFT** | Time To First Token — prompt processing latency | seconds |
| **TG** | Token Generation — time spent generating output | seconds |
| **TPS** | Tokens Per Second — generation throughput | tokens/s |
| **E2E Latency** | End-to-end request latency | seconds |

### Measurement Methodology

#### Streaming & Token Counting

La Perf uses streaming APIs (Ollama, LM Studio via the OpenAI SDK) to measure both latency and throughput.

**Critical distinction:** API chunks ≠ tokens

The server sends responses in chunks, but each chunk may contain multiple tokens. Token counts are obtained from server-side usage statistics.

#### Per-Request Measurements

For each prompt in the benchmark:

| Timestamp | Description |
|-----------|-------------|
| `t0_stream` | Request start time |
| `first_token_ts` | First chunk received (≈ first token) |
| `t1_stream` | Response complete |

| Token Count | Source |
|-------------|--------|
| `input_tokens` | From server usage stats |
| `output_tokens` | From server usage stats |

#### Metric Calculations

| Metric | Formula | Notes |
|--------|---------|-------|
| **E2E Latency** | `t1_stream - t0_stream` | Total request time |
| **TTFT** | `first_token_ts - t0_stream` | Prompt processing time |
| **TG** | `t1_stream - first_token_ts` | Generation phase time |
| **TPS** | `output_tokens / E2E Latency` | Client-side throughput metric |
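As a sketch of how these timestamps and token counts can be collected from a streaming, OpenAI-compatible endpoint (the base URL, model name, and availability of `stream_options={"include_usage": True}` are assumptions here, not La Perf's exact client code):

```python
import time

from openai import OpenAI

# LM Studio's local server exposes an OpenAI-compatible API; URL and key are placeholders.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")


def measure_request(prompt: str, model: str = "openai/gpt-oss-20b") -> dict:
    t0_stream = time.perf_counter()
    first_token_ts = None
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},   # ask the server to append usage statistics
    )
    for chunk in stream:
        if first_token_ts is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_ts = time.perf_counter()  # first content chunk ≈ first token
        if chunk.usage is not None:
            usage = chunk.usage                   # server-side token counts (final chunk)
    t1_stream = time.perf_counter()

    e2e = t1_stream - t0_stream
    return {
        "ttft": first_token_ts - t0_stream,
        "tg": t1_stream - first_token_ts,
        "e2e_latency": e2e,
        "tps": usage.completion_tokens / e2e,     # TPS = output_tokens / E2E Latency
        "input_tokens": usage.prompt_tokens,
        "output_tokens": usage.completion_tokens,
    }
```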
#### Why TPS = output_tokens / E2E Latency?

**Incorrect approach:**
```python
TPS = output_tokens / TG  # ❌ WRONG
# Example: 38 tokens / 0.0007s = 52,285.714 tokens/sec
```
> Fifty-two thousand tokens per second? Goodbye H100, my local PC just destroyed you!

Yeah, no. This calculation is hilariously wrong.

This vastly overestimates performance because `TG` measures only the time between the first and last chunk, not the actual token generation time.

**Correct approach:**
```python
TPS = output_tokens / E2E Latency  # ✅ CORRECT
# Example: 38 tokens / 0.6668s = 56.988 tokens/sec
```

This reflects real-world throughput from the client's perspective.

**Limitation:** For very short outputs (1-2 chunks), `TG` may not accurately represent generation time. Server-side metrics would be more precise but are not currently collected.
---

### Per-Metric Percentiles
For each metric across all requests, La Perf computes:

| Percentile | Description |
|------------|-------------|
| **P25** | 25th percentile |
| **P50** | Median |
| **P75** | 75th percentile |
| **P95** | 95th percentile |
### Cross-Run Statistics
For multiple benchmark runs, statistics are calculated from the percentile values across runs:
```python
mean(run1_percentile, run2_percentile, run3_percentile) ± std(run1_percentile, run2_percentile, run3_percentile)

# Example with P50 TPS:
# final_p50_tps = mean([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# final_p50_tps_std = std([run1_p50_tps, run2_p50_tps, run3_p50_tps])
# In the results table you see: final_p50_tps ± final_p50_tps_std
```

**Note:** LLM/VLM benchmarks compute percentiles per run first, then aggregate across runs. This differs from Embeddings, which use direct mean/std.

These aggregated values appear in the results tables.
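Putting the two steps together, a sketch with NumPy (the per-request TPS arrays are placeholder data, not measured results):

```python
import numpy as np

# Per-request TPS values from three independent runs (placeholder data).
runs = [
    np.array([55.2, 57.1, 54.8, 58.0, 56.3]),
    np.array([54.9, 56.5, 55.7, 57.2, 56.0]),
    np.array([55.8, 57.4, 55.1, 56.9, 56.6]),
]

# Step 1: percentiles within each run.
p50_per_run = [np.percentile(r, 50) for r in runs]
p95_per_run = [np.percentile(r, 95) for r in runs]

# Step 2: aggregate the per-run percentiles across runs.
final_p50_tps = np.mean(p50_per_run)
final_p50_tps_std = np.std(p50_per_run)
print(f"TPS P50: {final_p50_tps:.2f} ± {final_p50_tps_std:.2f}")  # as reported in the tables
```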
---
### Notes
- All timing values are wall-clock times measured via `time.perf_counter()`.
- Benchmarks are repeated at least 3 times to compute the mean and standard deviation.
- All metrics are device-synchronized and exclude warmup runs.

main.py

Lines changed: 0 additions & 4 deletions

@@ -1,7 +1,3 @@
- import os
-
- os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
from src.system_info.device_info import get_device_info
from src.system_info.power_metrics import PowerMonitor
from src.cli import display_device_info, display_final_summary, select_benchmarks
results/plots/llm_latency.png · results/plots/llm_tps.png · results/plots/llm_ttft.png

Binary plot files changed (image previews not shown).
