Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations#1464

Closed
tolgaki wants to merge 3 commits into Netflix:master from tolgaki:master

Conversation


@tolgaki tolgaki commented Feb 15, 2026

  • AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
  • AVX2 SAD for motion feature
  • Thread pool job object pool (free list + 64-byte inline data buffer)
  • Thread pool thundering herd fix (signal vs broadcast)
  • Feature collector initial capacity 8 -> 512
  • integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
  • integer_vif.c: Remove unnecessary epsilon; cache g*g
  • predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
  • convolution.c: Hoist stride multiplication out of inner loops
  • Comprehensive test suite (11 tests covering all optimized paths)
  • All 18 meson tests pass

claude and others added 3 commits February 15, 2026 15:56
Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with compile-time bit shifts and ldexp
  across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops,
  eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using
  integer-domain angle/sign checks instead (integer_adm.c)
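The first two transforms above can be sketched in isolation. This is a minimal illustration with hypothetical names (`row_sum_hoisted`, `pow2_u32`, `scale_by`), not the actual convolution.c or integer_adm.c code:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hoist the row offset: compute src + i * stride once per row instead of
   re-deriving it for every pixel in the inner loop. */
static void row_sum_hoisted(const float *src, int w, int h,
                            ptrdiff_t stride, float *dst)
{
    for (int i = 0; i < h; i++) {
        const float *row = src + i * stride;   /* hoisted multiplication */
        float acc = 0.f;
        for (int j = 0; j < w; j++)
            acc += row[j];                     /* no i * stride per pixel */
        dst[i] = acc;
    }
}

/* pow(2, n) is a libm call; for integer n, a shift (integers) or ldexp
   (floats) is exact and far cheaper. */
static uint32_t pow2_u32(int n)           { return 1u << n; }     /* was (uint32_t)pow(2, n) */
static double scale_by(double x, int n)   { return ldexp(x, n); } /* was x * pow(2, n) */
```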

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of broadcast
  in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding
  per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context
  creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce
  realloc frequency for typical workloads (feature_collector.c)
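The thundering-herd fix rests on a simple invariant: one enqueued job can only be consumed by one worker, so `pthread_cond_broadcast` wakes every sleeping worker just for all but one to find the queue empty and go back to sleep. A minimal self-contained sketch of the pattern (the `pool_t`/`worker`/`enqueue` names are illustrative, not libvmaf's):

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int  pending;    /* queued, not-yet-run jobs */
    int  done;       /* executed jobs */
    bool shutdown;
} pool_t;

static void *worker(void *arg)
{
    pool_t *p = arg;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (p->pending == 0 && !p->shutdown)
            pthread_cond_wait(&p->nonempty, &p->lock);
        if (p->pending == 0 && p->shutdown)
            break;                 /* queue drained, pool shutting down */
        p->pending--;
        p->done++;                 /* "run" the job */
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

static void enqueue(pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    p->pending++;
    pthread_cond_signal(&p->nonempty);   /* was: pthread_cond_broadcast */
    pthread_mutex_unlock(&p->lock);
}
```

Note that shutdown is the one place a broadcast is still correct, since every worker must wake to observe the flag and exit.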

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature,
  processing 16 uint16 elements per iteration (motion_avx2.c/h)

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
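A 16-lanes-per-iteration uint16 SAD kernel with runtime dispatch might look roughly like the sketch below. The names (`sad_c`, `sad_avx2`, `sad`) are hypothetical, not the actual motion_avx2.c code; the max/min trick avoids unsigned wraparound when taking the absolute difference:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Scalar reference: sum of absolute differences over uint16 samples. */
static uint64_t sad_c(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (uint64_t)(a[i] - b[i])
                             : (uint64_t)(b[i] - a[i]);
    return sum;
}

#if defined(__x86_64__) || defined(__i386__)
/* 16 uint16 lanes per iteration; each 32-bit accumulator lane grows by at
   most 65535 per iteration, so this sketch is safe for ~64k iterations. */
__attribute__((target("avx2")))
static uint64_t sad_avx2(const uint16_t *a, const uint16_t *b, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i d  = _mm256_sub_epi16(_mm256_max_epu16(va, vb),
                                      _mm256_min_epu16(va, vb));
        /* zero-extend 16-bit diffs into 32-bit lanes, then accumulate */
        acc = _mm256_add_epi32(acc,
                  _mm256_unpacklo_epi16(d, _mm256_setzero_si256()));
        acc = _mm256_add_epi32(acc,
                  _mm256_unpackhi_epi16(d, _mm256_setzero_si256()));
    }
    uint32_t lane[8];
    _mm256_storeu_si256((__m256i *)lane, acc);
    uint64_t sum = 0;
    for (int k = 0; k < 8; k++) sum += lane[k];
    return sum + sad_c(a + i, b + i, n - i);   /* scalar tail */
}
#endif

/* Runtime dispatch: AVX2 when the CPU has it, scalar C otherwise. */
static uint64_t sad(const uint16_t *a, const uint16_t *b, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sad_avx2(a, b, n);
#endif
    return sad_c(a, b, n);
}
```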
- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via
  cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime
  CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte
  inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering:
  thread pool (1000 jobs, data passing), feature collector (capacity
  512, 20 features), predict (score consistency with name caching),
  PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
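The cvtepu8_epi16 + madd_epi16 pattern for the 8-bit PSNR SSE can be sketched as below, with the scalar fallback and runtime dispatch alongside (hypothetical `sse_c`/`sse_avx2`/`sse` names, not the shipped code). `madd_epi16(d, d)` squares the 16-bit differences and pair-sums them into 32-bit lanes, which are then widened to 64-bit so the accumulator cannot overflow:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Scalar reference: sum of squared errors over 8-bit samples. */
static uint64_t sse_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

#if defined(__x86_64__) || defined(__i386__)
/* 32 pixels per iteration: two 16-byte loads per source, each widened to
   16-bit lanes; madd_epi16(d, d) squares and pair-sums into 32-bit. */
__attribute__((target("avx2")))
static uint64_t sse_avx2(const uint8_t *a, const uint8_t *b, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;                    /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i a0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i b0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256i a1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i + 16)));
        __m256i b1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i + 16)));
        __m256i d0 = _mm256_sub_epi16(a0, b0);
        __m256i d1 = _mm256_sub_epi16(a1, b1);
        __m256i s  = _mm256_add_epi32(_mm256_madd_epi16(d0, d0),
                                      _mm256_madd_epi16(d1, d1));
        /* widen the 32-bit pair sums to 64-bit before accumulating */
        acc = _mm256_add_epi64(acc, _mm256_unpacklo_epi32(s, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpackhi_epi32(s, zero));
    }
    uint64_t lane[4];
    _mm256_storeu_si256((__m256i *)lane, acc);
    return lane[0] + lane[1] + lane[2] + lane[3]
         + sse_c(a + i, b + i, n - i);     /* scalar tail */
}
#endif

/* Runtime dispatch: AVX2 if the CPU has it, scalar C fallback otherwise. */
static uint64_t sse(const uint8_t *a, const uint8_t *b, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sse_avx2(a, b, n);
#endif
    return sse_c(a, b, n);
}
```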
@kylophone kylophone closed this Apr 22, 2026
lusoris added a commit to lusoris/vmaf that referenced this pull request Apr 23, 2026
…83)

Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed)
into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the
enqueue hot path.

Mechanics:
- New `VmafThreadPool::free_jobs` list (protected by queue.lock)
  recycles `VmafThreadPoolJob` slots between enqueue calls instead
  of malloc/free on every job.
- New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the
  job struct. Payloads <= 64 bytes are copied into it and
  `job->data = job->inline_data`, avoiding a second malloc on the
  common caller path (main extractor dispatch, MCP frame events).
- Split cleanup: `_clear_data` distinguishes inline vs heap via
  `job->data != job->inline_data`; `_recycle` pushes onto free list;
  `_destroy` is kept for destructor-only use.
- Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy`
  walks and frees the recycle list after the workers exit.
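Using the field names from this commit message, the recycle/inline-data mechanics reduce to something like the following sketch. The layout is illustrative (in the real pool the free list is protected by queue.lock, and jobs carry the full callback signature):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_DATA_SIZE 64

typedef struct Job {
    void (*func)(void *data);
    void *data;
    struct Job *next;                        /* queue or free-list link */
    char inline_data[JOB_INLINE_DATA_SIZE];  /* small payloads copied here */
} Job;

static Job *free_jobs;   /* held under the queue lock in the real pool */

static Job *job_acquire(void (*func)(void *), const void *payload, size_t sz)
{
    Job *job = free_jobs;
    if (job) free_jobs = job->next;          /* reuse a recycled slot */
    else job = malloc(sizeof(*job));
    if (!job) return NULL;
    job->func = func;
    if (sz <= JOB_INLINE_DATA_SIZE) {        /* common path: no 2nd malloc */
        memcpy(job->inline_data, payload, sz);
        job->data = job->inline_data;
    } else {
        job->data = malloc(sz);
        if (!job->data) { free(job); return NULL; }
        memcpy(job->data, payload, sz);
    }
    return job;
}

static void job_recycle(Job *job)
{
    if (job->data != job->inline_data)       /* the inline_data guard */
        free(job->data);                     /* heap payload only */
    job->next = free_jobs;
    free_jobs = job;
}
```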

Adapted to the fork's `void (*func)(void *data, void **thread_data)`
signature and `VmafThreadPoolWorker` per-worker-data path, which
Netflix upstream lacks. No API change; callers unmodified.

Verification:
- meson test -C build: 32/32 pass (threaded framework tests included).
- Netflix golden pair (src01_hrc00/01_576x324, full VMAF with
  vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and
  `--threads 4` (attribute order may differ — insertion ordering in
  feature_collector, unchanged by this PR).
- Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0
  (scalar) and =255 (SIMD). diff exit 0.
- clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings,
  no NOLINT.
- Micro-benchmark (500k jobs, 4 threads, int payload):
    BEFORE (master):  ~1.20 M jobs/sec median
    AFTER (this PR):  ~2.20 M jobs/sec median
  => ~1.8-2.6x enqueue throughput win.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of
T3-6 was already landed via fork commit 81fcd42 (with additional
AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Deliverables (ADR-0108):
 1. research digest: no digest needed - narrow upstream port, no novel algorithm
 2. decision matrix: ADR-0147 §Alternatives considered
 3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling
    + inline data buffer invariant with inline_data guard)
 4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD
    diff exit 0
 5. CHANGELOG: fork entry under Changed
 6. rebase-notes: entry 0040

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>