Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations #1464
Closed
tolgaki wants to merge 3 commits into Netflix:master
tolgaki commented Feb 15, 2026
- AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
- AVX2 SAD for motion feature
- Thread pool job object pool (free list + 64-byte inline data buffer)
- Thread pool thundering herd fix (signal vs broadcast)
- Feature collector initial capacity 8 -> 512
- integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
- integer_vif.c: Remove unnecessary epsilon; cache g*g
- predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
- convolution.c: Hoist stride multiplication out of inner loops
- Comprehensive test suite (11 tests covering all optimized paths)
- All 18 meson tests pass
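
The `pow(2, N)` replacement listed above can be illustrated with a minimal sketch. The helper names `pow2_u32` and `scale_pow2` are hypothetical and are not taken from `integer_adm.c`; the sketch only shows the general idea of swapping a floating-point `pow` call for a shift (integer case) or `ldexp` (floating-point case).

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not the PR's actual code. */

/* Integer power of two, valid for 0 <= n <= 31.
 * Stands in for (uint32_t)pow(2, n) without a libm call. */
uint32_t pow2_u32(unsigned n)
{
    return UINT32_C(1) << n;
}

/* Scale a double by 2^n. Stands in for x * pow(2, n); ldexp only adjusts
 * the exponent, so the result is exact unless it overflows or underflows. */
double scale_pow2(double x, int n)
{
    return ldexp(x, n);
}
```

When `N` is a compile-time constant, the shift folds to a constant and no runtime work remains.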
Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with compile-time bit shifts and ldexp across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops, eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using integer-domain angle/sign checks instead (integer_adm.c)

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of broadcast in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce realloc frequency for typical workloads (feature_collector.c)

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature, processing 16 uint16 elements per iteration (motion_avx2.c/h)

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
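
For readers unfamiliar with the AVX2 SAD item, here is a rough sketch of a 16-element-per-iteration SAD over `uint16_t` samples. It is not the code in `motion_avx2.c`: the function names are hypothetical, the absolute difference via `max - min` and the widening scheme are one possible choice, and the `target("avx2")` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

/* Scalar reference: sum of |a[i] - b[i]| over n samples. */
uint64_t sad_u16_scalar(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (uint64_t)(a[i] - b[i]) : (uint64_t)(b[i] - a[i]);
    return sum;
}

#ifdef SKETCH_HAVE_X86
/* Hypothetical AVX2 variant: 16 uint16 elements per iteration. */
__attribute__((target("avx2")))
uint64_t sad_u16_avx2(const uint16_t *a, const uint16_t *b, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero; /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        /* Unsigned |a - b| computed as max(a, b) - min(a, b). */
        __m256i d = _mm256_sub_epi16(_mm256_max_epu16(va, vb),
                                     _mm256_min_epu16(va, vb));
        /* Zero-extend to 32 bits and add the two halves. */
        __m256i s32 = _mm256_add_epi32(_mm256_unpacklo_epi16(d, zero),
                                       _mm256_unpackhi_epi16(d, zero));
        /* Zero-extend to 64 bits so long inputs cannot overflow. */
        acc = _mm256_add_epi64(acc, _mm256_unpacklo_epi32(s32, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpackhi_epi32(s32, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    /* Finish any tail of fewer than 16 elements with the scalar loop. */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + sad_u16_scalar(a + i, b + i, n - i);
}
#endif

/* Runtime dispatch: use AVX2 when the CPU reports it, else scalar. */
uint64_t sad_u16(const uint16_t *a, const uint16_t *b, size_t n)
{
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx2"))
        return sad_u16_avx2(a, b, n);
#endif
    return sad_u16_scalar(a, b, n);
}
```

Accumulating in 64-bit lanes costs two extra unpacks per iteration; a production kernel would more likely accumulate in 32-bit lanes and flush periodically.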
- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering: thread pool (1000 jobs, data passing), feature collector (capacity 512, 20 features), predict (score consistency with name caching), PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
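
The `cvtepu8_epi16` + `madd_epi16` approach named above can be sketched as follows. This is an illustrative reconstruction, not the PR's kernel: function names are hypothetical, the 64-bit accumulation is an assumption made to keep the sketch overflow-safe, and the `target("avx2")` attribute and `__builtin_cpu_supports` assume GCC or Clang on x86.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_X86 1
#endif

/* Scalar fallback: sum of squared errors over n 8-bit pixels. */
uint64_t sse_u8_scalar(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)ref[i] - (int)dis[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

#ifdef SKETCH_HAVE_X86
/* Squared differences of 16 pixels as eight 32-bit pair sums. */
__attribute__((target("avx2")))
static __m256i sq_diff16(const uint8_t *ref, const uint8_t *dis)
{
    /* Zero-extend 16 bytes to 16-bit lanes so the difference fits. */
    __m256i r = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)ref));
    __m256i d = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)dis));
    __m256i diff = _mm256_sub_epi16(r, d);
    /* madd multiplies lane-wise and adds adjacent pairs into 32 bits. */
    return _mm256_madd_epi16(diff, diff);
}

/* Hypothetical AVX2 variant: 32 pixels per iteration. */
__attribute__((target("avx2")))
uint64_t sse_u8_avx2(const uint8_t *ref, const uint8_t *dis, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero; /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i s32 = _mm256_add_epi32(sq_diff16(ref + i, dis + i),
                                       sq_diff16(ref + i + 16, dis + i + 16));
        /* Zero-extend to 64 bits before accumulating. */
        acc = _mm256_add_epi64(acc, _mm256_unpacklo_epi32(s32, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpackhi_epi32(s32, zero));
    }
    uint64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    /* Remaining pixels (fewer than 32) go through the scalar loop. */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3]
         + sse_u8_scalar(ref + i, dis + i, n - i);
}
#endif

/* Runtime CPU dispatch with scalar fallback. */
uint64_t sse_u8(const uint8_t *ref, const uint8_t *dis, size_t n)
{
#ifdef SKETCH_HAVE_X86
    if (__builtin_cpu_supports("avx2"))
        return sse_u8_avx2(ref, dis, n);
#endif
    return sse_u8_scalar(ref, dis, n);
}
```

PSNR then follows from the returned sum in the usual way, as 10·log10(255² / (SSE / n)) for 8-bit content.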
lusoris added a commit to lusoris/vmaf that referenced this pull request Apr 23, 2026
…83)

Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed) into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the enqueue hot path.

Mechanics:
- New `VmafThreadPool::free_jobs` list (protected by queue.lock) recycles `VmafThreadPoolJob` slots between enqueue calls instead of malloc/free on every job.
- New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the job struct. Payloads <= 64 bytes are copied into it and `job->data = job->inline_data`, avoiding a second malloc on the common caller path (main extractor dispatch, MCP frame events).
- Split cleanup: `_clear_data` distinguishes inline vs heap via `job->data != job->inline_data`; `_recycle` pushes onto free list; `_destroy` is kept for destructor-only use.
- Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy` walks and frees the recycle list after the workers exit.

Adapted to the fork's `void (*func)(void *data, void **thread_data)` signature and `VmafThreadPoolWorker` per-worker-data path, which Netflix upstream lacks. No API change; callers unmodified.

Verification:
- meson test -C build: 32/32 pass (threaded framework tests included).
- Netflix golden pair (src01_hrc00/01_576x324, full VMAF with vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and `--threads 4` (attribute order may differ: insertion ordering in feature_collector, unchanged by this PR).
- Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0 (scalar) and =255 (SIMD). diff exit 0.
- clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings, no NOLINT.
- Micro-benchmark (500k jobs, 4 threads, int payload):
  BEFORE (master): ~1.20 M jobs/sec median
  AFTER (this PR): ~2.20 M jobs/sec median
  => ~1.8-2.6x enqueue throughput win.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of T3-6 was already landed via fork commit 81fcd42 (with additional AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Deliverables (ADR-0108):
1. research digest: no digest needed - narrow upstream port, no novel algorithm
2. decision matrix: ADR-0147 §Alternatives considered
3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling + inline data buffer invariant with inline_data guard)
4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD diff exit 0
5. CHANGELOG: fork entry under Changed
6. rebase-notes: entry 0040

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
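
The recycling scheme described under Mechanics can be sketched in a few lines. This is a simplified illustration, not the contents of `thread_pool.c`: the `Job`/`JobPool` types and the `job_*` function names are hypothetical, the callback uses a plain `void (*)(void *)` signature, and locking is omitted (the commit message states the real free list is protected by `queue.lock`).

```c
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_DATA_SIZE 64

/* Hypothetical job slot; inline_data sits at the tail, as described above. */
typedef struct Job {
    struct Job *next;
    void (*func)(void *data);
    void *data;                          /* inline_data or a heap block */
    char inline_data[JOB_INLINE_DATA_SIZE];
} Job;

typedef struct JobPool {
    Job *free_jobs;                      /* singly linked recycle list */
} JobPool;

/* Pop a recycled slot if available, else malloc one; copy the payload
 * inline when it fits, otherwise fall back to a heap allocation. */
Job *job_acquire(JobPool *pool, void (*func)(void *),
                 const void *data, size_t data_sz)
{
    Job *job = pool->free_jobs;
    if (job) {
        pool->free_jobs = job->next;
    } else {
        job = malloc(sizeof(*job));
        if (!job) return NULL;
    }
    job->next = NULL;
    job->func = func;
    job->data = NULL;
    if (data && data_sz > 0) {
        if (data_sz <= JOB_INLINE_DATA_SIZE) {
            job->data = job->inline_data;
        } else {
            job->data = malloc(data_sz);
            if (!job->data) {            /* return the slot to the list */
                job->next = pool->free_jobs;
                pool->free_jobs = job;
                return NULL;
            }
        }
        memcpy(job->data, data, data_sz);
    }
    return job;
}

/* Free heap payloads only, then push the slot back on the free list. */
void job_recycle(JobPool *pool, Job *job)
{
    if (job->data && job->data != (void *)job->inline_data)
        free(job->data);
    job->data = NULL;
    job->next = pool->free_jobs;
    pool->free_jobs = job;
}

/* Walk and free the recycle list; meant for pool teardown only. */
void job_pool_destroy(JobPool *pool)
{
    while (pool->free_jobs) {
        Job *job = pool->free_jobs;
        pool->free_jobs = job->next;
        free(job);
    }
}
```

The pointer comparison against `inline_data` is what lets cleanup tell an inline payload from a heap one without an extra flag, at the cost of 64 extra bytes per job slot.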