Optimize VMAF core: AVX2 PSNR, thread pool job pool, and micro-optimizations#1464

Closed
tolgaki wants to merge 3 commits into Netflix:master from tolgaki:master

Conversation


@tolgaki tolgaki commented Feb 15, 2026

  • AVX2 PSNR SSE computation (32 pixels/iteration with runtime CPU dispatch)
  • AVX2 SAD for motion feature
  • Thread pool job object pool (free list + 64-byte inline data buffer)
  • Thread pool thundering herd fix (signal vs broadcast)
  • Feature collector initial capacity 8 -> 512
  • integer_adm.c: pow(2, N) -> bit shifts/constants; eliminate redundant float conversions
  • integer_vif.c: Remove unnecessary epsilon; cache g*g
  • predict.c: Stack-allocate SVM nodes; lazy-cache name_with_opt
  • convolution.c: Hoist stride multiplication out of inner loops
  • Comprehensive test suite (11 tests covering all optimized paths)
  • All 18 meson tests pass

claude and others added 3 commits February 15, 2026 15:56
Phase 1 - Hot-path computation optimizations:
- Hoist stride multiplications out of inner convolution loops (convolution.c)
- Replace runtime pow(2,N) calls with compile-time bit shifts and ldexp
  across integer_adm.c and integer_motion.c (~20 instances)
- Remove unnecessary epsilon and cache g*g in VIF statistic loops,
  eliminating redundant FP division per pixel (integer_vif.c)
- Eliminate redundant float conversions in ADM decouple by using
  integer-domain angle/sign checks instead (integer_adm.c)
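The first two transforms above can be sketched in isolation. This is a minimal illustration with hypothetical names (`row_sum_hoisted`, `pow2_u32`, `scale_by`), not the actual convolution.c or integer_adm.c code:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hoist the row offset: compute src + i * stride once per row instead of
   re-deriving it for every pixel in the inner loop. */
static void row_sum_hoisted(const float *src, int w, int h,
                            ptrdiff_t stride, float *dst)
{
    for (int i = 0; i < h; i++) {
        const float *row = src + i * stride;   /* hoisted multiplication */
        float acc = 0.f;
        for (int j = 0; j < w; j++)
            acc += row[j];                     /* no i * stride per pixel */
        dst[i] = acc;
    }
}

/* pow(2, n) is a libm call; for integer n, a shift (integers) or ldexp
   (floats) is exact and far cheaper. */
static uint32_t pow2_u32(int n)           { return 1u << n; }     /* was (uint32_t)pow(2, n) */
static double scale_by(double x, int n)   { return ldexp(x, n); } /* was x * pow(2, n) */
```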

Phase 1 - Threading and allocation optimizations:
- Fix thundering herd: use pthread_cond_signal instead of broadcast
  in thread pool job enqueue (thread_pool.c)
- Use stack allocation for SVM node array in predict path, avoiding
  per-frame malloc/free churn (predict.c)
- Cache generated feature names in model to avoid repeated context
  creation/destruction per prediction (predict.c, model.h, model.c)
- Increase feature vector initial capacity from 8 to 512 to reduce
  realloc frequency for typical workloads (feature_collector.c)
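The thundering-herd fix rests on a simple invariant: one enqueued job can only be consumed by one worker, so `pthread_cond_broadcast` wakes every sleeping worker just for all but one to find the queue empty and go back to sleep. A minimal self-contained sketch of the pattern (the `pool_t`/`worker`/`enqueue` names are illustrative, not libvmaf's):

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int  pending;    /* queued, not-yet-run jobs */
    int  done;       /* executed jobs */
    bool shutdown;
} pool_t;

static void *worker(void *arg)
{
    pool_t *p = arg;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (p->pending == 0 && !p->shutdown)
            pthread_cond_wait(&p->nonempty, &p->lock);
        if (p->pending == 0 && p->shutdown)
            break;                 /* queue drained, pool shutting down */
        p->pending--;
        p->done++;                 /* "run" the job */
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

static void enqueue(pool_t *p)
{
    pthread_mutex_lock(&p->lock);
    p->pending++;
    pthread_cond_signal(&p->nonempty);   /* was: pthread_cond_broadcast */
    pthread_mutex_unlock(&p->lock);
}
```

Note that shutdown is the one place a broadcast is still correct, since every worker must wake to observe the flag and exit.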

Phase 2 - SIMD coverage:
- Add AVX2 SAD (Sum of Absolute Differences) for motion feature,
  processing 16 uint16 elements per iteration (motion_avx2.c/h)

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
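A 16-lanes-per-iteration uint16 SAD kernel with runtime dispatch might look roughly like the sketch below. The names (`sad_c`, `sad_avx2`, `sad`) are hypothetical, not the actual motion_avx2.c code; the max/min trick avoids unsigned wraparound when taking the absolute difference:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Scalar reference: sum of absolute differences over uint16 samples. */
static uint64_t sad_c(const uint16_t *a, const uint16_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] > b[i]) ? (uint64_t)(a[i] - b[i])
                             : (uint64_t)(b[i] - a[i]);
    return sum;
}

#if defined(__x86_64__) || defined(__i386__)
/* 16 uint16 lanes per iteration; each 32-bit accumulator lane grows by at
   most 65535 per iteration, so this sketch is safe for ~64k iterations. */
__attribute__((target("avx2")))
static uint64_t sad_avx2(const uint16_t *a, const uint16_t *b, size_t n)
{
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        __m256i d  = _mm256_sub_epi16(_mm256_max_epu16(va, vb),
                                      _mm256_min_epu16(va, vb));
        /* zero-extend 16-bit diffs into 32-bit lanes, then accumulate */
        acc = _mm256_add_epi32(acc,
                  _mm256_unpacklo_epi16(d, _mm256_setzero_si256()));
        acc = _mm256_add_epi32(acc,
                  _mm256_unpackhi_epi16(d, _mm256_setzero_si256()));
    }
    uint32_t lane[8];
    _mm256_storeu_si256((__m256i *)lane, acc);
    uint64_t sum = 0;
    for (int k = 0; k < 8; k++) sum += lane[k];
    return sum + sad_c(a + i, b + i, n - i);   /* scalar tail */
}
#endif

/* Runtime dispatch: AVX2 when the CPU has it, scalar C otherwise. */
static uint64_t sad(const uint16_t *a, const uint16_t *b, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sad_avx2(a, b, n);
#endif
    return sad_c(a, b, n);
}
```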
- Add AVX2 SSE computation for 8-bit PSNR (32 pixels/iteration via
  cvtepu8_epi16 + madd_epi16), with scalar C fallback and runtime
  CPU dispatch
- Add thread pool job object pool with free list recycling and 64-byte
  inline data buffer to eliminate per-job malloc/free overhead
- Add comprehensive test suite (test_perf_optimizations) covering:
  thread pool (1000 jobs, data passing), feature collector (capacity
  512, 20 features), predict (score consistency with name caching),
  PSNR/VIF/ADM/motion feature extractors, and end-to-end VMAF scoring

https://claude.ai/code/session_01MaZDdztsZcd5y2PQo1av2G
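The cvtepu8_epi16 + madd_epi16 pattern for the 8-bit PSNR SSE can be sketched as below, with the scalar fallback and runtime dispatch alongside (hypothetical `sse_c`/`sse_avx2`/`sse` names, not the shipped code). `madd_epi16(d, d)` squares the 16-bit differences and pair-sums them into 32-bit lanes, which are then widened to 64-bit so the accumulator cannot overflow:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

/* Scalar reference: sum of squared errors over 8-bit samples. */
static uint64_t sse_c(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        sum += (uint64_t)(d * d);
    }
    return sum;
}

#if defined(__x86_64__) || defined(__i386__)
/* 32 pixels per iteration: two 16-byte loads per source, each widened to
   16-bit lanes; madd_epi16(d, d) squares and pair-sums into 32-bit. */
__attribute__((target("avx2")))
static uint64_t sse_avx2(const uint8_t *a, const uint8_t *b, size_t n)
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i acc = zero;                    /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i a0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256i b0 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256i a1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(a + i + 16)));
        __m256i b1 = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i *)(b + i + 16)));
        __m256i d0 = _mm256_sub_epi16(a0, b0);
        __m256i d1 = _mm256_sub_epi16(a1, b1);
        __m256i s  = _mm256_add_epi32(_mm256_madd_epi16(d0, d0),
                                      _mm256_madd_epi16(d1, d1));
        /* widen the 32-bit pair sums to 64-bit before accumulating */
        acc = _mm256_add_epi64(acc, _mm256_unpacklo_epi32(s, zero));
        acc = _mm256_add_epi64(acc, _mm256_unpackhi_epi32(s, zero));
    }
    uint64_t lane[4];
    _mm256_storeu_si256((__m256i *)lane, acc);
    return lane[0] + lane[1] + lane[2] + lane[3]
         + sse_c(a + i, b + i, n - i);     /* scalar tail */
}
#endif

/* Runtime dispatch: AVX2 if the CPU has it, scalar C fallback otherwise. */
static uint64_t sse(const uint8_t *a, const uint8_t *b, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        return sse_avx2(a, b, n);
#endif
    return sse_c(a, b, n);
}
```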
@kylophone kylophone closed this Apr 22, 2026
lusoris added a commit to lusoris/vmaf that referenced this pull request Apr 23, 2026
…83)

Port the thread-pool portion of Netflix upstream PR Netflix#1464 (closed)
into libvmaf/src/thread_pool.c. Eliminates malloc/free churn from the
enqueue hot path.

Mechanics:
- New `VmafThreadPool::free_jobs` list (protected by queue.lock)
  recycles `VmafThreadPoolJob` slots between enqueue calls instead
  of malloc/free on every job.
- New `char inline_data[JOB_INLINE_DATA_SIZE=64]` at the tail of the
  job struct. Payloads <= 64 bytes are copied into it and
  `job->data = job->inline_data`, avoiding a second malloc on the
  common caller path (main extractor dispatch, MCP frame events).
- Split cleanup: `_clear_data` distinguishes inline vs heap via
  `job->data != job->inline_data`; `_recycle` pushes onto free list;
  `_destroy` is kept for destructor-only use.
- Runner now `_recycle`s finished jobs; `vmaf_thread_pool_destroy`
  walks and frees the recycle list after the workers exit.
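Using the field names from this commit message, the recycle/inline-data mechanics reduce to something like the following sketch. The layout is illustrative (in the real pool the free list is protected by queue.lock, and jobs carry the full callback signature):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOB_INLINE_DATA_SIZE 64

typedef struct Job {
    void (*func)(void *data);
    void *data;
    struct Job *next;                        /* queue or free-list link */
    char inline_data[JOB_INLINE_DATA_SIZE];  /* small payloads copied here */
} Job;

static Job *free_jobs;   /* held under the queue lock in the real pool */

static Job *job_acquire(void (*func)(void *), const void *payload, size_t sz)
{
    Job *job = free_jobs;
    if (job) free_jobs = job->next;          /* reuse a recycled slot */
    else job = malloc(sizeof(*job));
    if (!job) return NULL;
    job->func = func;
    if (sz <= JOB_INLINE_DATA_SIZE) {        /* common path: no 2nd malloc */
        memcpy(job->inline_data, payload, sz);
        job->data = job->inline_data;
    } else {
        job->data = malloc(sz);
        if (!job->data) { free(job); return NULL; }
        memcpy(job->data, payload, sz);
    }
    return job;
}

static void job_recycle(Job *job)
{
    if (job->data != job->inline_data)       /* the inline_data guard */
        free(job->data);                     /* heap payload only */
    job->next = free_jobs;
    free_jobs = job;
}
```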

Adapted to the fork's `void (*func)(void *data, void **thread_data)`
signature and `VmafThreadPoolWorker` per-worker-data path, which
Netflix upstream lacks. No API change; callers unmodified.

Verification:
- meson test -C build: 32/32 pass (threaded framework tests included).
- Netflix golden pair (src01_hrc00/01_576x324, full VMAF with
  vmaf_v0.6.1 model): bit-identical scores between `--threads 1` and
  `--threads 4` (attribute order may differ — insertion ordering in
  feature_collector, unchanged by this PR).
- Same pair, `--threads 4`: bit-identical between VMAF_CPU_MASK=0
  (scalar) and =255 (SIMD). diff exit 0.
- clang-tidy -p build libvmaf/src/thread_pool.c: zero warnings,
  no NOLINT.
- Micro-benchmark (500k jobs, 4 threads, int payload):
    BEFORE (master):  ~1.20 M jobs/sec median
    AFTER (this PR):  ~2.20 M jobs/sec median
  => ~1.8-2.6x enqueue throughput win.

Closes the thread-pool half of backlog T3-6. The AVX2 PSNR half of
T3-6 was already landed via fork commit 81fcd42 (with additional
AVX-512 + NEON variants beyond upstream's AVX2-only coverage).

Deliverables (ADR-0108):
 1. research digest: no digest needed - narrow upstream port, no novel algorithm
 2. decision matrix: ADR-0147 §Alternatives considered
 3. AGENTS.md invariant: libvmaf/AGENTS.md (thread-pool recycling
    + inline data buffer invariant with inline_data guard)
 4. reproducer: bench + Netflix-golden-pair threaded scalar-vs-SIMD
    diff exit 0
 5. CHANGELOG: fork entry under Changed
 6. rebase-notes: entry 0040

Co-authored-by: Lusoris <lusoris@pm.me>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>