Video stutter in Grim Fandango Remastered: decodeRGBX (YUV→RGBX) dominates CPU in emulated Theora playback #3599

@runlevel5

Hey ptitSeb,

I've been investigating a video playback stutter in Grim Fandango Remastered running on PPC64LE (POWER9) with BOX32 enabled. The intro movie stutters visibly for a few frames around the 2:20 mark. Audio plays fine throughout.

After profiling with perf + BOX64_DYNAREC_PERFMAP=1, the bottleneck is very clear:

Profiling results

| % CPU | Symbol | What it is |
|---|---|---|
| ~35-40% | decodeRGBX | YUV→RGBX color conversion (statically linked Theora) |
| 3.3% | __memcpy_power7 | native memcpy |
| 1.8% | ppc64le_next | dynarec interpreter fallback |
| ~1.6% | oc_dec_dc_unpredict_mcu_plane_c | Theora DC coefficient decode |
| ~0.6% | oc_frag_recon_intra_c | Theora fragment reconstruction |
| ~0.4% | oc_huff_token_decode_c | Theora Huffman decode |

The game statically links all of its Theora/Ogg/Vorbis code — no shared libraries to wrap.

What decodeRGBX does

It's a straightforward YUV→RGBX pixel converter with a tight inner loop (~450 bytes of x86 at 0x08226570). The loop body does:

  • Reads Y, Cb, Cr from separate planes via movzbl
  • Table lookups from 5 fixed-point coefficient tables (via mov with scaled index)
  • Fixed-point arithmetic: imul, add, sub, sar $0xd
  • Clamp to 0-255 using a cmp $0xff / neg / sar $0x1f pattern
  • Byte stores to the output RGBX buffer
  • Processes 4 pixels per iteration (a 2x2 block sharing one Cb/Cr pair, matching the chroma subsampling)

No SIMD, no floating point — pure integer with lots of memory access through lookup tables.
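For reference, here is roughly what I understand the loop to be doing, written as a minimal C sketch. This is not the game's code: the table names, the BT.601 limited-range coefficients, and the function signature are all my assumptions. Only the 13-bit fixed-point shift, the five tables, and the 2x2 block structure come from the disassembly. It also assumes even width and height.

```c
#include <stdint.h>
#include <stddef.h>

#define FIX_SHIFT 13  /* corresponds to the `sar $0xd` in the disassembly */

/* Five lookup tables (hypothetical names and layout). */
static int32_t tab_y[256], tab_rv[256], tab_gu[256], tab_gv[256], tab_bu[256];

/* Fill the tables once. Coefficients are ASSUMED BT.601 limited-range
 * values scaled by 2^13; the real tables in the game may differ. */
void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        tab_y[i]  = 9535  * (i - 16) + (1 << (FIX_SHIFT - 1)); /* 1.164, plus rounding */
        tab_rv[i] = 13074 * (i - 128);                         /* 1.596 */
        tab_gu[i] = 3203  * (i - 128);                         /* 0.391 */
        tab_gv[i] = 6660  * (i - 128);                         /* 0.813 */
        tab_bu[i] = 16531 * (i - 128);                         /* 2.018 */
    }
}

/* Drop the fixed-point fraction and clamp to 0..255. Negative values are
 * handled before shifting so no signed right shift of a negative occurs. */
static uint8_t clamp_shift(int32_t v)
{
    if (v < 0)
        return 0;
    v >>= FIX_SHIFT;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Convert planar YUV 4:2:0 to packed RGBX, one 2x2 block per iteration. */
void yuv420_to_rgbx(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                    size_t y_stride, size_t c_stride,
                    uint8_t *out, size_t out_stride,
                    int width, int height)
{
    for (int row = 0; row < height; row += 2) {
        for (int col = 0; col < width; col += 2) {
            /* One chroma sample pair is shared by the four luma samples. */
            size_t ci = (size_t)(row / 2) * c_stride + (size_t)(col / 2);
            int32_t r_off = tab_rv[cr[ci]];
            int32_t g_off = tab_gu[cb[ci]] + tab_gv[cr[ci]];
            int32_t b_off = tab_bu[cb[ci]];

            for (int dy = 0; dy < 2; dy++) {
                for (int dx = 0; dx < 2; dx++) {
                    size_t yi = (size_t)(row + dy) * y_stride + (size_t)(col + dx);
                    uint8_t *px = out + (size_t)(row + dy) * out_stride
                                      + 4 * (size_t)(col + dx);
                    int32_t luma = tab_y[y[yi]];
                    px[0] = clamp_shift(luma + r_off);  /* R */
                    px[1] = clamp_shift(luma - g_off);  /* G */
                    px[2] = clamp_shift(luma + b_off);  /* B */
                    px[3] = 0xff;                       /* X (padding byte) */
                }
            }
        }
    }
}
```

The sketch folds the multiplies into the tables, so where exactly the imul sits in the real loop is a guess on my part. The point is the access pattern: per pixel, one luma load, a handful of table loads, three clamps, and four byte stores.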

What causes the stutter

strace shows the movie data reader thread making no syscalls for 1.31 seconds (normally it cycles every ~1s). During this gap, no new decoded frames flow into the pipeline and the frame buffer runs dry. It's not I/O: all file reads complete in <1ms. The thread is just stuck executing emulated x86 Theora code.

The question

I wanted to ask for your guidance on the best approach here. A few ideas I had:

  1. Check if any x86 instructions in the hot loop fall back to ppc64le_next — there's 1.8% in the interpreter, and if some of that is from this loop, it could be a quick win to add JIT support for those opcodes.

  2. Native function replacement — Could box64 intercept calls to decodeRGBX (even though it's statically linked) and replace it with a native PPC64LE YUV→RGBX implementation? Kind of like what GOM does for library functions, but for a game symbol. Not sure if there's precedent for this.

  3. JIT codegen quality — The inner loop is mostly mov, movzbl, imul, add, sub, sar, cmp, jbe, neg with heavy use of stack-relative addressing (mov 0x28(%esp),%edi etc.). Is there anything in the PPC64LE dynarec that might generate suboptimal code for these patterns? For instance, the frequent sar $0xd + cmp $0xff + neg + sar $0x1f clamping sequence appears ~10 times.

  4. Just accept it — Maybe this is just the inherent cost of emulating a tight pixel-processing loop and there's not much to be done without native Theora libraries.
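To make the clamping sequence from idea 3 concrete, here is a small C sketch of what I believe the cmp $0xff / jbe / neg / sar $0x1f pattern computes, next to a plain reference clamp. The function names are mine, and the mapping to the instructions is my reading of the disassembly rather than anything confirmed.

```c
#include <stdint.h>

/* Plain reference clamp to 0..255. */
uint8_t clamp_ref(int32_t v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Mirrors the x86 idiom on a value that has already been through
 * `sar $0xd`. Unsigned arithmetic is used so the C code stays well
 * defined; the low byte matches what neg + sar $0x1f would leave. */
uint8_t clamp_idiom(int32_t v)
{
    if ((uint32_t)v > 0xffu) {          /* cmp $0xff ; jbe skips if in range */
        uint32_t n = 0u - (uint32_t)v;  /* neg: sign bit set iff v was > 0   */
        v = (int32_t)(0u - (n >> 31));  /* sar $0x1f: -1 if v > 255, else 0  */
    }
    return (uint8_t)v;                  /* byte store keeps the low 8 bits   */
}
```

The two agree for every input except INT32_MIN, which cannot occur after a 13-bit right shift. So each clamp is one unsigned compare, one usually-taken branch, and two ALU ops on the rare path. If the dynarec emits much more than that per clamp on PPC64LE, it adds up over ~10 occurrences per iteration.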

Any thoughts on which direction would be most productive? Happy to provide more data or test patches.

Environment

  • Platform: PPC64LE (POWER9), 32 cores, 64GB RAM
  • Build flags: cmake .. -DPPC64LE=1 -DBOX32=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo
  • Game: Grim Fandango Remastered (32-bit i386 ELF, not stripped, 9377 symbols)
  • Branch: based on current main with BOX32 wrappers for libogg, libvorbis, pulse-simple

Related: #3577 (dynarec block dispatch overhead on >4KB page systems)
