Hey ptitSeb,
I've been investigating a video playback stutter in Grim Fandango Remastered running on PPC64LE (POWER9) with BOX32 enabled. The intro movie stutters visibly for a few frames around the 2:20 mark. Audio plays fine throughout.
After profiling with perf + BOX64_DYNAREC_PERFMAP=1, the bottleneck is very clear:
Profiling results
| % CPU | Symbol | What it is |
| --- | --- | --- |
| ~35-40% | `decodeRGBX` | YUV→RGBX color conversion (statically linked Theora) |
| 3.3% | `__memcpy_power7` | native memcpy |
| 1.8% | `ppc64le_next` | dynarec interpreter fallback |
| ~1.6% | `oc_dec_dc_unpredict_mcu_plane_c` | Theora DC coefficient decode |
| ~0.6% | `oc_frag_recon_intra_c` | Theora fragment reconstruction |
| ~0.4% | `oc_huff_token_decode_c` | Theora Huffman decode |
The game statically links all of its Theora/Ogg/Vorbis code — no shared libraries to wrap.
What decodeRGBX does
It's a straightforward YUV→RGBX pixel converter with a tight inner loop (~450 bytes of x86 at 0x08226570). The loop body does:
- Reads Y, Cb, Cr from separate planes via `movzbl`
- Table lookups from 5 fixed-point coefficient tables (via `mov` with scaled index)
- Fixed-point arithmetic: `imul`, `add`, `sub`, `sar $0xd`
- Clamp to 0-255 using a `cmp $0xff` / `neg` / `sar $0x1f` pattern
- Byte stores to the output RGBX buffer
- Processes 4 pixels per iteration (2x2 block matching chroma subsampling)
No SIMD, no floating point — pure integer with lots of memory access through lookup tables.
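For reference, here is a rough C sketch of what a loop with this shape typically looks like. It is not the game's code: the table names, the BT.601 coefficients, the function names, the `0xff` written to the X byte, and the even width/height requirement are all my assumptions. Only the 13-bit shift, the five tables, the clamp, and the 2x2 block per iteration come from the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

#define FIX_SHIFT 13 /* matches the `sar $0xd` seen in the hot loop */

/* Five lookup tables, one per term (hypothetical names, assumed BT.601). */
static int32_t tab_y[256], tab_cr_r[256], tab_cr_g[256], tab_cb_g[256], tab_cb_b[256];

static void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        tab_y[i]    = 9535  * (i - 16);   /* ~1.164 * 2^13 */
        tab_cr_r[i] = 13074 * (i - 128);  /* ~1.596 * 2^13 */
        tab_cr_g[i] = 6660  * (i - 128);  /* ~0.813 * 2^13 */
        tab_cb_g[i] = 3203  * (i - 128);  /* ~0.391 * 2^13 */
        tab_cb_b[i] = 16531 * (i - 128);  /* ~2.018 * 2^13 */
    }
}

/* Clamp to 0..255. The x86 code does the out-of-range case with
 * neg / sar $0x1f instead of a second compare. */
static inline uint8_t clamp255(int32_t v)
{
    if ((uint32_t)v > 0xff)   /* cmp $0xff ; jbe */
        v = (v < 0) ? 0 : 255;
    return (uint8_t)v;
}

/* Note: >> on a negative int32_t is arithmetic on GCC/Clang (like sar),
 * but implementation-defined in ISO C. */
static inline void put_pixel(uint8_t *dst, int32_t y,
                             int32_t r_add, int32_t g_sub, int32_t b_add)
{
    dst[0] = clamp255((y + r_add) >> FIX_SHIFT);
    dst[1] = clamp255((y - g_sub) >> FIX_SHIFT);
    dst[2] = clamp255((y + b_add) >> FIX_SHIFT);
    dst[3] = 0xff; /* X byte, value assumed */
}

/* 4:2:0 input, 4 output pixels (one 2x2 block) per inner iteration.
 * Assumes width and height are even. Strides are in bytes. */
static void yuv420_to_rgbx(const uint8_t *yp, const uint8_t *cbp, const uint8_t *crp,
                           uint8_t *out, int width, int height,
                           int y_stride, int c_stride, int out_stride)
{
    for (int row = 0; row < height; row += 2) {
        const uint8_t *y0 = yp + (size_t)row * y_stride;
        const uint8_t *y1 = y0 + y_stride;
        const uint8_t *cb = cbp + (size_t)(row / 2) * c_stride;
        const uint8_t *cr = crp + (size_t)(row / 2) * c_stride;
        uint8_t *o0 = out + (size_t)row * out_stride;
        uint8_t *o1 = o0 + out_stride;

        for (int col = 0; col < width; col += 2) {
            /* Chroma terms are shared by the whole 2x2 block. */
            int32_t r_add = tab_cr_r[cr[col / 2]];
            int32_t g_sub = tab_cr_g[cr[col / 2]] + tab_cb_g[cb[col / 2]];
            int32_t b_add = tab_cb_b[cb[col / 2]];

            put_pixel(o0 + 4 * col,       tab_y[y0[col]],     r_add, g_sub, b_add);
            put_pixel(o0 + 4 * (col + 1), tab_y[y0[col + 1]], r_add, g_sub, b_add);
            put_pixel(o1 + 4 * col,       tab_y[y1[col]],     r_add, g_sub, b_add);
            put_pixel(o1 + 4 * (col + 1), tab_y[y1[col + 1]], r_add, g_sub, b_add);
        }
    }
}
```

Each output pixel costs a handful of dependent loads (plane byte, then table entry) plus three clamps, which is consistent with the "lots of memory access" profile above. A native replacement would have roughly this signature, assuming the game's function takes plane pointers and strides in some form.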
What causes the stutter
Through strace analysis, I can see the movie data reader thread disappears from syscalls for 1.31 seconds (normally it cycles every ~1s). During this gap, no new decoded frames flow into the pipeline and the frame buffer runs dry. It's not I/O — all file reads complete in <1ms. The thread is just stuck executing emulated x86 Theora code.
The question
I wanted to ask for your guidance on the best approach here. A few ideas I had:
- Check if any x86 instructions in the hot loop fall back to `ppc64le_next`: there's 1.8% in the interpreter, and if some of that is from this loop, it could be a quick win to add JIT support for those opcodes.
- Native function replacement: Could box64 intercept calls to `decodeRGBX` (even though it's statically linked) and replace it with a native PPC64LE YUV→RGBX implementation? Kind of like what GOM does for library functions, but for a game symbol. Not sure if there's precedent for this.
- JIT codegen quality: The inner loop is mostly `mov`, `movzbl`, `imul`, `add`, `sub`, `sar`, `cmp`, `jbe`, `neg` with heavy use of stack-relative addressing (`mov 0x28(%esp),%edi` etc.). Is there anything in the PPC64LE dynarec that might generate suboptimal code for these patterns? For instance, the frequent `sar $0xd` + `cmp $0xff` + `neg` + `sar $0x1f` clamping sequence appears ~10 times.
- Just accept it: Maybe this is just the inherent cost of emulating a tight pixel-processing loop and there's not much to be done without native Theora libraries.
Any thoughts on which direction would be most productive? Happy to provide more data or test patches.
Environment
- Platform: PPC64LE (POWER9), 32 cores, 64GB RAM
- Build flags: `cmake .. -DPPC64LE=1 -DBOX32=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo`
- Game: Grim Fandango Remastered (32-bit i386 ELF, not stripped, 9377 symbols)
- Branch: based on current main with BOX32 wrappers for libogg, libvorbis, pulse-simple
Related: #3577 (dynarec block dispatch overhead on >4KB page systems)