Video stutter in Grim Fandango Remastered: decodeRGBX (YUV→RGBX) dominates CPU in emulated Theora playback #3599

Hey ptitSeb,

I've been investigating a video playback stutter in Grim Fandango Remastered running on PPC64LE (POWER9) with BOX32 enabled. The intro movie stutters visibly for a few frames around the 2:20 mark. Audio plays fine throughout.

After profiling with `perf` + `BOX64_DYNAREC_PERFMAP=1`, the bottleneck is very clear:

## Profiling results

| % CPU | Symbol | What it is |
|-------|--------|------------|
| **~35-40%** | `decodeRGBX` | YUV→RGBX color conversion (statically linked Theora) |
| 3.3% | `__memcpy_power7` | native memcpy |
| 1.8% | `ppc64le_next` | dynarec interpreter fallback |
| ~1.6% | `oc_dec_dc_unpredict_mcu_plane_c` | Theora DC coefficient decode |
| ~0.6% | `oc_frag_recon_intra_c` | Theora fragment reconstruction |
| ~0.4% | `oc_huff_token_decode_c` | Theora Huffman decode |

The game statically links all of its Theora/Ogg/Vorbis code — no shared libraries to wrap.

## What decodeRGBX does

It's a straightforward YUV→RGBX pixel converter with a tight inner loop (~450 bytes of x86 at `0x08226570`). The loop body:
- Reads Y, Cb, Cr from separate planes via `movzbl`
- Looks up values in 5 fixed-point coefficient tables (via `mov` with scaled index)
- Does fixed-point arithmetic: `imul`, `add`, `sub`, `sar $0xd`
- Clamps to 0-255 using a `cmp $0xff` / `neg` / `sar $0x1f` pattern
- Stores bytes to the output RGBX buffer
- Processes 4 pixels per iteration (a 2x2 block, matching the chroma subsampling)

No SIMD, no floating point — pure integer with lots of memory access through lookup tables.
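For readers who want a concrete picture, here is a minimal C sketch of a converter with the shape described above. It is not decompiled from the binary: the table names, the BT.601 limited-range coefficients, the R,G,B,X byte order, and the `0xff` padding byte are all assumptions, and only the 13-bit shift, the five tables, and the 2x2 block structure come from the disassembly notes. The real loop also uses `imul`, so it likely splits the work between tables and multiplies differently than this sketch does.

```c
#include <stddef.h>
#include <stdint.h>

#define FIX_SHIFT 13 /* same shift as the `sar $0xd` in the disassembly */

/* Five hypothetical lookup tables: luma, Cr->R, Cr->G, Cb->G, Cb->B. */
static int32_t tab_y[256], tab_cr_r[256], tab_cr_g[256], tab_cb_g[256], tab_cb_b[256];

/* Fill the tables with BT.601 limited-range coefficients scaled by 2^13.
 * Assumed values; the game's real tables were not dumped. */
static void init_tables(void)
{
    for (int i = 0; i < 256; i++) {
        tab_y[i]    =  9535 * (i - 16);   /*  1.164 * 8192 */
        tab_cr_r[i] = 13074 * (i - 128);  /*  1.596 * 8192 */
        tab_cr_g[i] = -6660 * (i - 128);  /* -0.813 * 8192 */
        tab_cb_g[i] = -3203 * (i - 128);  /* -0.391 * 8192 */
        tab_cb_b[i] = 16531 * (i - 128);  /*  2.018 * 8192 */
    }
}

/* Saturate to 0..255; one unsigned compare catches both out-of-range cases. */
static uint8_t clamp_u8(int32_t v)
{
    if ((uint32_t)v > 255u)
        v = (v < 0) ? 0 : 255;
    return (uint8_t)v;
}

/* Convert one pixel. Byte order R,G,B,X and X = 0xff are assumptions.
 * `>>` on a negative int is an arithmetic shift on GCC/Clang, like `sar`. */
static void put_rgbx(uint8_t *dst, uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t l = tab_y[y];
    dst[0] = clamp_u8((l + tab_cr_r[cr]) >> FIX_SHIFT);
    dst[1] = clamp_u8((l + tab_cr_g[cr] + tab_cb_g[cb]) >> FIX_SHIFT);
    dst[2] = clamp_u8((l + tab_cb_b[cb]) >> FIX_SHIFT);
    dst[3] = 0xff;
}

/* 4:2:0 conversion: each iteration handles a 2x2 luma block sharing one
 * Cb/Cr sample. Width and height are assumed even. */
static void yuv420_to_rgbx(const uint8_t *yp, size_t ystride,
                           const uint8_t *cbp, const uint8_t *crp, size_t cstride,
                           uint8_t *out, size_t ostride, int w, int h)
{
    for (int j = 0; j < h; j += 2) {
        for (int i = 0; i < w; i += 2) {
            uint8_t cb = cbp[(j / 2) * cstride + i / 2];
            uint8_t cr = crp[(j / 2) * cstride + i / 2];
            for (int dy = 0; dy < 2; dy++)
                for (int dx = 0; dx < 2; dx++)
                    put_rgbx(out + (j + dy) * ostride + (i + dx) * 4,
                             yp[(j + dy) * ystride + i + dx], cb, cr);
        }
    }
}
```

Even in this simplified form, every output pixel costs several dependent table loads plus three clamps, which gives a sense of how much memory traffic the emulated loop generates per frame.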

## What causes the stutter

strace shows the movie data reader thread making no syscalls for **1.31 seconds** (it normally cycles every ~1s). During this gap, no new decoded frames flow into the pipeline and the frame buffer runs dry. It's not I/O: all file reads complete in <1ms. The thread is just stuck executing emulated x86 Theora code.

## The question

I wanted to ask for your guidance on the best approach here. A few ideas I had:

1. **Check if any x86 instructions in the hot loop fall back to `ppc64le_next`** — there's 1.8% in the interpreter, and if some of that is from this loop, it could be a quick win to add JIT support for those opcodes.

2. **Native function replacement** — Could box64 intercept calls to `decodeRGBX` (even though it's statically linked) and replace it with a native PPC64LE YUV→RGBX implementation? Kind of like what GOM does for library functions, but for a game symbol. Not sure if there's precedent for this.

3. **JIT codegen quality** — The inner loop is mostly `mov`, `movzbl`, `imul`, `add`, `sub`, `sar`, `cmp`, `jbe`, `neg` with heavy use of stack-relative addressing (`mov 0x28(%esp),%edi` etc.). Is there anything in the PPC64LE dynarec that might generate suboptimal code for these patterns? For instance, the frequent `sar $0xd` + `cmp $0xff` + `neg` + `sar $0x1f` clamping sequence appears ~10 times.

4. **Just accept it** — Maybe this is just the inherent cost of emulating a tight pixel-processing loop and there's not much to be done without native Theora libraries.
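To make idea 3 easier to discuss, here is how the clamping sequence reads when written out in C. The mapping of each statement to an instruction is my interpretation of the `cmp $0xff` / `jbe` / `neg` / `sar $0x1f` pattern, not something confirmed against the game's source, and `clamp_neg_sar` is a made-up name. It assumes `>>` on a negative `int32_t` is an arithmetic shift, which holds on GCC and Clang.

```c
#include <stdint.h>

/* Saturate a fixed-point result to 0..255 the way the x86 sequence appears to. */
static uint8_t clamp_neg_sar(int32_t v)
{
    if ((uint32_t)v > 0xffu) { /* cmp $0xff ; jbe skips the fixup when 0 <= v <= 255 */
        v = -v;                /* neg: sign bit is now set only if v was above 255 */
        v >>= 31;              /* sar $0x1f: -1 (all ones) if v was > 255, 0 if v was < 0 */
    }
    return (uint8_t)v;         /* byte store keeps the low 8 bits: 0xff or 0x00 */
}
```

Each clamp is therefore one flag-setting compare, one conditional branch, and two flag-setting ALU ops, so how the dynarec handles x86 flags around this sequence seems like the first thing worth checking.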

Any thoughts on which direction would be most productive? Happy to provide more data or test patches.

## Environment
- **Platform**: PPC64LE (POWER9), 32 cores, 64GB RAM
- **Build flags**: `cmake .. -DPPC64LE=1 -DBOX32=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo`
- **Game**: Grim Fandango Remastered (32-bit i386 ELF, not stripped, 9377 symbols)
- **Branch**: based on current main with BOX32 wrappers for libogg, libvorbis, pulse-simple

Related: #3577 (dynarec block dispatch overhead on >4KB page systems)
