Contributor

@cfsmp3 cfsmp3 commented Dec 28, 2025

Summary

  • Adds support for extracting VOBSUB (bitmap) subtitles from MP4 files with OCR conversion to text formats
  • Creates shared vobsub_decoder module for SPU parsing and OCR integration
  • Detects subp:MPEG tracks in MP4 container and processes them through OCR pipeline

Changes

New Files

  • src/lib_ccx/vobsub_decoder.c - VOBSUB decoder with SPU parsing and OCR
  • src/lib_ccx/vobsub_decoder.h - Public API header

Modified Files

  • src/lib_ccx/mp4.c - Add VOBSUB track detection and processing
  • src/lib_ccx/matroska.c - Integrate shared VOBSUB decoder for MKV OCR support

Features

The VOBSUB decoder module provides:

  • SPU control sequence parsing (timing, colors, coordinates)
  • RLE-encoded bitmap decoding (interlaced format)
  • Palette parsing from idx header format
  • Integration with Tesseract OCR via ocr_rect()
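To illustrate the RLE and interlacing items, here is a rough sketch of the commonly documented DVD subpicture run-length scheme: variable-length codes of 4, 8, 12 or 16 bits, with the low two bits giving the colour index and the remaining bits the run length. It is not the code in vobsub_decoder.c; the type and function names are hypothetical, and error handling is reduced to treating reads past the buffer as zero padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nibble cursor over an SPU packet. */
typedef struct {
    const uint8_t *data;
    size_t size;   /* bytes available */
    size_t nibble; /* index of the next 4-bit unit */
} nibble_reader;

static unsigned read_nibble(nibble_reader *r)
{
    size_t byte = r->nibble >> 1;
    if (byte >= r->size)
        return 0; /* past the end: behave as zero padding */
    unsigned v = (r->nibble & 1) ? (r->data[byte] & 0x0Fu)
                                 : (unsigned)(r->data[byte] >> 4);
    r->nibble++;
    return v;
}

/* Decode one scan line into out[0..width-1], one colour index per pixel. */
static void decode_rle_line(nibble_reader *r, uint8_t *out, int width)
{
    assert(r != NULL && out != NULL && width > 0);
    int x = 0;
    while (x < width) {
        /* Grow the code nibble by nibble until it is large enough. */
        unsigned v = read_nibble(r);
        if (v < 0x4) {
            v = (v << 4) | read_nibble(r);
            if (v < 0x10) {
                v = (v << 4) | read_nibble(r);
                if (v < 0x40)
                    v = (v << 4) | read_nibble(r);
            }
        }
        uint8_t color = (uint8_t)(v & 3);
        int len = (int)(v >> 2);
        /* A zero length means "fill to the end of the line". */
        if (len == 0 || len > width - x)
            len = width - x;
        for (int i = 0; i < len; i++)
            out[x++] = color;
    }
    if (r->nibble & 1)
        r->nibble++; /* each line starts on a byte boundary */
}

/* Interlaced layout: even lines come from the top field,
 * odd lines from the bottom field (offsets in bytes). */
static void decode_spu_bitmap(const uint8_t *buf, size_t size,
                              size_t top_off, size_t bottom_off,
                              uint8_t *pixels, int width, int height)
{
    nibble_reader field[2] = {
        { buf, size, top_off * 2 },
        { buf, size, bottom_off * 2 },
    };
    for (int y = 0; y < height; y++)
        decode_rle_line(&field[y & 1], pixels + (size_t)y * width, width);
}
```

The output is a buffer of 2-bit colour indices; mapping them through the palette and alpha values from the control sequence is a separate step before the bitmap can be handed to OCR.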

Test Results

Tested with sample from issue #1349:

Track 3, type=subp subtype=MPEG
MP4: found 4 tracks: 1 avc, 0 hevc, 1 cc, 1 vobsub
Processing VOBSUB track (128 samples)
VOBSUB processing complete

Successfully extracted 61 subtitles with accurate OCR output:

1
00:22:31,000 --> 00:22:32,874
I have a message from the Shield of Light

2
00:22:33,333 --> 00:22:35,040
Remember the old ways

Test plan

  • Compiles with OCR support (-DWITH_OCR=ON)
  • Compiles without OCR support
  • Extracts VOBSUB from MP4 with accurate OCR text
  • Proper error message when OCR not available
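Building both with and without OCR depends on conditional compilation. One of the commits in this PR notes that the ocr_text field of struct cc_bitmap exists only when ENABLE_OCR is defined, so code touching it needs the same guard. A minimal sketch of that pattern follows; the struct and function names are hypothetical stand-ins, and only the ENABLE_OCR macro name comes from the PR.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bitmap record with an OCR-only field. */
struct bitmap_sketch {
    unsigned char *pixels;
#ifdef ENABLE_OCR
    char *ocr_text; /* present only in OCR-enabled builds */
#endif
};

/* Release everything the record owns; the OCR field is touched
 * only when it exists, so both build configurations compile. */
static void free_bitmap_sketch(struct bitmap_sketch *b)
{
    assert(b != NULL);
    free(b->pixels);
    b->pixels = NULL;
#ifdef ENABLE_OCR
    free(b->ocr_text);
    b->ocr_text = NULL;
#endif
}

/* Lets callers report a clear error instead of silently
 * producing no output when OCR was compiled out. */
static int ocr_available(void)
{
#ifdef ENABLE_OCR
    return 1;
#else
    return 0;
#endif
}
```

A caller could check ocr_available() before processing a VOBSUB track and print an error when it returns 0.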

Fixes #1349

🤖 Generated with Claude Code

cfsmp3 and others added 6 commits December 28, 2025 17:32

Add support for extracting VOBSUB (bitmap) subtitles from MP4 files
and converting them to text formats via OCR. This complements the
existing MKV VOBSUB support added in commit 1fccb78.

Changes:
- Add shared vobsub_decoder module for SPU parsing and OCR
- Add process_vobsub_track() function in mp4.c for subp:MPEG tracks
- Detect and count VOBSUB tracks in MP4 container
- Extract palette from decoder config when available
- Process SPU samples through OCR pipeline

The VOBSUB decoder module provides:
- SPU control sequence parsing (timing, colors, coordinates)
- RLE-encoded bitmap decoding (interlaced format)
- Palette parsing from idx header format
- Integration with Tesseract OCR via ocr_rect()
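As an illustration of the palette item: a VobSub .idx header normally carries a line of the form "palette: 000000, 828282, ..." with up to 16 hexadecimal RGB entries. The sketch below parses such a line. The function name is hypothetical and the real module may handle the header differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse up to 16 hex RGB entries from an idx-style "palette:" line.
 * Returns the number of entries stored in palette[]. */
static int parse_idx_palette(const char *line, uint32_t palette[16])
{
    assert(line != NULL && palette != NULL);
    const char *p = strstr(line, "palette:");
    if (p == NULL)
        return 0;
    p += strlen("palette:");

    int n = 0;
    while (n < 16) {
        char *end;
        unsigned long rgb = strtoul(p, &end, 16); /* skips leading spaces */
        if (end == p)
            break; /* no more hex digits */
        palette[n++] = (uint32_t)(rgb & 0xFFFFFFul);
        p = end;
        while (*p == ',' || *p == ' ' || *p == '\t')
            p++; /* step over the separator */
    }
    return n;
}
```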

Tested with sample from issue #1349 - successfully extracted 61
subtitles from 128 SPU samples with accurate OCR text output.

Fixes #1349

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The ocr_text field in struct cc_bitmap is only defined when ENABLE_OCR
is set. Wrap the free() calls with #ifdef ENABLE_OCR to fix build
failures in non-OCR configurations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add vobsub_decoder.c and vobsub_decoder.h to linux and mac Makefile.am
to fix autoconf build failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add vobsub_decoder.c and vobsub_decoder.h to the Visual Studio project
and filters files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 5beb438...:
Report Name   Tests Passed
Broken        13/13
CEA-708       14/14
DVB           7/7
DVD           3/3
DVR-MS        2/2
General       27/27
Hardsubx      1/1
Hauppage      3/3
MP4           3/3
NoCC          10/10
Options       86/86
Teletext      21/21
WTV           13/13
XDS           34/34

Congratulations: Merging this PR would fix the following tests:

  • ccextractor --autoprogram --out=ttxt --latin1 1974a299f0..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 132d7df7e9..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 99e5eaafdc..., Last passed: Never
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65..., Last passed: Never
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b..., Last passed: Never
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9..., Last passed: Never

All tests passed completely.

Check the result page for more info.

@ccextractor-bot
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit ec30a79...:
Report Name   Tests Passed
Broken        13/13
CEA-708       14/14
DVB           6/7
DVD           3/3
DVR-MS        2/2
General       25/27
Hardsubx      1/1
Hauppage      3/3
MP4           3/3
NoCC          10/10
Options       80/86
Teletext      21/21
WTV           13/13
XDS           34/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@cfsmp3 cfsmp3 merged commit 63dde6f into master Dec 29, 2025
23 of 24 checks passed
@cfsmp3 cfsmp3 deleted the fix/issue-1371-mkv-vobsub-support branch December 29, 2025 07:47

Development

Successfully merging this pull request may close these issues.

[BUG] Not extracting dvbsub from mp4

3 participants