Skip to content

Conversation

@cfsmp3
Copy link
Contributor

@cfsmp3 cfsmp3 commented Dec 29, 2025

Summary

  • Add OCR character blacklist (enabled by default) to prevent common misrecognition errors like I|
  • Add optional --ocr-line-split mode for multi-line subtitle images
  • Inspired by subtile-ocr's proven approach

New Options

Option Default Description
--no-ocr-blacklist Blacklist ON Disable the character blacklist (|, \, `, _, ~)
--ocr-line-split OFF Split images into lines, use PSM 7 for each

Test Results (VOBSUB MKV sample)

Metric Before After (blacklist)
Pipe | errors 14 0

The blacklist completely eliminates pipe character misrecognition, matching subtile-ocr's accuracy.

Test plan

  • Build with OCR support
  • Test VOBSUB extraction with blacklist (default)
  • Test with --no-ocr-blacklist to verify it can be disabled
  • Test --ocr-line-split option
  • Verify help text displays correctly

🤖 Generated with Claude Code

cfsmp3 and others added 2 commits December 29, 2025 11:33
…accuracy

Add two new OCR options to improve subtitle recognition:

1. Character blacklist (enabled by default):
   - Blacklists characters |, \, `, _, ~ that are commonly misrecognized
   - Prevents "I" being recognized as "|" (pipe character)
   - Use --no-ocr-blacklist to disable if needed

2. Line-split mode (opt-in via --ocr-line-split):
   - Splits multi-line subtitle images into individual lines
   - Uses PSM 7 (single text line mode) for each line
   - Adds 10px padding around each line for better edge recognition
   - May improve accuracy for some VOBSUB subtitles

Test results with VOBSUB sample:
- Blacklist: Reduces pipe errors from 14 to 0
- Matches subtile-ocr's approach for preventing misrecognition

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use tabs for continuation indentation in C code (clang-format)
- Remove extra trailing spaces in Rust code (rustfmt)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@ccextractor-bot
Copy link
Collaborator

CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 730156f...:
Report Name Tests Passed
Broken 13/13
CEA-708 14/14
DVB 6/7
DVD 3/3
DVR-MS 2/2
General 25/27
Hardsubx 1/1
Hauppage 3/3
MP4 3/3
NoCC 10/10
Options 81/86
Teletext 21/21
WTV 13/13
XDS 34/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

@ccextractor-bot
Copy link
Collaborator

CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 730156f...:
Report Name Tests Passed
Broken 13/13
CEA-708 14/14
DVB 6/7
DVD 3/3
DVR-MS 2/2
General 25/27
Hardsubx 1/1
Hauppage 3/3
MP4 3/3
NoCC 10/10
Options 80/86
Teletext 21/21
WTV 13/13
XDS 34/34

Your PR breaks these cases:

  • ccextractor --autoprogram --out=srt --latin1 --quant 0 85271be4d2...
  • ccextractor --autoprogram --out=ttxt --latin1 --ucla dab1c1bd65...
  • ccextractor --out=srt --latin1 --autoprogram 29e5ffd34b...
  • ccextractor --out=spupng c83f765c66...
  • ccextractor --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotbefore 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsnotafter 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatleast 1 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...
  • ccextractor --startcreditsforatmost 2 --startcreditstext "CCextractor Start crdit Testing" c4dd893cb9...

It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you).

Check the result page for more info.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants