fix(708): Support Korean EUC-KR encoding in CEA-708 decoder #1871
Korean broadcasts use EUC-KR encoding (variable-width) in CEA-708 captions, where ASCII is 1 byte and Korean characters are 2 bytes. The decoder was always writing 2 bytes per character (UTF-16BE style), causing NULL bytes to be inserted before every ASCII character.

Changes:
- Add is_utf16_charset() to detect fixed-width 16-bit encodings
- Modify write_char() to accept use_utf16 flag:
  - true: Always 2 bytes (UTF-16BE for Japanese, issue CCExtractor#1451)
  - false: 1 byte for ASCII, 2 bytes for extended (EUC-KR for Korean)
- Detect charset type in write_row() before building output buffer

This fixes Korean subtitle extraction when using --service "1[EUC-KR]" while maintaining compatibility with Japanese UTF-16BE (issue CCExtractor#1451).

Closes CCExtractor#1065

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
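For illustration, the byte-width rule described in this commit can be sketched roughly as follows. This is a minimal sketch and not the code from the PR: the `Symbol` alias, the `encode_row` helper, the function signatures, the `0x7F` ASCII cutoff, and the charset-name matching are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the decoder's symbol type: one 16-bit code unit.
type Symbol = u16;

/// Assumed check: treat charset names starting with "UTF-16" or "UCS-2"
/// (case-insensitive) as fixed-width 16-bit encodings.
fn is_utf16_charset(name: &str) -> bool {
    let upper = name.to_ascii_uppercase();
    upper.starts_with("UTF-16") || upper.starts_with("UCS-2")
}

/// Append one symbol to `out`.
/// - use_utf16 == true: always two bytes, big-endian.
/// - use_utf16 == false: one byte for ASCII (assumed to mean <= 0x7F),
///   two bytes for anything else (for example an EUC-KR pair).
fn write_char(sym: Symbol, use_utf16: bool, out: &mut Vec<u8>) {
    let [hi, lo] = sym.to_be_bytes();
    if use_utf16 || sym > 0x7F {
        out.push(hi);
        out.push(lo);
    } else {
        out.push(lo);
    }
}

/// Hypothetical helper: decide the width mode once per row, then encode
/// every symbol with it.
fn encode_row(symbols: &[Symbol], charset: &str) -> Vec<u8> {
    let use_utf16 = is_utf16_charset(charset);
    let mut out = Vec::with_capacity(symbols.len() * 2);
    for &sym in symbols {
        write_char(sym, use_utf16, &mut out);
    }
    out
}
```

With the symbols `0x0041` ('A') and `0xB0A1` (the EUC-KR pair for '가'), the `"EUC-KR"` path yields `41 B0 A1`, while a `"UTF-16BE"` charset yields `00 41 B0 A1`. The leading `00` is the stray NULL byte that appeared in the Korean output before the fix.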
The Rust FFI functions were using c_long for PTS/FTS timestamps, but:
- C code uses LLONG (int64_t, 64 bits on all platforms)
- Rust c_long is 32 bits on Windows, 64 bits on Linux

This caused timestamp truncation on Windows when PTS values exceeded 2^31 (about 6.6 hours at 90 kHz), resulting in wrong subtitle timestamps. For example, a file with Min PTS of 23:50:45 (7,726,090,500 ticks) would have its PTS truncated, breaking the teletext delta calculation that normalizes timestamps to start at 0.

Changes:
- ccxr_add_current_pts: pts parameter i64
- ccxr_set_current_pts: pts parameter i64
- ccxr_get_fts: return type i64
- ccxr_get_visible_end: return type i64
- ccxr_get_visible_start: return type i64
- ccxr_get_fts_max: return type i64
- ccxr_print_mstime_static: mstime parameter i64
- fts_at_gop_start: extern static i64

Fixes tests 18 and 19 on Windows CI, which showed raw PTS timestamps (23:50:46) instead of normalized timestamps (00:00:00).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
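The truncation can be reproduced without any FFI by narrowing the value by hand. The sketch below is only an illustration: `through_32bit_long` and `max_seconds_in_i32` are hypothetical helpers, and the `as i32` cast stands in for what happens when a 64-bit tick count passes through a 32-bit c_long.

```rust
/// PTS clock rate in ticks per second.
const PTS_HZ: i64 = 90_000;

/// Hypothetical helper: simulate a 64-bit tick count passing through a
/// 32-bit c_long (as on Windows) and being widened again afterwards.
/// The `as i32` cast keeps only the low 32 bits.
fn through_32bit_long(pts: i64) -> i64 {
    (pts as i32) as i64
}

/// Whole seconds of 90 kHz ticks that fit in a signed 32-bit value.
fn max_seconds_in_i32() -> i64 {
    i32::MAX as i64 / PTS_HZ
}
```

Under these assumptions, the Min PTS from the example (7,726,090,500 ticks) comes back as -863,844,092, while small values pass through unchanged. A signed 32-bit value holds only 23,860 seconds of 90 kHz ticks, roughly 6.6 hours, so any stream whose PTS starts later in the day is affected.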
On Windows, c_long is i32 while on Linux it's i64. The function ccxr_print_mstime_static expects i64, so casting to c_long caused a type mismatch error on Windows builds.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
CCExtractor CI platform finished running the test files on linux. Below is a summary of the test results, when compared to test for commit 6503502...:
NOTE: The following tests have been failing on the master branch as well as the PR:
Congratulations: Merging this PR would fix the following tests:
All tests passing on the master branch were passed completely. Check the result page for more info.
CCExtractor CI platform finished running the test files on windows. Below is a summary of the test results, when compared to test for commit 6503502...:
Your PR breaks these cases:
Congratulations: Merging this PR would fix the following tests:
It seems that not all tests were passed completely. This is an indication that the output of some files is not as expected (but might be according to you). Check the result page for more info.
Summary
Korean broadcasts use EUC-KR encoding (variable-width) in CEA-708 captions, where ASCII is 1 byte and Korean characters are 2 bytes. The decoder was always writing 2 bytes per character (UTF-16BE style), causing NULL bytes (0x00) to be inserted before every ASCII character (spaces, punctuation).
Changes
- Add is_utf16_charset() function to detect fixed-width 16-bit encodings (UTF-16BE, UCS-2)
- Modify write_char() to accept use_utf16 flag:
  - true: Always 2 bytes (UTF-16BE for Japanese/Chinese, maintains fix for "[BUG] A mix of 8-bit/16-bit chars sent to iconv" #1451)
  - false: 1 byte for ASCII, 2 bytes for extended chars (EUC-KR for Korean)
- Detect charset type in write_row() before building output buffer

Before fix
After fix
Test plan
- mbc.ts: drama dialog extracted correctly
- 0623_215529_CH9-1_KBS.mpg: news broadcast extracted correctly
- --service "1[EUC-KR]"

Closes #1065
🤖 Generated with Claude Code