129 changes: 129 additions & 0 deletions docs/VOBSUB.md
@@ -0,0 +1,129 @@
# VOBSUB Subtitle Extraction from MKV Files

CCExtractor supports extracting VOBSUB (S_VOBSUB) subtitles from Matroska (MKV) containers. VOBSUB is an image-based subtitle format originally from DVD video.

## Overview

VOBSUB subtitles consist of two files:
- `.idx` - Index file containing metadata, palette, and timestamp/position entries
- `.sub` - Binary file containing the actual subtitle bitmap data in MPEG Program Stream format

## Basic Usage

```bash
ccextractor movie.mkv
```

This will extract all VOBSUB tracks and create paired `.idx` and `.sub` files:
- `movie_eng.idx` + `movie_eng.sub` (first English track)
- `movie_eng_1.idx` + `movie_eng_1.sub` (second English track, if present)
- etc.

## Converting VOBSUB to SRT (Text)

Since VOBSUB subtitles are images, you need OCR (Optical Character Recognition) to convert them to text-based formats like SRT.

### Using subtile-ocr (Recommended)

[subtile-ocr](https://github.com/gwen-lg/subtile-ocr) is an actively maintained Rust tool that provides accurate OCR conversion.

#### Option 1: Docker (Easiest)

We provide a Dockerfile that builds subtile-ocr with all dependencies:

```bash
# Build the Docker image (one-time, run from the repository root)
docker build -t subtile-ocr tools/vobsubocr

# Extract VOBSUB from MKV
ccextractor movie.mkv

# Convert to SRT using OCR
docker run --rm -v "$(pwd)":/data subtile-ocr -l eng -o /data/movie_eng.srt /data/movie_eng.idx
```

#### Option 2: Install subtile-ocr Natively

This requires a Rust toolchain (for `cargo`) plus the Tesseract and Leptonica development libraries:

```bash
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr tesseract-ocr-eng

# Install subtile-ocr
cargo install --git https://github.com/gwen-lg/subtile-ocr

# Convert
subtile-ocr -l eng -o movie_eng.srt movie_eng.idx
```

### subtile-ocr Options

| Option | Description |
|--------|-------------|
| `-l, --lang <LANG>` | Tesseract language code (required). Examples: `eng`, `fra`, `deu`, `chi_sim` |
| `-o, --output <FILE>` | Output SRT file (stdout if not specified) |
| `-t, --threshold <0.0-1.0>` | Binarization threshold (default: 0.6) |
| `-d, --dpi <DPI>` | Image DPI for OCR (default: 150) |
| `--dump` | Save processed subtitle images as PNG files |

### Language Codes

Install additional Tesseract language packs as needed:

```bash
# Examples
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-spa # Spanish
sudo apt-get install tesseract-ocr-chi-sim # Simplified Chinese
```

## Technical Details

### .idx File Format

The index file contains:
1. Header with metadata (size, palette, alignment settings)
2. Language identifier line
3. Timestamp entries with file positions

Example:
```
# VobSub index file, v7 (do not modify this line!)
size: 720x576
palette: 000000, 828282, ...

id: eng, index: 0
timestamp: 00:01:12:920, filepos: 000000000
timestamp: 00:01:18:640, filepos: 000000800
...
```
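The entry format above can be reproduced in a few lines of C. This is a minimal sketch, not CCExtractor's actual code: `format_idx_entry` and `parse_idx_entry` are hypothetical helper names, and the parser assumes the exact spacing shown in the example. Note that `filepos` is a zero-padded hexadecimal byte offset into the `.sub` file, so `000000800` means byte 2048.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one .idx entry from a start time in
 * milliseconds and a byte offset into the .sub file.
 * Returns the number of characters written, or -1 on truncation. */
static int format_idx_entry(char *buf, size_t bufsize,
                            unsigned long long ms,
                            unsigned long long filepos)
{
    assert(buf != NULL && bufsize > 0);

    unsigned long long millis = ms % 1000;
    unsigned long long total_seconds = ms / 1000;
    unsigned long long seconds = total_seconds % 60;
    unsigned long long minutes = (total_seconds / 60) % 60;
    unsigned long long hours = total_seconds / 3600;

    /* filepos is written as 9 zero-padded hexadecimal digits */
    int n = snprintf(buf, bufsize,
                     "timestamp: %02llu:%02llu:%02llu:%03llu, filepos: %09llx\n",
                     hours, minutes, seconds, millis, filepos);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : n;
}

/* Hypothetical helper: parse an entry back into milliseconds and a
 * byte offset. Returns 0 on success, -1 for any other kind of line. */
static int parse_idx_entry(const char *line, unsigned long long *ms,
                           unsigned long long *filepos)
{
    unsigned int h, m, s, milli;

    /* Skip header lines such as "size:" or "palette:" */
    if (strncmp(line, "timestamp: ", 11) != 0)
        return -1;
    if (sscanf(line, "timestamp: %u:%u:%u:%u, filepos: %llx",
               &h, &m, &s, &milli, filepos) != 5)
        return -1;

    *ms = ((h * 60ULL + m) * 60ULL + s) * 1000ULL + milli;
    return 0;
}
```

Calling `format_idx_entry(buf, sizeof buf, 78640, 2048)` yields the second entry of the example, and `parse_idx_entry` recovers the same two numbers from it.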

### .sub File Format

The binary file contains MPEG Program Stream packets:
- Each subtitle is wrapped in a PS Pack header (14 bytes) + PES header (15 bytes)
- Subtitles are aligned to 2048-byte boundaries
- Contains raw SPU (SubPicture Unit) bitmap data
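
The layout described above can be sketched in C as follows. This is an illustrative sketch under stated assumptions, not the shipped implementation: the function names are hypothetical, the header sizes are the 14 and 15 bytes listed above, and the PTS layout is the standard 5-byte PES encoding of a 33-bit, 90 kHz timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PS_PACK_HEADER_SIZE 14
#define PES_HEADER_SIZE 15
#define VOBSUB_BLOCK_SIZE 2048

/* Total bytes one subtitle occupies in the .sub file: both headers
 * plus the SPU data, rounded up to the next 2048-byte boundary. */
static size_t padded_pack_size(size_t spu_size)
{
    size_t raw = PS_PACK_HEADER_SIZE + PES_HEADER_SIZE + spu_size;
    size_t rem = raw % VOBSUB_BLOCK_SIZE;
    return rem == 0 ? raw : raw + (VOBSUB_BLOCK_SIZE - rem);
}

/* Encode a 33-bit PTS (90 kHz ticks) into the 5-byte PES field:
 * '0010' PTS[32:30] '1' PTS[29:15] '1' PTS[14:0] '1' */
static void encode_pts(uint8_t out[5], uint64_t pts)
{
    assert(pts < (1ULL << 33)); /* the field only holds 33 bits */

    out[0] = (uint8_t)(0x21 | ((pts >> 29) & 0x0E));
    out[1] = (uint8_t)((pts >> 22) & 0xFF);
    out[2] = (uint8_t)(0x01 | ((pts >> 14) & 0xFE));
    out[3] = (uint8_t)((pts >> 7) & 0xFF);
    out[4] = (uint8_t)(0x01 | ((pts << 1) & 0xFE));
}

/* Inverse of encode_pts, handy for checking the bit layout. */
static uint64_t decode_pts(const uint8_t in[5])
{
    return ((uint64_t)(in[0] & 0x0E) << 29) |
           ((uint64_t)in[1] << 22) |
           ((uint64_t)(in[2] & 0xFE) << 14) |
           ((uint64_t)in[3] << 7) |
           ((uint64_t)in[4] >> 1);
}
```

Because every subtitle is padded this way, the `filepos` of a given entry in the `.idx` file is the sum of `padded_pack_size()` over all earlier subtitles. A millisecond timestamp is converted to 90 kHz ticks by multiplying by 90 before encoding.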

## Troubleshooting

### Empty output files
- Ensure the MKV file actually contains VOBSUB tracks (check with `mediainfo` or `ffprobe`)
- CCExtractor will report "No VOBSUB subtitles to write" if the track is empty

### OCR quality issues
- Try adjusting the `-t` threshold parameter
- Ensure the correct language pack is installed
- Use `--dump` to inspect the processed images

### Docker permission issues
- The output files may be owned by root; use `sudo chown` to fix ownership
- Or run Docker with `--user $(id -u):$(id -g)`

## See Also

- [OCR.md](OCR.md) - General OCR support in CCExtractor
- [subtile-ocr GitHub](https://github.com/gwen-lg/subtile-ocr) - OCR tool documentation
241 changes: 232 additions & 9 deletions src/lib_ccx/matroska.c
@@ -1334,11 +1334,243 @@ char *ass_ssa_sentence_erase_read_order(char *text)
return buf;
}

/* VOBSUB support: Generate PS Pack header
* The PS Pack header is 14 bytes:
* - 4 bytes: start code (00 00 01 ba)
* - 6 bytes: SCR (System Clock Reference) in MPEG-2 format
* - 3 bytes: mux rate
* - 1 byte: stuffing length (0)
*/
static void generate_ps_pack_header(unsigned char *buf, ULLONG pts_90khz)
{
// PS Pack start code
buf[0] = 0x00;
buf[1] = 0x00;
buf[2] = 0x01;
buf[3] = 0xBA;

// SCR (System Clock Reference) - use PTS as SCR base, SCR extension = 0
// MPEG-2 format: 01 SCR[32:30] 1 SCR[29:15] 1 SCR[14:0] 1 SCR_ext[8:0] 1
ULLONG scr = pts_90khz;
ULLONG scr_base = scr;
int scr_ext = 0;

buf[4] = 0x44 | ((scr_base >> 27) & 0x38) | ((scr_base >> 28) & 0x03);
buf[5] = (scr_base >> 20) & 0xFF;
buf[6] = 0x04 | ((scr_base >> 12) & 0xF8) | ((scr_base >> 13) & 0x03);
buf[7] = (scr_base >> 5) & 0xFF;
buf[8] = 0x04 | ((scr_base << 3) & 0xF8) | ((scr_ext >> 7) & 0x03);
buf[9] = ((scr_ext << 1) & 0xFE) | 0x01;

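// Mux rate: 22-bit field in units of 50 bytes/s.
// NOTE: 10080 matches the DVD rate expressed in kbit/s (10.08 Mbit/s);
// in field units that rate would be 25200. Players generally ignore it.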
int mux_rate = 10080;
buf[10] = (mux_rate >> 14) & 0xFF;
buf[11] = (mux_rate >> 6) & 0xFF;
buf[12] = ((mux_rate << 2) & 0xFC) | 0x03;

// Stuffing length = 0, with marker bits
buf[13] = 0xF8;
}

/* VOBSUB support: Generate PES header for private stream 1
 * Returns the total header size in bytes. This is always 15:
 * 9 fixed bytes + 5-byte PTS + 1-byte substream ID.
 */
static int generate_pes_header(unsigned char *buf, ULLONG pts_90khz, int payload_size, int stream_id)
{
// PES start code for private stream 1
buf[0] = 0x00;
buf[1] = 0x00;
buf[2] = 0x01;
buf[3] = 0xBD; // Private stream 1

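// PES packet length = flag bytes (3) + PTS (5) + substream ID (1) + payload.
// NOTE: the field is only 16 bits wide, so a payload larger than 65526 bytes
// would be truncated here; such SPUs are not split across packets.
int pes_header_data_len = 5; // PTS only
int pes_packet_len = 3 + pes_header_data_len + 1 + payload_size;
buf[4] = (pes_packet_len >> 8) & 0xFF;
buf[5] = pes_packet_len & 0xFF;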

// PES flags: MPEG-2, original
buf[6] = 0x81;
// PTS_DTS_flags = 10 (PTS only)
buf[7] = 0x80;
// PES header data length
buf[8] = pes_header_data_len;

// PTS (5 bytes): '0010' | PTS[32:30] | '1' | PTS[29:15] | '1' | PTS[14:0] | '1'
buf[9] = 0x21 | ((pts_90khz >> 29) & 0x0E);
buf[10] = (pts_90khz >> 22) & 0xFF;
buf[11] = 0x01 | ((pts_90khz >> 14) & 0xFE);
buf[12] = (pts_90khz >> 7) & 0xFF;
buf[13] = 0x01 | ((pts_90khz << 1) & 0xFE);

// Substream ID (0x20 = first VOBSUB stream)
buf[14] = 0x20 + stream_id;

return 15; // Total PES header size
}

/* VOBSUB support: Generate timestamp string for .idx file
* Format: HH:MM:SS:mmm (where mmm is milliseconds)
*/
static void generate_vobsub_timestamp(char *buf, size_t bufsize, ULLONG milliseconds)
{
ULLONG ms = milliseconds % 1000;
milliseconds /= 1000;
ULLONG seconds = milliseconds % 60;
milliseconds /= 60;
ULLONG minutes = milliseconds % 60;
milliseconds /= 60;
ULLONG hours = milliseconds;

snprintf(buf, bufsize, "%02" LLU_M ":%02" LLU_M ":%02" LLU_M ":%03" LLU_M,
hours, minutes, seconds, ms);
}

/* VOBSUB support: Save VOBSUB track to .idx and .sub files */
#define VOBSUB_BLOCK_SIZE 2048
static void save_vobsub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *track)
{
if (track->sentence_count == 0)
{
mprint("\nNo VOBSUB subtitles to write");
return;
}

// Generate base filename (without extension)
const char *lang_to_use = track->lang_ietf ? track->lang_ietf : track->lang;
const char *basename = get_basename(mkv_ctx->filename);
size_t needed = strlen(basename) + strlen(lang_to_use) + 32;
char *base_filename = malloc(needed);
if (base_filename == NULL)
fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");

if (track->lang_index == 0)
snprintf(base_filename, needed, "%s_%s", basename, lang_to_use);
else
snprintf(base_filename, needed, "%s_%s_" LLD, basename, lang_to_use, track->lang_index);

// Create .sub filename
char *sub_filename = malloc(needed + 5);
if (sub_filename == NULL)
fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");
snprintf(sub_filename, needed + 5, "%s.sub", base_filename);

// Create .idx filename
char *idx_filename = malloc(needed + 5);
if (idx_filename == NULL)
fatal(EXIT_NOT_ENOUGH_MEMORY, "In save_vobsub_track: Out of memory.");
snprintf(idx_filename, needed + 5, "%s.idx", base_filename);

mprint("\nOutput files: %s, %s", idx_filename, sub_filename);

// Open .sub file
int sub_desc;
#ifdef WIN32
sub_desc = open(sub_filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, S_IREAD | S_IWRITE);
#else
sub_desc = open(sub_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
#endif
if (sub_desc < 0)
{
mprint("\nError: Cannot create .sub file");
free(base_filename);
free(sub_filename);
free(idx_filename);
return;
}

// Open .idx file
int idx_desc;
#ifdef WIN32
idx_desc = open(idx_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IREAD | S_IWRITE);
#else
idx_desc = open(idx_filename, O_WRONLY | O_CREAT | O_TRUNC, S_IWUSR | S_IRUSR);
#endif
if (idx_desc < 0)
{
mprint("\nError: Cannot create .idx file");
close(sub_desc);
free(base_filename);
free(sub_filename);
free(idx_filename);
return;
}

// Write .idx header (from CodecPrivate)
if (track->header != NULL)
write_wrapped(idx_desc, track->header, strlen(track->header));

// Add language identifier line
char lang_line[128];
snprintf(lang_line, sizeof(lang_line), "\nid: %s, index: 0\n", lang_to_use);
write_wrapped(idx_desc, lang_line, strlen(lang_line));

// Buffer for PS/PES headers and padding
unsigned char header_buf[32];
unsigned char zero_buf[VOBSUB_BLOCK_SIZE];
memset(zero_buf, 0, VOBSUB_BLOCK_SIZE);

ULLONG file_pos = 0;

// Write each subtitle
for (int i = 0; i < track->sentence_count; i++)
{
struct matroska_sub_sentence *sentence = track->sentences[i];
mkv_ctx->sentence_count++;

// Convert timestamp to 90kHz PTS
ULLONG pts_90khz = sentence->time_start * 90;

// Write timestamp entry to .idx
char timestamp[32];
generate_vobsub_timestamp(timestamp, sizeof(timestamp), sentence->time_start);
char idx_entry[128];
snprintf(idx_entry, sizeof(idx_entry), "timestamp: %s, filepos: %09" LLX_M "\n",
timestamp, file_pos);
write_wrapped(idx_desc, idx_entry, strlen(idx_entry));

// Generate PS Pack header (14 bytes)
generate_ps_pack_header(header_buf, pts_90khz);
write_wrapped(sub_desc, (char *)header_buf, 14);

// Generate PES header (15 bytes)
int pes_header_len = generate_pes_header(header_buf, pts_90khz, sentence->text_size, 0);
write_wrapped(sub_desc, (char *)header_buf, pes_header_len);

// Write SPU data
write_wrapped(sub_desc, sentence->text, sentence->text_size);

// Calculate bytes written and pad to block boundary
ULLONG bytes_written = 14 + pes_header_len + sentence->text_size;
ULLONG padding_needed = VOBSUB_BLOCK_SIZE - (bytes_written % VOBSUB_BLOCK_SIZE);
if (padding_needed < VOBSUB_BLOCK_SIZE)
{
write_wrapped(sub_desc, (char *)zero_buf, padding_needed);
bytes_written += padding_needed;
}

file_pos += bytes_written;
}

close(sub_desc);
close(idx_desc);
free(base_filename);
free(sub_filename);
free(idx_filename);
}

void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *track)
{
char *filename;
int desc;

// VOBSUB tracks need special handling - separate .idx and .sub files
if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
{
save_vobsub_track(mkv_ctx, track);
return;
}

if (mkv_ctx->ctx->cc_to_stdout == CCX_TRUE)
{
desc = 1; // file descriptor of stdout
@@ -1358,11 +1590,6 @@ void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *tra
if (track->header != NULL)
write_wrapped(desc, track->header, strlen(track->header));

if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
{
mprint("\nError: VOBSUB not supported");
}

for (int i = 0; i < track->sentence_count; i++)
{
struct matroska_sub_sentence *sentence = track->sentences[i];
@@ -1497,10 +1724,6 @@ void save_sub_track(struct matroska_ctx *mkv_ctx, struct matroska_sub_track *tra
free(timestamp_start);
free(timestamp_end);
}
else if (track->codec_id == MATROSKA_TRACK_SUBTITLE_CODEC_ID_VOBSUB)
{
// TODO: Add support for VOBSUB
}
}
}
