This project implements a system for storing arbitrary binary data within a video stream. Here is the technical breakdown of the process.
The encoder treats a video frame as a grid of data "blocks".
Given a resolution (e.g., 1280x720) and a block_size (e.g., 16 pixels):
- The frame is divided into (1280/16) * (720/16) blocks.
- With a block size of 16, this creates an 80x45 grid, allowing for 3,600 bits (450 bytes) per frame.
- Each bit of binary data is mapped to a block.
- Bit 1: The entire block is filled with white pixels (value 255).
- Bit 0: The entire block is filled with black pixels (value 0).
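The capacity arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name frame_capacity is illustrative rather than taken from the project, and it assumes that partial blocks at the frame edge are simply ignored.

```python
def frame_capacity(width, height, block_size):
    """Return (columns, rows, bits_per_frame) for square blocks."""
    cols = width // block_size   # whole blocks that fit across the frame
    rows = height // block_size  # whole blocks that fit down the frame
    return cols, rows, cols * rows


cols, rows, bits = frame_capacity(1280, 720, 16)
# Prints: 80x45 grid, 3600 bits, 450 bytes per frame
print(f"{cols}x{rows} grid, {bits} bits, {bits // 8} bytes per frame")
```

Halving the block size quadruples the number of bits per frame, which is the density/robustness trade-off the rest of this document revolves around.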
To ensure the data can be reassembled correctly and verified, each frame includes a 16-byte header:
- Frame Index (4B): The sequence number of the frame.
- Total Frames (4B): The total number of frames in the sequence.
- Data Length (4B): The number of actual data bytes in this specific frame.
- CRC32 Checksum (4B): A checksum of the payload data for integrity verification.
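One way such a header could be packed and verified is sketched below with Python's struct and zlib modules. The field order follows the list above, but the big-endian byte order and the names pack_frame and unpack_frame are assumptions made for illustration, not details confirmed by the project.

```python
import struct
import zlib

# Assumed layout: four big-endian unsigned 32-bit integers (16 bytes total).
HEADER_FORMAT = ">IIII"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_frame(index, total, payload):
    """Prefix a payload with frame index, total frames, length, and CRC32."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    header = struct.pack(HEADER_FORMAT, index, total, len(payload), crc)
    return header + payload


def unpack_frame(blob):
    """Return (index, total, payload), or None if the frame fails validation."""
    if len(blob) < HEADER_SIZE:
        return None
    index, total, length, crc = struct.unpack(HEADER_FORMAT, blob[:HEADER_SIZE])
    payload = blob[HEADER_SIZE:HEADER_SIZE + length]
    # Reject truncated payloads and payloads whose checksum does not match.
    if len(payload) != length or (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
        return None
    return index, total, payload
```

Because the Data Length field is stored explicitly, the decoder can ignore the padding bits that fill the rest of the final, partially used frame.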
The primary challenge of using YouTube is lossy compression. Platforms like YouTube use aggressive codecs (H.264, VP9, AV1) that discard high-frequency information to save space.
If we used 1x1 pixel blocks, video compression would blur the edges, making it impossible to distinguish between a 0 and a 1. By using larger blocks (16x16), we create redundancy. Even if the edges of the block are blurred or "artifacted," the center of the block remains clearly black or white.
The system can automatically choose an optimal block size based on the input file size when --block-size 0 is specified:
- Small files (<100KB): Uses 32px blocks for extreme robustness.
- Medium files: Uses 16px blocks (default).
- Large files (>10MB): Uses 8px blocks to maximize density.
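The selection rule above amounts to a pair of threshold checks. A minimal sketch follows, assuming that KB and MB mean 1024-based units and using a hypothetical function name:

```python
KB = 1024
MB = 1024 * KB


def choose_block_size(file_size):
    """Pick a block size in pixels from the input file size in bytes."""
    if file_size < 100 * KB:
        return 32  # small files: favor robustness
    if file_size > 10 * MB:
        return 8   # large files: favor density
    return 16      # everything in between: the default
```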
During decoding, if --block-size 0 is passed, the decoder "probes" the first frame with common block sizes (8, 16, 32, etc.) until it finds one that yields a valid header and a passing CRC.
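The probing loop can be sketched as follows. The try_decode callback is hypothetical: it stands in for whatever routine samples the frame at a given block size and returns a parsed header, or None when the header or CRC check fails.

```python
def probe_block_size(frame, try_decode, candidates=(8, 16, 32)):
    """Return the first candidate block size that decodes cleanly, else None.

    try_decode(frame, block_size) is an assumed callback returning a parsed
    header on success and None when the header or CRC is invalid.
    """
    for block_size in candidates:
        if try_decode(frame, block_size) is not None:
            return block_size
    return None
```

One caveat: an all-zero header with an empty payload passes a bare CRC32 check, because the CRC32 of zero bytes is 0. A validator used for probing would therefore probably also want to reject implausible fields, such as a total frame count of zero.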
- When encoding locally, we use either the lossless FFV1 codec or MP4V (which is lossy).
- When YouTube processes the video, it re-encodes it. The large block size ensures that the "signal" (the black/white blocks) survives this transformation.
The decoder reverses the mapping:
- It reads the video frame by frame.
- It samples the center pixel of each block using optimized NumPy grid sampling.
- If the pixel value is > 128, it registers a 1; otherwise, a 0.
- It parses the 16-byte header to identify the frame's position and payload size.
- Once all frames are collected, it reassembles them into the original binary file.
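The final reassembly step can be sketched as below. It assumes each decoded frame has already been reduced to an (index, total, payload) tuple and that frame indices start at 0; both the tuple shape and the function name are illustrative.

```python
def reassemble(frames):
    """Join frame payloads in index order.

    frames: iterable of (index, total, payload) tuples, in any order.
    Raises ValueError if no frames were decoded or if any index is missing.
    """
    chunks = {}
    total = None
    for index, frame_total, payload in frames:
        if total is None:
            total = frame_total
        chunks[index] = payload  # a duplicate index overwrites the earlier copy
    if total is None:
        raise ValueError("no frames decoded")
    missing = [i for i in range(total) if i not in chunks]
    if missing:
        raise ValueError(f"missing frames: {missing}")
    return b"".join(chunks[i] for i in range(total))
```

Keying the chunks by frame index means that frames arriving out of order from parallel workers still end up in the right place.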
Encoding and decoding are CPU-intensive due to Reed-Solomon calculations and image processing. The system utilizes Python's ProcessPoolExecutor to distribute frame processing across all available CPU cores.
- Encoder: Generates multiple frames in parallel and writes them to the video stream in sequence.
- Decoder: Reads frames sequentially (as required by OpenCV) but offloads the grayscale conversion, sampling, and Reed-Solomon decoding to a pool of worker processes.
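The fan-out pattern described above can be sketched with concurrent.futures. The worker here (encode_chunk) only computes a CRC32 as a stand-in for the real per-frame work, and all names are hypothetical; the point is that Executor.map hands results back in submission order, so frames can still be written sequentially.

```python
import zlib
# ThreadPoolExecutor exposes the same interface and is handy for debugging.
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def encode_chunk(task):
    """Placeholder per-frame work: tag a payload with its index and CRC32."""
    index, payload = task
    return index, zlib.crc32(payload) & 0xFFFFFFFF, payload


def process_in_order(chunks, executor_factory=ProcessPoolExecutor, workers=None):
    """Run encode_chunk over all chunks in parallel; results keep input order."""
    tasks = list(enumerate(chunks))
    with executor_factory(max_workers=workers) as pool:
        # map() returns results in the order the tasks were submitted,
        # regardless of which worker finishes first.
        return list(pool.map(encode_chunk, tasks))
```

With the default ProcessPoolExecutor, the call should sit under an `if __name__ == "__main__":` guard in a script, since platforms that spawn worker processes re-import the main module. Leaving workers as None lets the pool size default to the number of available CPUs.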
Instead of iterating through pixels with nested loops, the system uses NumPy's repeat and reshape functions for frame generation and advanced indexing for grid sampling, drastically reducing the overhead of processing high-resolution video frames.
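A minimal round-trip sketch of this vectorized approach is shown below. It assumes single-channel (grayscale) frames and one frame's worth of data; the function names are illustrative and the project's actual code may be organized differently.

```python
import numpy as np


def bytes_to_frame(data, width, height, block_size):
    """Render bytes as a grayscale frame of black/white square blocks."""
    cols, rows = width // block_size, height // block_size
    bits = np.unpackbits(np.frombuffer(data, dtype=np.uint8))
    if bits.size > cols * rows:
        raise ValueError("data does not fit in one frame")
    grid = np.zeros(cols * rows, dtype=np.uint8)
    grid[:bits.size] = bits                 # unused cells stay 0 (black)
    grid = grid.reshape(rows, cols) * 255   # bit 1 -> 255, bit 0 -> 0
    # Expand every grid cell into a block_size x block_size square.
    blocks = np.repeat(np.repeat(grid, block_size, axis=0), block_size, axis=1)
    frame = np.zeros((height, width), dtype=np.uint8)
    frame[:blocks.shape[0], :blocks.shape[1]] = blocks
    return frame


def frame_to_bytes(frame, block_size, n_bytes):
    """Sample the center pixel of each block and rebuild the first n_bytes."""
    rows = frame.shape[0] // block_size
    cols = frame.shape[1] // block_size
    ys = np.arange(rows) * block_size + block_size // 2
    xs = np.arange(cols) * block_size + block_size // 2
    centers = frame[np.ix_(ys, xs)]         # advanced indexing, no Python loops
    bits = (centers > 128).astype(np.uint8).ravel()
    return np.packbits(bits[:n_bytes * 8]).tobytes()
```

A quick check of the round trip: frame_to_bytes(bytes_to_frame(b"hello", 1280, 720, 16), 16, 5) returns b"hello". Because only block centers are sampled against a threshold, the result is unchanged if the frame is uniformly dimmed, as long as white blocks stay above 128.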
Since the decoded file is just a stream of bytes, the restore_format.py script uses magic bytes (via the file utility) to identify what the original file extension was (e.g., .zip, .pdf, .mp4) and renames it accordingly.
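To illustrate the idea of magic-byte detection without calling the file utility, the sketch below uses a small hand-written signature table. This is a deliberately simplified substitute, not the script's actual method: the table and the guess_extension name are assumptions, and the file utility recognizes far more formats.

```python
# (offset, signature bytes, extension) for a few well-known formats.
SIGNATURES = [
    (0, b"PK\x03\x04", ".zip"),
    (0, b"%PDF-", ".pdf"),
    (0, b"\x89PNG\r\n\x1a\n", ".png"),
    # "ftyp" at offset 4 marks an ISO base media file; .mp4 is a best guess,
    # since related containers such as .mov share the same marker.
    (4, b"ftyp", ".mp4"),
]


def guess_extension(head):
    """Return a likely extension for the leading bytes of a file, or None."""
    for offset, magic, extension in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return extension
    return None
```

Only the first few bytes of the decoded file are needed, so the check can run before deciding on the final file name.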