opusify is a Rust CLI tool that reads Parquet files containing embedded WAV audio, re-encodes the audio payloads to Opus, and writes the result back to Parquet while preserving the dataset schema.
- scans an input directory recursively for `.parquet` files
- reads rows with the schema:
  `{"info": {"features": {"audio": {"_type": "Audio"}, "duration": {"dtype": "float64", "_type": "Value"}, "transcription": {"dtype": "string", "_type": "Value"}}}}`
- expects the Parquet layout to contain:
  - `audio.bytes`: WAV file bytes
  - `audio.sampling_rate`: sampling rate
  - `audio.path`: original audio file name
  - `duration`
  - `transcription`
- converts `audio.bytes` from WAV to Ogg Opus
- rewrites `audio.path` to use the `.opus` extension
- writes the transformed rows to an output directory with the same relative file layout
- Rust toolchain
- system support for building the `opus` crate and linking `libopus`
Build:

    cargo build --release

Portable release builds target the baseline x86-64 ISA so the resulting binary
works on older x86_64 CPUs. If you want a host-specific build on your own machine,
override it explicitly:

    RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Run:

    cargo run --release -- \
      --input wav-parquet \
      --output opus-parquet

Options:

    --input <PATH>              Input directory containing parquet files
    --output <PATH>             Output directory for converted parquet files
    --workers <N>               Number of worker threads
    --batch-size <N>            Parquet batch size used by the Arrow reader
    --remove-input-file         Remove the source parquet file after successful conversion
    --continue-on-error         Keep processing after row/file errors and report them in the final summary
    --scheduler <MODE>          Scheduler mode: auto | files | rows
    --bitrate <KBPS>            Target Opus bitrate in kbps
    --vbr                       Enable unconstrained variable bitrate
    --cvbr                      Enable constrained variable bitrate
    --comp, --complexity <N>    Opus encoder complexity (0-10)
    --framesize <MS>            Opus frame size: 2.5 | 5 | 10 | 20 | 40 | 60 | 80 | 100 | 120
    --application <MODE>        Opus application: audio | voip | low-delay
    --signal <TYPE>             Signal hint: auto | voice | music
    --music                     Shortcut for --signal music
    --channels <MODE>           Force encoded channels: mono | stereo
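
To make the `--framesize` values concrete, the sketch below converts a frame duration in milliseconds into samples per channel at a given sample rate. The helper `frame_samples` is hypothetical and only illustrates the arithmetic; it does not show how opusify validates the flag.

```rust
/// Samples per channel in one frame of `frame_ms` milliseconds at `sample_rate` Hz.
/// Returns `None` when `frame_ms` is not one of the durations listed for `--framesize`.
/// Hypothetical helper for illustration only.
fn frame_samples(frame_ms: f32, sample_rate: u32) -> Option<u32> {
    const ALLOWED_MS: [f32; 9] = [2.5, 5.0, 10.0, 20.0, 40.0, 60.0, 80.0, 100.0, 120.0];
    if !ALLOWED_MS.contains(&frame_ms) {
        return None;
    }
    // Work in tenths of a millisecond so 2.5 ms stays an exact integer.
    let tenths_of_ms = (frame_ms * 10.0) as u32;
    Some(sample_rate * tenths_of_ms / 10_000)
}
```

For example, a 20 ms frame at 48000 Hz is 960 samples per channel, and a 2.5 ms frame at 8000 Hz is 20 samples.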
- `auto`
  - uses row-level parallelism when processing a single Parquet file
  - uses row-level parallelism when there are only a few files and at least one is large
  - otherwise uses file-level parallelism
- `files`
  - parallelizes across Parquet files
  - rows inside each batch are processed sequentially
- `rows`
  - processes files sequentially
  - parallelizes audio transcoding across rows inside each batch

`auto` is the default and is intended to avoid nested rayon overhead.
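
The `auto` rules can be read as a small decision function. The sketch below is one possible reading, not the tool's actual logic: the thresholds `FEW_FILES` and `LARGE_FILE_BYTES` are invented for illustration, since the text above does not say what counts as "a few" files or a "large" file.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SchedulerMode {
    Files,
    Rows,
}

// Hypothetical thresholds; the real values are not documented here.
const FEW_FILES: usize = 4;
const LARGE_FILE_BYTES: u64 = 1 << 30; // 1 GiB

/// Resolve `auto` to a concrete mode from the sizes (in bytes) of the discovered files.
fn resolve_auto(file_sizes: &[u64]) -> SchedulerMode {
    let single_file = file_sizes.len() == 1;
    let few_with_a_large_one = file_sizes.len() <= FEW_FILES
        && file_sizes.iter().any(|&size| size >= LARGE_FILE_BYTES);

    if single_file || few_with_a_large_one {
        SchedulerMode::Rows
    } else {
        SchedulerMode::Files
    }
}
```

Choosing exactly one level of parallelism per run is what keeps a file-level parallel loop from spawning a second, row-level parallel loop inside each task.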
- default mode is fail-fast
- with `--continue-on-error`:
  - row-level conversion failures are logged and counted
  - file-level failures are logged and counted
  - the process still exits non-zero if any row or file failed
  - failed rows keep `audio.bytes = null`
  - failed rows preserve the original `audio.path`
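
The difference between fail-fast and `--continue-on-error` can be sketched as follows. This is an illustration under assumptions: `Tally`, `convert_rows`, and `exit_code` are hypothetical names, rows are modelled as plain byte vectors, and a failed row's `audio.bytes = null` is modelled as `None`.

```rust
#[derive(Debug, Default, PartialEq)]
struct Tally {
    rows_converted: u64,
    rows_failed: u64,
}

/// Convert every row with `convert`. In fail-fast mode the first error aborts;
/// with `continue_on_error` the failure is logged, counted, and stored as `None`.
fn convert_rows<F>(
    rows: &[Vec<u8>],
    continue_on_error: bool,
    convert: F,
) -> Result<(Vec<Option<Vec<u8>>>, Tally), String>
where
    F: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    let mut out = Vec::with_capacity(rows.len());
    let mut tally = Tally::default();
    for row in rows {
        match convert(row) {
            Ok(bytes) => {
                out.push(Some(bytes));
                tally.rows_converted += 1;
            }
            Err(err) if continue_on_error => {
                eprintln!("row conversion failed: {err}");
                out.push(None); // stands in for audio.bytes = null
                tally.rows_failed += 1;
            }
            Err(err) => return Err(err), // fail-fast
        }
    }
    Ok((out, tally))
}

/// Non-zero whenever any row or file failed, even if processing continued.
fn exit_code(tally: &Tally, files_failed: u64) -> i32 {
    if tally.rows_failed > 0 || files_failed > 0 { 1 } else { 0 }
}
```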
Convert one directory:

    cargo run --release -- \
      --input wav-parquet \
      --output opus-parquet \
      --workers 8 \
      --batch-size 256

Convert and delete source files after each successful output write:

    cargo run --release -- \
      --input wav-parquet \
      --output opus-parquet \
      --remove-input-file \
      --continue-on-error

Force row-level scheduling for a single large file:

    cargo run --release -- \
      --input wav-parquet \
      --output opus-parquet \
      --scheduler rows

ASR-oriented encode profile:

    cargo run --release -- \
      --input wav-parquet \
      --output opus-parquet \
      --bitrate 32 \
      --cvbr \
      --comp 10 \
      --framesize 20 \
      --application audio \
      --signal voice \
      --channels mono

The tool uses `log` and `env_logger`.
Default logging level is `info`. You can override it with `RUST_LOG`:

    RUST_LOG=debug cargo run --release -- --input wav-parquet --output opus-parquet

- output audio is stored as Ogg Opus bytes in `audio.bytes`
- Parquet schema and Arrow metadata are preserved in the output file
- `audio.sampling_rate` is validated against the WAV header; mismatches are logged as warnings and the WAV header is used
- supported input WAV channel counts:
  - mono
  - stereo
- supported Opus sample rates:
  - `8000`
  - `12000`
  - `16000`
  - `24000`
  - `48000`
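
The sampling-rate rule above can be sketched as a small function. The name `effective_sample_rate` is hypothetical, and returning an error for a rate outside the supported list is an assumption: the notes above do not say whether the tool rejects or resamples such input.

```rust
/// Sample rates listed as supported for Opus encoding.
const OPUS_RATES: [u32; 5] = [8000, 12000, 16000, 24000, 48000];

/// Pick the rate to encode with. The WAV header wins over the declared
/// `audio.sampling_rate`; a mismatch only produces a warning.
/// Assumption: an unsupported rate is reported as an error.
fn effective_sample_rate(declared: u32, wav_header: u32) -> Result<u32, String> {
    if declared != wav_header {
        eprintln!(
            "warning: audio.sampling_rate is {declared} but the WAV header says {wav_header}; using the header"
        );
    }
    if OPUS_RATES.contains(&wav_header) {
        Ok(wav_header)
    } else {
        Err(format!("unsupported sample rate: {wav_header} Hz"))
    }
}
```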
Run checks:

    cargo check

Run tests:

    cargo test

At the end of each run, the tool logs a summary containing:
- files discovered
- files succeeded
- files failed
- rows converted
- rows failed
- elapsed time
- files/sec
- rows/sec
- effective scheduler mode
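
The two throughput figures are simple ratios of a count to the elapsed time. A minimal sketch, assuming a hypothetical helper `per_second` that reports zero when no measurable time has elapsed:

```rust
use std::time::Duration;

/// Items processed per second, e.g. files/sec or rows/sec.
/// Returns 0.0 for a zero-length run to avoid dividing by zero.
fn per_second(count: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 {
        count as f64 / secs
    } else {
        0.0
    }
}
```

With these definitions, 100 rows converted in 4 seconds would be reported as 25 rows/sec.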