RustedBytes/opusify

opusify

opusify is a Rust CLI tool that reads Parquet files containing embedded WAV audio, re-encodes the audio payloads to Opus, and writes the result back to Parquet while preserving the dataset schema.

What It Does

  • scans an input directory recursively for .parquet files
  • reads rows with the schema:
{"info": {"features": {"audio": {"_type": "Audio"}, "duration": {"dtype": "float64", "_type": "Value"}, "transcription": {"dtype": "string", "_type": "Value"}}}}
  • expects the Parquet layout to contain:
    • audio.bytes: WAV file bytes
    • audio.sampling_rate: sampling rate
    • audio.path: original audio file name
    • duration
    • transcription
  • converts audio.bytes from WAV to Ogg Opus
  • rewrites audio.path to use the .opus extension
  • writes the transformed rows to an output directory with the same relative file layout
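The two path-related steps above can be sketched with the standard library alone. This is an illustration, not the tool's actual code: the function names `to_opus_name` and `mirrored_output_path` are hypothetical, and the sketch assumes the extension swap is a plain replacement of the last extension.

```rust
use std::path::{Path, PathBuf};

/// Swap the extension of an audio file name for `.opus`
/// (hypothetical helper; appends `.opus` if there is no extension).
fn to_opus_name(original: &str) -> String {
    Path::new(original)
        .with_extension("opus")
        .to_string_lossy()
        .into_owned()
}

/// Map an input Parquet path to the same relative location under
/// `output_root`. Returns `None` if `file` is not inside `input_root`.
fn mirrored_output_path(input_root: &Path, output_root: &Path, file: &Path) -> Option<PathBuf> {
    let relative = file.strip_prefix(input_root).ok()?;
    Some(output_root.join(relative))
}
```

With these helpers, `clip_001.wav` becomes `clip_001.opus`, and `wav-parquet/train/part-0.parquet` maps to `opus-parquet/train/part-0.parquet`.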

Requirements

  • Rust toolchain
  • system support for building the opus crate and linking libopus

Build

cargo build --release

Portable release builds target the baseline x86-64 ISA so the resulting binary works on older x86_64 CPUs. If you want a host-specific build on your own machine, override it explicitly:

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release

Usage

cargo run --release -- \
  --input wav-parquet \
  --output opus-parquet

CLI Flags

--input <PATH>               Input directory containing parquet files
--output <PATH>              Output directory for converted parquet files
--workers <N>                Number of worker threads
--batch-size <N>             Parquet batch size used by the Arrow reader
--remove-input-file          Remove the source parquet file after successful conversion
--continue-on-error          Keep processing after row/file errors and report them in the final summary
--scheduler <MODE>           Scheduler mode: auto | files | rows
--bitrate <KBPS>             Target Opus bitrate in kbps
--vbr                        Enable unconstrained variable bitrate
--cvbr                       Enable constrained variable bitrate
--comp, --complexity <N>     Opus encoder complexity (0-10)
--framesize <MS>             Opus frame size: 2.5 | 5 | 10 | 20 | 40 | 60 | 80 | 100 | 120
--application <MODE>         Opus application: audio | voip | low-delay
--signal <TYPE>              Signal hint: auto | voice | music
--music                      Shortcut for --signal music
--channels <MODE>            Force encoded channels: mono | stereo

Scheduler Modes

  • auto
    • uses row-level parallelism when processing a single Parquet file
    • uses row-level parallelism when there are only a few files and at least one is large
    • otherwise uses file-level parallelism
  • files
    • parallelizes across Parquet files
    • rows inside each batch are processed sequentially
  • rows
    • processes files sequentially
    • parallelizes audio transcoding across rows inside each batch

auto is the default. It picks a single level of parallelism (across files or across rows) to avoid the overhead of nested rayon parallelism.
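
The auto rules can be expressed as a small decision function. The README does not say what counts as "a few files" or a "large" file, so the thresholds `FEW_FILES` and `LARGE_FILE_BYTES` below are purely hypothetical placeholders, as are the type and function names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scheduler {
    Files,
    Rows,
}

// Hypothetical thresholds; the real values are not documented here.
const FEW_FILES: usize = 4;
const LARGE_FILE_BYTES: u64 = 1 << 30; // 1 GiB

/// Resolve `auto` to a concrete mode from the sizes of the discovered files.
fn resolve_auto(file_sizes: &[u64]) -> Scheduler {
    let single_file = file_sizes.len() == 1;
    let few_with_large = file_sizes.len() <= FEW_FILES
        && file_sizes.iter().any(|&size| size >= LARGE_FILE_BYTES);
    if single_file || few_with_large {
        Scheduler::Rows
    } else {
        Scheduler::Files
    }
}
```

Only one of the two modes is ever returned, which is what keeps file-level and row-level parallelism from being active at the same time.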

Failure Behavior

  • default mode is fail-fast
  • with --continue-on-error:
    • row-level conversion failures are logged and counted
    • file-level failures are logged and counted
    • the process still exits non-zero if any row or file failed
    • failed rows are written with audio.bytes = null
    • failed rows preserve the original audio.path
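
A minimal sketch of the two failure modes at row level, assuming a converter passed in as a closure (the real tool's transcoding function is not shown in this README). `Counts`, `convert_rows`, and `exit_code` are hypothetical names, and the exit code value `1` is an assumption; the README only states that the exit status is non-zero.

```rust
#[derive(Debug, Default, PartialEq)]
struct Counts {
    converted: usize,
    failed: usize,
}

/// Convert every row. In fail-fast mode the first error aborts the run;
/// with `continue_on_error` a failed row becomes `None` (a null
/// `audio.bytes`) and is counted instead.
fn convert_rows<F>(
    rows: &[Vec<u8>],
    continue_on_error: bool,
    convert: F,
) -> Result<(Vec<Option<Vec<u8>>>, Counts), String>
where
    F: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    let mut out = Vec::with_capacity(rows.len());
    let mut counts = Counts::default();
    for wav in rows {
        match convert(wav) {
            Ok(opus) => {
                out.push(Some(opus));
                counts.converted += 1;
            }
            Err(err) if continue_on_error => {
                eprintln!("row failed: {err}");
                out.push(None);
                counts.failed += 1;
            }
            Err(err) => return Err(err),
        }
    }
    Ok((out, counts))
}

/// Any failure still yields a non-zero exit status (1 is an assumed value).
fn exit_code(counts: &Counts) -> i32 {
    if counts.failed > 0 { 1 } else { 0 }
}
```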

Examples

Convert one directory:

cargo run --release -- \
  --input wav-parquet \
  --output opus-parquet \
  --workers 8 \
  --batch-size 256

Convert and delete source files after each successful output write:

cargo run --release -- \
  --input wav-parquet \
  --output opus-parquet \
  --remove-input-file \
  --continue-on-error

Force row-level scheduling for a single large file:

cargo run --release -- \
  --input wav-parquet \
  --output opus-parquet \
  --scheduler rows

ASR-oriented encode profile:

cargo run --release -- \
  --input wav-parquet \
  --output opus-parquet \
  --bitrate 32 \
  --cvbr \
  --comp 10 \
  --framesize 20 \
  --application audio \
  --signal voice \
  --channels mono

Logging

The tool uses log and env_logger.

Default logging level is info. You can override it with RUST_LOG:

RUST_LOG=debug cargo run --release -- --input wav-parquet --output opus-parquet

Notes

  • output audio is stored as Ogg Opus bytes in audio.bytes
  • Parquet schema and Arrow metadata are preserved in the output file
  • audio.sampling_rate is validated against the WAV header; mismatches are logged as warnings and the WAV header is used
  • supported input WAV channel counts:
    • mono
    • stereo
  • supported Opus sample rates:
    • 8000
    • 12000
    • 16000
    • 24000
    • 48000
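
The sampling-rate check described above might look like the sketch below. It assumes a canonical PCM WAV layout in which the fmt chunk immediately follows the RIFF header, so the sample rate sits at byte offsets 24..28 in little-endian order; real files can carry extra chunks before fmt, which this sketch does not handle. All function names are hypothetical.

```rust
/// Opus sample rates listed in the notes above.
const SUPPORTED_RATES_HZ: [u32; 5] = [8_000, 12_000, 16_000, 24_000, 48_000];

/// Read the sample rate from a canonical WAV header, or `None` if the
/// bytes do not start with the expected RIFF/WAVE/fmt layout.
fn wav_header_rate(wav: &[u8]) -> Option<u32> {
    if wav.len() < 28
        || &wav[0..4] != b"RIFF"
        || &wav[8..12] != b"WAVE"
        || &wav[12..16] != b"fmt "
    {
        return None;
    }
    Some(u32::from_le_bytes([wav[24], wav[25], wav[26], wav[27]]))
}

/// The WAV header wins on mismatch; the flag tells the caller to log a warning.
fn effective_rate(declared_hz: u32, header_hz: u32) -> (u32, bool) {
    (header_hz, declared_hz != header_hz)
}

fn is_supported_rate(rate_hz: u32) -> bool {
    SUPPORTED_RATES_HZ.contains(&rate_hz)
}
```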

Tests

Run checks:

cargo check

Run tests:

cargo test

Run Summary

At the end of each run, the tool logs a summary containing:

  • files discovered
  • files succeeded
  • files failed
  • rows converted
  • rows failed
  • elapsed time
  • files/sec
  • rows/sec
  • effective scheduler mode
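
The throughput figures in the summary are simple ratios over elapsed time. The sketch below uses a hypothetical `Summary` struct and assumes the rates are computed from succeeded files and converted rows; the README does not specify which counters the tool divides by.

```rust
use std::time::Duration;

/// Hypothetical container for the counters listed above.
struct Summary {
    files_succeeded: usize,
    rows_converted: usize,
    elapsed: Duration,
}

impl Summary {
    fn files_per_sec(&self) -> f64 {
        per_second(self.files_succeeded, self.elapsed)
    }

    fn rows_per_sec(&self) -> f64 {
        per_second(self.rows_converted, self.elapsed)
    }
}

/// Events per second, returning 0.0 for a zero-length run
/// instead of dividing by zero.
fn per_second(count: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 { count as f64 / secs } else { 0.0 }
}
```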
