Version: 3.0
Updated: 2026-04-13
Baseline: v1.9.26 (254 routes, 68 core modules, 17 blueprints, 867 tests)
Feature Plan: 302 features across 62 categories (see features.md)
⚡ Active work (2026-Q2 → 2026-Q4) lives in ROADMAP-NEXT.md — Wave A/B/C/D/E covering v1.18.0 → v1.23.0+ from the cross-project OSS research pass. This file is preserved as the long-range 302-feature plan and the v1.14.0 strategic-gap analysis below.
- Never break what works — Every wave ships a working product. No "rewrite everything then test."
- Incremental migration — New code coexists with old. Feature flags gate rollout. Old paths removed only after new paths are proven.
- User-facing value first — Each wave delivers visible improvements, not just internal refactors.
- Measure before optimizing — Add telemetry/logging before assuming bottlenecks.
- Shared infrastructure first — When multiple features need the same foundation (e.g., object tracking, spectral analysis), build the foundation once, then fan out.
- One new dependency per feature maximum — Avoid dep explosion. Prefer extending existing deps (OpenCV, FFmpeg, Pillow) over adding new ones.
Completed work (v1.0 - v1.9.26) moved to ROADMAP-COMPLETED.md.
Features are organized into 7 waves based on dependency chains, shared infrastructure, and priority. Each wave is independently shippable. Feature numbers reference features.md.
| Symbol | Meaning |
|---|---|
| FFmpeg | Pure FFmpeg filter — no Python deps beyond subprocess |
| Pillow | Image composition — already installed |
| OpenCV | Computer vision — already installed (opencv-python-headless) |
| Existing AI | Uses models already in the codebase (Whisper, Demucs, face detection, etc.) |
| New dep | Requires a new pip dependency |
| New model | Requires downloading a new AI model (potentially large) |
| Pipeline | Orchestrates existing modules — no new deps |
Goal: Ship 40+ features using only existing FFmpeg filters, Pillow, NumPy, and current AI models. Maximum user value with minimum risk.
Timeline: 4-6 weeks
New deps: Zero
New routes: ~35
These are pure FFmpeg filter additions — each is a new route calling run_ffmpeg() with a new filter graph.
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 53.2 | Adaptive Deinterlacing | S | FFmpeg | yadif/bwdif filter. Auto-detect via ffprobe field_order or idet filter. |
| 52.1 | Lens Distortion Correction | M | FFmpeg | lenscorrection filter with k1/k2 coefficients. Ship camera profile JSON (source: lensfun). |
| 52.3 | Chromatic Aberration Removal | S | FFmpeg | chromanr filter or per-channel scale via split/scale/merge. |
| 53.5 | Frame Rate Conversion (Optical Flow) | M | FFmpeg | minterpolate filter for up/down conversion. Preset modes. |
| 44.1 | Timecode Burn-In Overlay | S | FFmpeg | drawtext with %{pts\:hms} or timecode option. Configurable position/font. |
| 45.2 | AV1 Encoding Support | M | FFmpeg | libaom-av1 or libsvtav1 encoder. Add to export presets and social platform presets. |
| 45.1 | ProRes Export on Windows | M | FFmpeg | prores_ks encoder. Profile selector (Proxy/LT/422/HQ/4444). |
| 32.1 | Hardware-Accelerated Encoding | M | FFmpeg | Detect NVENC/QSV/AMF. Add h264_nvenc/hevc_nvenc codec options in export. |
| 20.4 | Photosensitive Seizure Detection | S | FFmpeg | Frame-to-frame luminance delta analysis. Flag >3 flashes/sec per ITU-R BT.1702. |
| 38.1 | GIF / WebP / APNG Export | S | FFmpeg | gif/libwebp_anim output format. Palette optimization via palettegen/paletteuse. |
| 3.10 | Film Grain & Vignette (Enhanced) | S | FFmpeg | noise + vignette filters with presets (Super 8, 16mm, 35mm, VHS). |
| 25.1 | Dialogue De-Reverb | M | FFmpeg | arnndn or afftdn with speech-optimized profile. |
| 42.2 | Timelapse Deflicker | M | FFmpeg | deflicker filter or rolling-average luminance normalization per frame. |
| 30.3 | Freeze Frame Insert | S | FFmpeg | Extract frame at timestamp, generate still clip of configurable duration, splice into sequence. |
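Each of the rows above follows the same shape: probe the source, build one filter graph, and hand it to the FFmpeg helper. Below is a minimal sketch for 53.2 (Adaptive Deinterlacing); the run_ffmpeg()/@async_job helpers named elsewhere in this roadmap are assumed, so plain subprocess stands in for them here and the real signatures may differ.

```python
# Sketch only: auto-detect interlacing via ffprobe field_order, then apply bwdif.
import json
import subprocess

def probe_field_order(path: str) -> str:
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_streams", "-select_streams", "v:0", path],
        capture_output=True, text=True, check=True,
    ).stdout
    streams = json.loads(out).get("streams", [])
    return streams[0].get("field_order", "progressive") if streams else "progressive"

def deinterlace(path: str, output: str, mode: str = "auto") -> str:
    if mode == "auto" and probe_field_order(path) == "progressive":
        return path  # already progressive: nothing to do
    graph = "bwdif=mode=send_frame:deint=all"
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vf", graph, "-c:a", "copy", output],
                   check=True)
    return output
```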
Image composition overlays using existing Pillow renderer.
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 61.1 | Composition Guide Overlay | S | Pillow | Rule-of-thirds, golden ratio, center cross, safe areas on preview frame. Display-only. |
| 36.1 | Platform Safe Zone Overlay | S | Pillow | TikTok/YouTube/Instagram UI element overlays on preview frame. JSON-driven zone definitions. |
| 34.1 | Scrolling Credits Generator | M | Pillow | Bottom-to-top scroll rendered as video via Pillow frame sequence + FFmpeg encode. |
| 34.3 | Lower Third Generator | M | Pillow | Name/title bar with configurable style presets. Burn into video at timestamp range. |
| 20.3 | Color Blind Simulation Preview | S | Pillow | Apply CVD color matrix (deuteranopia, protanopia, tritanopia) to preview frame. |
| 11.2 | Click & Keystroke Overlay | M | Pillow | Parse click/key logs → render ripple animations and key badges as overlay frames. |
| 11.3 | Callout & Annotation Generator | M | Pillow | Numbered callouts, spotlight boxes, blur regions, arrows at timestamps. |
| 18.2 | Retro VHS / CRT Effect | M | Pillow+FFmpeg | Scanlines, chroma shift, noise, tracking artifacts, date stamp. Preset chain. |
| 18.3 | Glitch Effect Pack | M | Pillow+FFmpeg | Datamosh, RGB shift, block displacement, scan distortion. Per-frame render. |
| 48.1 | Highlight Reel Auto-Assembly | M | Pipeline | Score clips by audio energy + motion → select top N → assemble with transitions + music. |
Features that extend already-installed AI models with new analysis modes.
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 55.3 | Profanity Bleep Automation | S | Existing AI | Whisper word timestamps + configurable word list → 1kHz tone or silence at flagged words. |
| 61.2 | Shot Type Auto-Classification | M | Existing AI | Face size relative to frame (MediaPipe) → ECU/CU/MCU/MS/WS classification per scene. |
| 29.1 | Shot Type Search & Tagging | M | Existing AI | Store shot type in footage index (FTS5). Enable search by shot type. |
| 56.4 | Room Tone Auto-Generation | M | NumPy | Analyze quiet segments → spectral envelope → shape white noise to match → fill cuts. |
| 61.3 | Intelligent Pacing Analysis | M | Existing AI | Scene detection cut points → mean/median/stddev shot lengths → genre benchmark comparison. |
| 28.1 | Black Frame / Frozen Frame Detection | S | FFmpeg+OpenCV | blackdetect filter + frame differencing for frozen frames. Report timestamps. |
| 28.2 | Audio Phase & Silence Gap Check | S | FFmpeg | aphasemeter + silence detection. Flag phase issues and unnatural gaps. |
| 4.8 | Best Take Selection | M | Existing AI | Per-take scoring: audio quality (SNR), face visibility, sharpness, duration. Rank takes. |
| 11.5 | Dead-Time Detection & Speed Ramp | S | Existing AI | Frame differencing (scene_detect) + silence detection → speed-ramp or cut dead time. |
| 52.4 | Lens Profile Auto-Detection | S | FFmpeg | Parse camera model from ffprobe metadata → look up in lensfun JSON database. |
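As an illustration of the 61.2 heuristic, the shot class can be read off the ratio of the largest detected face's height to the frame height; the thresholds below are illustrative guesses, not calibrated values from the codebase.

```python
# Sketch: map face-height / frame-height ratio to a shot-type label (61.2).
def classify_shot(face_h: float, frame_h: float) -> str:
    if face_h <= 0 or frame_h <= 0:
        return "WS"              # no face detected: treat as wide shot
    ratio = face_h / frame_h
    if ratio > 0.70:
        return "ECU"             # extreme close-up
    if ratio > 0.45:
        return "CU"              # close-up
    if ratio > 0.25:
        return "MCU"             # medium close-up
    if ratio > 0.12:
        return "MS"              # medium shot
    return "WS"                  # wide shot
```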
New composite video modes using FFmpeg overlay/hstack/vstack.
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 57.1 | Split-Screen Layout Templates | M | FFmpeg | JSON layout definitions (cells with x/y/w/h %). Composite via overlay filter chain. |
| 57.2 | Reaction Video Template | M | FFmpeg | Main content + PiP webcam. Auto-sync via audio cross-correlation. Audio ducking. |
| 57.3 | Before/After Comparison Export | M | FFmpeg | hstack/vstack, animated wipe via overlay + keyframed crop. Label overlay. |
| 57.4 | Multi-Cam Grid View Export | M | FFmpeg | 2x2 to 4x4 grid. Optional active-speaker highlight border from diarization data. |
| 6.3 | Side-by-Side Before/After Preview | M | FFmpeg | Preview modal showing original vs processed frame. Slider wipe in panel. |
| 3.9 | Multi-Camera Audio Sync | M | FFmpeg+NumPy | Audio fingerprint cross-correlation for time offset detection. Multicam XML output. |
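For 57.2 and 3.9, the offset step is a plain cross-correlation of the two decoded audio tracks. A minimal NumPy sketch, assuming both tracks are already decoded to mono at the same sample rate (e.g. via FFmpeg):

```python
# Sketch: FFT-based cross-correlation; returns how many seconds `a` lags `b`.
import numpy as np

def audio_offset_seconds(a: np.ndarray, b: np.ndarray, sr: int) -> float:
    a = a - a.mean()
    b = b - b.mean()
    n = len(a) + len(b) - 1
    nfft = 1 << (n - 1).bit_length()          # next power of two for the FFT
    corr = np.fft.irfft(np.fft.rfft(a, nfft) * np.conj(np.fft.rfft(b, nfft)), nfft)
    lag = int(np.argmax(corr))
    if lag > nfft // 2:                        # wrap-around means a negative lag
        lag -= nfft
    return lag / sr                            # positive: `a` starts later than `b`
```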
Wave 1 Total: ~40 features, 0 new dependencies, ~35 new routes
Goal: Build high-value composite workflows that chain existing modules into new products. These are the features that competitors charge monthly subscriptions for.
Timeline: 3-5 weeks (can overlap with Wave 1)
New deps: Zero (all existing)
New routes: ~20
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 58.1 | Long-Form to Multi-Short Extraction | L | Pipeline | Transcribe → LLM highlights (N clips) → per-clip: trim + face-reframe 9:16 + burn captions + export. Folder of numbered shorts + metadata CSV. |
| 58.4 | Podcast Episode Bundle | M | Pipeline | Denoise + normalize → clean audio export → transcribe → chapters → highlight clips → audiogram → show notes → transcript. All outputs in timestamped folder. |
| 54.4 | AI Video Summary / Condensed Recap | M | Pipeline | Scene detect → transcript LLM analysis → engagement scoring → select top N shots → trim 3-5s each → assemble with crossfades. Configurable target duration. |
| 58.2 | Video-to-Blog-Post Generator | M | Pipeline | Transcribe → LLM structured article with section headings → extract key frames at section boundaries → assemble markdown + images folder. |
| 58.3 | Social Media Caption Generator | S | Pipeline | Per-exported-clip: extract transcript → LLM generates platform-optimized post caption (char limits, hashtags, tone). JSON output alongside each clip. |
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 53.3 | Old Footage Restoration Pipeline | L | Pipeline | Stabilize → deinterlace (53.2) → denoise (temporal) → upscale (Real-ESRGAN) → color restore → frame rate conversion. VHS/8mm/Early Digital presets. |
| 40.3 | Video Podcast to Audio-Only | S | FFmpeg | Extract audio track, normalize, denoise, export as podcast-ready MP3/WAV with ID3 tags. |
| 40.4 | Podcast Show Notes Generator | M | Pipeline | Transcribe → LLM: summary, key topics with timestamps, pull quotes, mentioned resources, chapter markers. Markdown/HTML output. |
| 12.3 | Auto Montage Builder | M | Pipeline | Score clips (audio energy + motion) → select top N → detect beats in music track → trim clips to beat intervals → concatenate with transitions. |
| 14.1 | Paper Edit / Script Sync | L | Pipeline | Import script text → fuzzy-match against transcript → generate organized clip assembly with confidence scores. |
| 4.1 | Watch Folder / Hot Folder | M | Pipeline | Monitor directory for new files → auto-run configured workflow → output to destination folder. Background polling with configurable interval. |
| 4.2 | Render Queue | M | Pipeline | Queue multiple export jobs with different settings. Sequential execution with progress tracking. Notification on batch completion. |
| 5.1 | Multi-Platform Batch Publish | L | Pipeline | Single source → batch export for YouTube + TikTok + Instagram + LinkedIn with per-platform reframe, caption style, loudness target, and metadata. |
| # | Feature | Effort | Dep | Detail |
|---|---|---|---|---|
| 24.1 | Shot-Change-Aware Subtitle Timing | M | Pipeline | Scene detection (existing) → post-process captions: split at cut boundaries with minimum gap. Integrate into caption generation pipeline. |
| 16.1 | Beat-Synced Auto-Edit | L | Pipeline | Detect beats (existing librosa) → scene detect → align cuts to nearest beat → assemble. Music video editing automation. |
| 36.4 | Vertical-First Intelligent Reframe | M | Pipeline | Saliency detection + face tracking → auto-crop to 9:16 with smooth path. Better than center-crop for non-face content. |
| 30.1 | Ripple Trim / Gap Close | M | ExtendScript | After cut application, auto-close gaps by rippling subsequent clips. ExtendScript removeEmptyTrackItems(). |
Wave 2 Total: ~17 features, 0 new dependencies, ~20 new routes
Goal: Complete the remaining architectural work that enables heavy AI features in Waves 4-7. These are not user-facing but are prerequisites for scale.
Timeline: 6-10 weeks (runs in parallel with Waves 1-2)
Dependencies: Internal refactoring
The single most important infrastructure change. Every AI feature in Waves 4-7 benefits from this.
| Task | Detail |
|---|---|
| Worker pool architecture | opencut/workers/ with WorkerManager. Workers are separate Python processes per model family (whisper, demucs, realesrgan, depth, generation). |
| IPC protocol | Workers communicate via localhost HTTP (minimal Flask on random port) or multiprocessing.Queue. Job dispatcher routes by type. |
| GPU memory management | Worker reports VRAM on startup. Dispatcher checks available VRAM against model's known requirement before scheduling. Workers exit after 5-min idle to free VRAM. |
| Graceful degradation | GPU OOM → specific guidance ("Model needs 4GB VRAM, you have 2GB. Switching to CPU.") → optional CPU re-dispatch. |
| Model registry | models.json mapping model name → VRAM requirement, download size, expected load time. UI shows this info. |
Deliverable: No more OOM crashes from model conflicts. GPU utilization visible in status bar.
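A minimal sketch of the registry check the dispatcher could perform before scheduling a GPU job. The models.json shape and the use of torch.cuda.mem_get_info() are illustrative assumptions, not the shipped design:

```python
# Sketch: consult models.json for a model's VRAM need and compare with free VRAM.
import json
import torch

def load_registry(path: str = "models.json") -> dict:
    # e.g. {"demucs": {"vram_mb": 3500, "download_mb": 320, "load_s": 8}}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)

def can_schedule_on_gpu(model: str, registry: dict, headroom_mb: int = 512) -> bool:
    if not torch.cuda.is_available():
        return False
    need_mb = registry.get(model, {}).get("vram_mb", 0)
    free_bytes, _total = torch.cuda.mem_get_info()       # (free, total) in bytes
    return free_bytes / (1024 * 1024) >= need_mb + headroom_mb
```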
CEP end-of-life is approximately September 2026. UXP must be production-ready before then.
| Task | Detail |
|---|---|
| Shared component library | extension/shared/ with framework-agnostic components. Both CEP and UXP import from here. Build system outputs two bundles. |
| Feature registry | features.json defines every feature: id, label, endpoint, params schema, requires. Both panels auto-generate UI from this. Adding a feature = one JSON entry + one backend route. |
| UXP feature gap closure | Port remaining ~15% of CEP features to UXP. Mostly: workflow builder, full settings panel, plugin UI. |
| Native UXP timeline access | Replace ExtendScript evalScript() with direct premierepro UXP module for timeline read/write. 10x faster. |
| Premiere menu integration | Right-click → "OpenCut: Remove Silence" / "Add Captions" / "Normalize Audio" via UXP API. |
| CEP deprecation plan | Mark CEP panel as "legacy" in docs. Freeze CEP feature additions. All new features UXP-only after Wave 3. |
Deliverable: UXP panel at 100% parity. CEP can be removed when Adobe enforces it.
Low priority. Flask works fine at current scale. Migrate only if:
- Request validation boilerplate becomes unmanageable (>300 routes)
- WebSocket needs outgrow the current `websockets` library
- Auto-generated OpenAPI docs become essential for plugin developers
If triggered, migrate one blueprint at a time (system → settings → search → nlp → timeline → jobs → captions → audio → video). Pydantic models replace safe_float()/safe_int() hand-validation.
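For scale, a hedged sketch of the Pydantic side of such a migration: one request model per route replaces the scattered hand-validation (field names are illustrative, not an existing OpenCut route schema).

```python
# Sketch: a request model that subsumes safe_float()/safe_int() checks.
from pydantic import BaseModel, Field

class TrimRequest(BaseModel):
    path: str
    start_s: float = Field(ge=0)
    end_s: float = Field(gt=0)
    crf: int = Field(default=23, ge=0, le=51)

# FastAPI-style handler: validation errors become 422 responses automatically.
# @app.post("/video/trim")
# async def trim(req: TrimRequest): ...
```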
Continue incremental migration as files are touched. Priority order:
- API layer (`src/api/types.ts` from OpenAPI schema)
- Store/state management
- Tab modules as they're refactored for new features
No dedicated sprint. Piggyback on feature work.
Goal: Add new feature domains that require 1-2 new dependencies each but significantly expand OpenCut's capability.
Timeline: 6-8 weeks (after Wave 1, can overlap with Wave 3)
New deps: 4-6 new pip packages
New routes: ~30
Shared infrastructure: object detection framework, tracking pipeline, audio masking.
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 55.1 | License Plate Detection & Blur | M | paddleocr or YOLO plate model | Detect plates per frame → track with IoU → Gaussian blur on tracked regions. |
| 55.3 | Profanity Bleep Automation | S | None (done in Wave 1) | — |
| 55.2 | OCR-Based PII Redaction | L | paddleocr (shared with 55.1) | OCR → regex PII patterns (SSN, phone, email, CC) → NER for names → track text regions → blur. |
| 55.4 | Document & Screen Redaction | M | OpenCV (existing) | Edge detection → perspective transform → classify as screen/document/whiteboard → blur surface. |
| 55.5 | Audio Speaker Anonymization | M | Existing (pedalboard) | Diarize → target speaker segments → pitch shift + formant shift or TTS resynthesis. |
New dependency: paddleocr (or reuse existing Tesseract if sufficient). One dep serves 55.1 + 55.2.
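The shared detect→track→blur loop behind 55.1/55.2 stays small once a detector supplies boxes. A sketch using OpenCV only; the detector call is abstracted away, and the one-frame persistence rule is an illustrative choice.

```python
# Sketch: IoU association plus Gaussian blur of tracked regions (55.1 / 55.2).
import cv2

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def track_and_blur(frame, detections, prev_tracks, iou_min=0.3):
    """Blur every detected region; keep unmatched previous regions alive for one
    frame so detector flicker does not briefly reveal a plate or text block."""
    tracks = list(detections)
    for p in prev_tracks:
        if all(iou(p, d) < iou_min for d in detections):
            tracks.append(p)
    for (x1, y1, x2, y2) in tracks:
        roi = frame[y1:y2, x1:x2]
        if roi.size:
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame, tracks
```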
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 52.2 | Rolling Shutter Correction | L | gyroflow CLI (subprocess) | Integrate Gyroflow as subprocess with lens profiles. Parse gyro metadata from GoPro/DJI. |
| 13.4 | LOG / Camera Profile Pipeline | M | None | Auto-detect LOG profile from ffprobe metadata → apply bundled technical LUT (free Sony/Canon/Panasonic LUTs). |
| 43.4 | Color Space Auto-Detection | M | None | Read color_primaries/transfer_characteristics from ffprobe → auto-apply correct input transform. |
New dependency: gyroflow CLI (optional, subprocess only — not a pip package).
Shared infrastructure: STFT analysis/resynthesis pipeline.
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 56.4 | Room Tone Auto-Generation | M | None (done in Wave 1) | — |
| 56.3 | AI Environmental Noise Classifier | M | tensorflow-lite or onnxruntime (existing) | YAMNet model (521 sound classes, TFLite). Classify → selective removal via spectral masking. |
| 56.2 | Spectral Repair / Frequency Removal | M | librosa (existing) | STFT → identify persistent spectral peaks (hum/buzz) → attenuate → inverse STFT. Auto-detect mode. |
| 56.1 | Visual Spectrogram Editor | L | librosa (existing) | FFmpeg showspectrumpic or librosa STFT → zoomable canvas in panel → brush tool mask → inverse STFT reconstruction. |
New dependency: None if using onnxruntime (already installed) for YAMNet. Otherwise tflite-runtime (lightweight).
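A sketch of the 56.2 auto-detect path: STFT the clip, flag frequency bins whose energy stays high across nearly all frames (a steady hum or buzz), attenuate those bins, and resynthesize. FFT size and thresholds are illustrative.

```python
# Sketch: persistent-tone suppression via median-over-time spectral analysis.
import librosa
import numpy as np
import soundfile as sf

def remove_persistent_tones(path: str, out_path: str, atten_db: float = -30.0):
    y, sr = librosa.load(path, sr=None, mono=True)
    S = librosa.stft(y, n_fft=4096, hop_length=1024)
    mag = np.abs(S)
    per_bin = np.median(mag, axis=1)               # median energy of each bin over time
    hum_bins = per_bin > 8.0 * np.median(per_bin)  # loud in (almost) every frame
    gain = np.ones_like(per_bin)
    gain[hum_bins] = 10.0 ** (atten_db / 20.0)
    y_out = librosa.istft(S * gain[:, None], hop_length=1024, length=len(y))
    sf.write(out_path, y_out, sr)
```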
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 60.1 | Auto Proxy Generation | L | None | Detect clips >1080p → FFmpeg scale to target res + CRF 28 → store in ~/.opencut/proxies/ with manifest. Background job. |
| 60.2 | Proxy-to-Full-Res Swap on Export | S | None | Query timeline clip paths via ExtendScript → check against proxy manifest → verify originals exist → report. |
| 60.3 | Media Relinking Assistant | M | None | ExtendScript: enumerate offline items. Python: recursive search by filename + size matching. Batch relink UI. |
| 60.4 | Duplicate Media Detection | M | None | File size grouping → partial hash (first+last 64KB) → full hash for matches. Optional pHash for content matches. |
New dependency: None.
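The staged comparison in 60.4 is cheap to sketch: group by file size, hash only the first and last 64 KB, and confirm the survivors with a full hash (that last stage is omitted below for brevity).

```python
# Sketch: size grouping + partial SHA-256 hashing for duplicate candidates (60.4).
import hashlib
from collections import defaultdict
from pathlib import Path

CHUNK = 64 * 1024

def partial_hash(path: Path) -> str:
    h = hashlib.sha256()
    size = path.stat().st_size
    with path.open("rb") as fh:
        h.update(fh.read(CHUNK))              # first 64 KB
        if size > CHUNK:
            fh.seek(max(size - CHUNK, 0))
            h.update(fh.read(CHUNK))          # last 64 KB
    return h.hexdigest()

def duplicate_groups(files: list[Path]) -> list[list[Path]]:
    by_size = defaultdict(list)
    for f in files:
        by_size[f.stat().st_size].append(f)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_key = defaultdict(list)
        for f in same_size:
            by_key[partial_hash(f)].append(f)
        groups.extend(g for g in by_key.values() if len(g) > 1)
    return groups
```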
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 13.1 | Real-Time Color Scopes | L | FFmpeg+Pillow | FFmpeg waveform, vectorscope, histogram filters render scope images. Display as image grid in panel. |
| 13.5 | Film Stock Emulation | M | None | Custom 3D LUTs per stock (Kodak/Fuji) + grain overlay + gate weave + halation via blend. Preset package. |
| 13.4 | LOG Camera Profile Pipeline | M | None (listed in 4B) | — |
| 43.1 | ACES Color Pipeline | L | None | ACES IDT/ODT via FFmpeg colorspace + lut3d. Bundled ACES LUTs (free from AMPAS). |
New dependency: None (FFmpeg + bundled LUT files).
Wave 4 Total: ~18 features (excluding duplicates from Wave 1), 1-2 new deps, ~30 new routes
Goal: Build the end-to-end AI dubbing pipeline — the single highest-value new AI capability. This is what ElevenLabs, HeyGen, and Rask.ai charge $50-100/month for.
Timeline: 4-6 weeks (after Wave 3A process isolation is ready)
Prerequisite: Wave 3A (GPU process isolation) — dubbing loads multiple large models sequentially
New deps: Minimal (leverages existing Chatterbox, Whisper, Demucs, SeamlessM4T)
| # | Feature | Effort | Detail |
|---|---|---|---|
| 62.1 | End-to-End AI Dubbing Pipeline | XL | Transcribe → translate (SeamlessM4T) → voice-clone TTS (Chatterbox) with duration constraints → stem-separate original (Demucs, remove dialogue, keep music/SFX) → mix dubbed dialogue + original music/SFX → export. |
| 62.2 | Isochronous Translation | L | LLM-assisted translation constrained by segment duration. Iterate: translate → estimate TTS duration from syllable count → if too long, ask LLM to rephrase shorter → if too short, expand. Target ±10% of original. |
| 62.3 | Multi-Language Audio Track Management | M | FFmpeg -map to mux multiple audio streams with language metadata. Panel UI: track list with language dropdown, add/remove, default flag. Export multi-track MKV/MP4 or per-language files. |
| 62.4 | Emotion-Preserving Voice Translation | L | Extract prosody (F0 contour via librosa, RMS energy, speaking rate) from original → generate TTS with neutral prosody → transfer original prosody shape to dubbed audio via WORLD vocoder or pitch manipulation. |
Workflow chain: The dubbing pipeline calls 5 existing modules in sequence. The key new code is the orchestrator (core/dubbing.py) and the isochronous translation loop (core/isochron_translate.py).
New dependency: Potentially pyworld for vocoder-based prosody transfer (62.4). Everything else is already installed.
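The isochronous loop (62.2) is the only genuinely new algorithm in this wave. A sketch, assuming segment duration can be estimated from syllable count and that llm_translate() stands in for the existing LLM call:

```python
# Sketch: translate, estimate spoken duration, and rephrase until within ±10%.
import re

SYLLABLES_PER_SEC = 4.0   # rough speaking rate; would be tuned per language/voice

def estimate_duration_s(text: str) -> float:
    syllables = len(re.findall(r"[aeiouyAEIOUY]+", text)) or 1
    return syllables / SYLLABLES_PER_SEC

def fit_translation(segment_text: str, target_s: float, llm_translate, max_iter: int = 4) -> str:
    hint = ""
    candidate = llm_translate(segment_text, hint)
    for _ in range(max_iter):
        est = estimate_duration_s(candidate)
        if abs(est - target_s) <= 0.10 * target_s:
            break
        hint = "rephrase shorter" if est > target_s else "expand slightly"
        candidate = llm_translate(segment_text, hint)
    return candidate
```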
Wave 5 Total: 4 features, 0-1 new deps, ~8 new routes
Goal: Deep features for professional editors, colorists, and post-production specialists. These differentiate OpenCut from casual tools.
Timeline: 8-12 weeks (can be worked on in parallel tracks)
New deps: 2-4
| # | Feature | Effort | Detail |
|---|---|---|---|
| 61.4 | Saliency-Guided Auto-Crop | M | Face regions (high weight) + motion regions (frame diff) + text regions (OCR) + high-contrast edges → weighted heat map → place crop to maximize saliency. |
| 13.2 | Three-Way Color Wheels | L | SVG color wheel widgets in panel → map wheel positions to FFmpeg colorbalance filter values (lift/gamma/gain). Preview via frame extraction. |
| 13.3 | HSL Qualifier / Secondary Correction | L | OpenCV HSV range masking with feathered edges → apply corrections to masked region only → composite. Preview matte in panel. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 59.4 | Script-to-Rough-Cut Assembly | XL | Batch transcribe all footage → fuzzy-match transcript segments against script text → rank matches by similarity + audio quality + face visibility → assemble best take per segment as OTIO/Premiere XML. |
| 59.2 | Shot List Generator from Screenplay | M | Parse screenplay format (INT./EXT., ACTION, DIALOGUE) → LLM suggests shot count and camera angles per scene → export as CSV. |
| 59.1 | AI Storyboard Generation from Script | L | Parse script into shots → generate one image per shot via Stable Diffusion or external API → layout as storyboard grid with descriptions → export PDF. |
| 59.3 | Mood Board Generator from Footage | M | Extract keyframes → k-means color clustering → style tags (warm/cold, contrast, saturation) → suggest matching LUTs → compile as visual reference image. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 53.1 | Corrupted File Recovery | M | Detect corruption type (missing moov, truncated stream). For missing moov: untrunc algorithm with reference file. For truncated: ffmpeg -err_detect ignore_err salvage. Report recovery stats. |
| 53.4 | SDR-to-HDR Upconversion | L | FFmpeg zscale (bt709 → bt2020) + inverse tone mapping. Apply PQ/HLG transfer function. Embed ST.2086 metadata. |
| 13.6 | Power Windows with Tracking | L | Shape masks (circle, rect, polygon) in panel → track via MediaPipe (face) or SAM2 (object) → apply corrections inside/outside mask via per-frame FFmpeg filter. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 35.1 | Selective Redaction Tool | M | Click-to-select regions in preview → track across frames → blur/pixelate/black. Export redaction log for audit trail. |
| 35.2 | Chain of Custody Metadata | S | SHA-256 hash of original + all operations applied + timestamps → embed as metadata or export as sidecar JSON. |
| 35.3 | Forensic Enhancement | M | Stabilize + denoise + sharpen + contrast stretch + frame interpolation for low-quality surveillance footage. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 20.1 | Caption Compliance Checker | M | Parse captions → check against rulesets (Netflix <=42 CPL, FCC <=32 CPL, BBC <=160 WPM, min duration, CPS). Flag violations with auto-fix suggestions. |
| 20.2 | Audio Description Track Generator | L | Detect dialogue pauses (existing VAD) → extract key frames during pauses → describe via LLM vision → TTS synthesis → mix into gaps → export as AD track. |
| 27.1 | C2PA Content Credentials | M | Embed Content Authenticity Initiative metadata (origin, edit history, AI disclosure). c2pa-python library. |
Wave 6 Total: ~16 features, 2-3 new deps, ~25 new routes
Goal: Forward-looking AI capabilities and niche professional features. These are differentiators, not table-stakes.
Timeline: Ongoing (8-16 weeks, lowest priority)
New deps: Several (heavy AI models)
Prerequisite: Wave 3A (GPU process isolation) essential for multiple large models
| # | Feature | Effort | New Dep | Detail |
|---|---|---|---|---|
| 54.2 | Image-to-Video Animation | L | diffusers (existing) | SVD or CogVideoX with image conditioning → 2-6s clip from still image + motion prompt. |
| 54.5 | AI Background Replacement | L | diffusers (existing) | RVM foreground extraction + Stable Diffusion background from text prompt → composite. |
| 54.1 | AI Outpainting / Frame Extension | L | diffusers (existing) | Extend canvas to target aspect ratio → inpaint borders via ProPainter or SD. Keyframe-based for temporal consistency. |
| 54.3 | AI Scene Extension | XL | diffusers (existing) | Feed last N frames to video prediction model → generate continuation. Best for static scenes. |
| 21.1 | Multimodal Timeline Copilot | XL | LLM API (existing) | Chat interface backed by multimodal AI that sees video + audio + transcript. Navigate, select, edit via natural language. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 51.2 | Equirectangular to Flat Projection | M | FFmpeg v360 filter. Keyframeable yaw/pitch/roll for virtual camera paths. |
| 51.3 | FOV Region Extraction from 360 | M | Face detection in equirectangular space → per-speaker flat extraction with smooth tracking → multicam XML. |
| 51.1 | 360 Video Stabilization | L | Parse gyro metadata (GoPro GPMF, Insta360) → apply inverse rotation via FFmpeg v360. |
| 51.4 | Spatial Audio Alignment | L | Map speaker positions from face detection → route mono dialogue to correct ambisonic channel. First-order ambisonics output. |
| # | Feature | Effort | Detail |
|---|---|---|---|
| 41.1 | DJI Telemetry Data Overlay | M | Parse DJI SRT files → render altitude, speed, GPS, battery as configurable overlay. |
| 42.1 | Image Sequence Import & Assembly | M | Import folder of images (TIFF, EXR, DPX, PNG) → assemble as video with configurable FPS and transitions. |
| 39.1 | Elgato Stream Deck Integration | M | WebSocket/HTTP listener for Stream Deck commands → map buttons to OpenCut operations. Plugin for Stream Deck SDK. |
| 12.1 | Gaming Highlight / Kill Detection | L | Multi-signal fusion: audio peaks + motion intensity + optional OCR on kill feed → score segments → extract top clips. |
| 33.1 | Lecture Recording Auto-Split | M | Scene detection + chapter generation → split lecture by topic → generate per-topic clips with title cards. |
| 46.1 | Multi-Step Autonomous Editing Agent | XL | LLM plans editing steps from high-level instruction → executes via OpenCut API → iterates on result quality. Full agent loop with human review checkpoints. |
Wave 7 Total: ~15 features, 0-2 new deps (most already installed), ~20 new routes
Wave 1 (Quick Wins) |=============================|
Wave 2 (Pipelines) |=======================|
Wave 3A (GPU Isolation) |========================|
Wave 3B (UXP Parity) |=====================|
Wave 4 (New Domains) |========================|
Wave 5 (AI Dubbing) |================|
Wave 6 (Pro Features) |===========================|
Wave 7 (Emerging) |=========================>
Wk 1 Wk 6 Wk 12 Wk 18 Wk 24 Wk 30+
Critical path: Wave 3A (GPU isolation) must land before Waves 5 and 7A (heavy AI features).
Parallel tracks:
- Wave 1 + Wave 2 can run simultaneously (different developers or even same developer — no conflicts)
- Wave 3A + Wave 3B are independent
- Wave 4 can start as soon as Wave 1 is done (shares no code)
- Wave 6 features are independent of each other (can be cherry-picked)
| Milestone | Routes | Core Modules | Tests (est.) |
|---|---|---|---|
| Current (v1.9.26) | 254 | 68 | 867 |
| After Wave 1 | ~290 | ~78 | ~1,050 |
| After Wave 2 | ~310 | ~85 | ~1,200 |
| After Wave 4 | ~340 | ~95 | ~1,400 |
| After Wave 5 | ~348 | ~99 | ~1,500 |
| After Wave 6 | ~373 | ~110 | ~1,700 |
| After Wave 7 | ~393 | ~120 | ~1,900 |
| # | Feature | Wave | Effort | Why Critical |
|---|---|---|---|---|
| 3A | GPU Process Isolation | 3 | XL | Prerequisite for all heavy AI features. Eliminates OOM crashes. |
| 3B | UXP Full Parity | 3 | XL | CEP end-of-life ~Sept 2026. Must be ready before then. |
| 32.1 | Hardware-Accelerated Encoding | 1 | M | Users with GPUs expect NVENC/QSV. Every other tool has this. |
| 58.1 | Long-Form to Multi-Short Extraction | 2 | L | $228/year competitor (Opus Clip). Highest-value pipeline. |
| # | Feature | Wave | Effort | Why High Impact |
|---|---|---|---|---|
| 62.1 | End-to-End AI Dubbing | 5 | XL | $50-100/month competitor category. Uses all existing modules. |
| 57.1 | Split-Screen Templates | 1 | M | CapCut/iMovie table-stakes. Massive content category. |
| 55.1 | License Plate Blur | 4 | M | Privacy law compliance. Every content creator needs this. |
| 55.3 | Profanity Bleep | 1 | S | Broadcast requirement. Trivial to build. |
| 53.2 | Adaptive Deinterlacing | 1 | S | Every NLE has this. Legacy footage is common. |
| 52.1 | Lens Distortion Correction | 1 | M | Standard camera correction. lensfun database is free. |
| 56.4 | Room Tone Auto-Generation | 1 | M | iZotope RX feature. Makes silence removal sound professional. |
| 60.1 | Auto Proxy Generation | 4 | L | Premiere/Resolve/FCPX all have this. 4K editing prerequisite. |
| 61.2 | Shot Type Classification | 1 | M | Enables intelligent editing decisions and footage search. |
| 45.2 | AV1 Encoding | 1 | M | Modern codec with 30-50% bitrate savings. YouTube prefers it. |
| 45.1 | ProRes Export (Windows) | 1 | M | Professional delivery format. Resolve offers this on Windows. |
| 13.1 | Real-Time Color Scopes | 6 | L | Every colorist needs scopes. Color tools are blind without them. |
| 59.4 | Script-to-Rough-Cut | 6 | XL | Biggest time saver in post-production. Avid ScriptSync competitor. |
| 20.1 | Caption Compliance Checker | 6 | M | Netflix/FCC/BBC requirements. Prevents platform rejection. |
| 24.1 | Shot-Change-Aware Subtitle Timing | 2 | M | Broadcast QC standard. Simple post-processing. |
All remaining Wave 1-6 features not listed above (~60 features).
All Wave 7 features + FastAPI migration + TypeScript + niche items (~40 features).
| Metric | v1.9.26 (Current) | After Waves 1-2 | After Waves 3-5 | After Waves 6-7 |
|---|---|---|---|---|
| API routes | 254 | ~310 | ~348 | ~393 |
| Core modules | 68 | ~85 | ~99 | ~120 |
| Tests | 867 | ~1,200 | ~1,500 | ~1,900 |
| Time to first useful action | ~30s (workflow) | ~15s (pipeline) | ~10s (context + agent) | ~5s (copilot) |
| Install success rate | ~90% | ~92% | ~95% (isolation) | ~99% (Docker) |
| Competitor features covered | ~60% | ~75% | ~85% | ~95% |
| Features available in UXP | ~85% | ~90% | 100% | 100% |
| New deps added | 0 | 0 | 1-2 | 4-6 |
| Risk | Impact | Mitigation |
|---|---|---|
| CEP deprecation before UXP ready | High | Wave 3B is P0. Start immediately. Freeze CEP feature additions. |
| GPU process isolation complexity | High | Start with simple subprocess model. Upgrade to full worker pool later. Ship incremental improvements. |
| AI model download sizes | Medium | Models are optional. Clear size warnings in UI. Pre-download in installer. Offer cloud API fallback where possible. |
| Too many features → quality regression | High | Every new feature gets a smoke test before merge. Ruff lint on CI. No feature without a test. |
| Dependency conflicts from new packages | Medium | One new dep per feature max. Pin versions. Test in isolated venv before adding to pyproject.toml. |
| Scope creep from 302-feature plan | Medium | Waves are independently shippable. Only commit to one wave at a time. Review and reprioritize between waves. |
This roadmap should be reviewed at the start of each wave and reprioritized based on user feedback, competitive landscape changes, and lessons learned from the previous wave.
Auditor: Principal Systems Architect analysis
Date: 2026-04-14
Baseline: v1.14.0 (1,088 routes, 408 core modules, 83 blueprints, 87 test files, 6,925 tests)
Method: Full codebase scan, security audit, architecture bottleneck analysis, test/CI pipeline review
Context: This roadmap was authored at v1.9.26 (254 routes, 68 modules). The codebase has since grown 4.3x in routes and 6x in modules. The Wave 1-7 structure and growth projections are now obsolete — the "After Wave 7" target of ~393 routes was surpassed at v1.10.5. This analysis identifies the gaps that the rapid feature expansion has opened.
- GPU process isolation is still unimplemented (Wave 3A). This was marked P0 and remains the single most critical infrastructure gap. `MAX_CONCURRENT_JOBS = 10` in `opencut/jobs.py:42` allows 10 simultaneous ML model loads into VRAM. PyTorch models (Demucs, Real-ESRGAN, InsightFace, SAM2, CLIP, etc.) each consume 500MB-4GB VRAM. Concurrent loads will OOM on consumer GPUs. No memory reservation, no model-aware scheduling, no graceful degradation path exists. Every AI feature added since v1.10 has widened this gap. The 408-module codebase now has 40+ modules that load GPU models — 6x more than when Wave 3A was planned.
  - Recommended action: Implement a GPU memory budget system immediately. At minimum: reduce `MAX_CONCURRENT_JOBS` to 3 for GPU-tagged routes, add a `@gpu_exclusive` decorator that serializes GPU model access behind a semaphore, and report VRAM usage in `/system/status`.
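A minimal sketch of the suggested @gpu_exclusive decorator: a process-wide semaphore that serializes GPU model access (the real version would live alongside the job machinery in jobs.py):

```python
# Sketch: serialize GPU model loads behind a semaphore.
import functools
import threading

_GPU_SLOTS = threading.Semaphore(1)   # 1 = fully serialized; raise with care

def gpu_exclusive(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with _GPU_SLOTS:
            return fn(*args, **kwargs)
    return wrapper

@gpu_exclusive
def run_demucs(path: str):
    ...  # load model, separate stems, release VRAM before returning
```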
- Rate limiting covers 4% of async routes. Security audit found 597 async route handlers but only 23 rate-limit calls. The `require_rate_limit()` decorator exists and works, but was only applied to model-install and a handful of AI routes. All 574 unprotected async routes accept concurrent requests limited only by `MAX_CONCURRENT_JOBS=10`. A single client can trivially exhaust all 10 job slots with expensive operations (batch rendering, video processing, ML inference), starving other requests.
  - Recommended action: Introduce rate-limit categories (`gpu_heavy`, `cpu_heavy`, `io_bound`, `light`) and apply to all async routes. GPU-heavy operations should share a pool of 2-3 concurrent slots. CPU-heavy should cap at 4-6.
- Test coverage is broad but shallow. 87 test files exist with 6,925 test functions, but the architecture audit reveals 97% of the 408 core modules lack dedicated behavioral tests — they're only exercised indirectly through route smoke tests. The smoke tests in `test_route_smoke.py` use broad status code assertions like `assert resp.status_code in (200, 400, 429)` which pass regardless of whether the feature works correctly. CI enforces only 50% line coverage (`--cov-fail-under=50` in `build.yml`), which is insufficient for a codebase of this size and complexity.
  - Recommended action: Raise CI coverage threshold to 65% (target 80% over 2 sprints). Add schema validation for route responses (JSON structure, not just "is JSON"). Prioritize integration tests for the 40 GPU-model-loading modules — these are the highest-risk code paths with the least coverage.
- Roadmap growth projections are 3x out of date. The "Route Growth Projection" table estimates 393 routes after all 7 waves. Actual count is 1,088 — a 2.8x overshoot. The "Success Metrics" table, "Completed Work" section, and wave feature lists don't reflect v1.10-v1.14 additions (categories 63-77, 155 new core modules, 20 new route blueprints). The roadmap should be rebased to reflect current reality so it can be trusted for planning.
  - Recommended action: Rebase all tables to v1.14.0 actuals. Mark Wave 1-2 features that were implemented in v1.10-v1.14 as DONE. Update the dependency legend with new module families. Revise success metrics to reflect the 1,088-route baseline.
- `helpers.py` is a god module (350 imports). Every core module and most route files import from `opencut/helpers.py`. It contains FFmpeg execution, video probing, output path logic, temp file cleanup, package installation, and progress utilities — responsibilities that span 6+ concerns. This makes it a merge conflict magnet, impossible to test in isolation, and a startup bottleneck (every import chain pulls in the entire module).
  - Recommended action: Decompose into `helpers/ffmpeg.py`, `helpers/video_probe.py`, `helpers/paths.py`, `helpers/cleanup.py`, `helpers/packages.py`. Re-export from `helpers/__init__.py` for backward compat. Do this incrementally during feature work, not as a dedicated refactor sprint.
- UXP migration has 5 months remaining. CEP end-of-life is approximately September 2026. The roadmap states UXP is at ~85% feature parity (Wave 3B). The UXP panel (`extension/com.opencut.uxp/`) has 7 tabs vs. CEP's 8, and the UXP main.js is 1,523 lines vs. CEP's 7,730 — indicating significant feature gaps in the frontend. No UXP-specific tests exist in CI. The CEP panel continues to receive features (v1.14.0 version bumps touch CEP files), violating the roadmap's "freeze CEP feature additions" directive.
  - Recommended action: Audit UXP vs. CEP parity at the feature level (not tab level). Add a UXP smoke test to CI. Enforce the CEP freeze — new frontend features go to UXP only.
- No type checking in CI. 523 Python files with no mypy or pyright enforcement. Type errors (None where str expected, dict where dataclass expected, wrong callback signature) are caught at runtime — if at all. The `on_progress` callback pattern is already documented in CLAUDE.md as a gotcha (core modules call with 1 arg, routes define closures with 2 args), which is exactly the class of bug static typing catches.
  - Recommended action: Add `mypy --ignore-missing-imports opencut/` to CI. Start in non-strict mode and fix errors incrementally. Target: 0 type errors in `opencut/core/` within 2 sprints.
- Untracked subprocesses can orphan on cancel. The `@async_job` decorator registers the job's main thread for cancellation, and `_register_job_process()` tracks Popen handles. But 158 subprocess calls across core modules call `subprocess.run()` directly — these finish synchronously within the job thread but can't be interrupted mid-execution. If a user cancels a job while FFmpeg is mid-render (a 30-minute operation), the FFmpeg process runs to completion even though the job is marked cancelled. The process exit code is then silently discarded.
  - Recommended action: Wrap long-running subprocess calls in a pattern that checks the `job_cancelled` flag and sends SIGTERM to the child process. Alternatively, refactor `run_ffmpeg()` in helpers.py to accept a `job_id` parameter and auto-register the Popen for cancellation.
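A sketch of the cancellable-subprocess wrapper described above; is_cancelled() stands in for however the job framework exposes the cancelled flag:

```python
# Sketch: poll the child while checking the job's cancelled flag; SIGTERM on cancel.
import subprocess
import time

def run_cancellable(cmd, is_cancelled, poll_s=0.5, grace_s=5.0) -> int:
    proc = subprocess.Popen(cmd)
    try:
        while proc.poll() is None:
            if is_cancelled():
                proc.terminate()               # SIGTERM (TerminateProcess on Windows)
                try:
                    proc.wait(timeout=grace_s)
                except subprocess.TimeoutExpired:
                    proc.kill()
                raise RuntimeError("job cancelled")
            time.sleep(poll_s)
        return proc.returncode
    finally:
        if proc.poll() is None:
            proc.kill()
```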
- No security scanning in CI pipeline. The `build.yml` workflow runs ruff lint and pytest but has no security tooling: no bandit (Python security linter), no CodeQL (GitHub's code scanning), no Dependabot/Snyk (dependency vulnerability scanning), no SBOM generation. For a project that executes FFmpeg subprocesses, runs `pip install` at runtime via `safe_pip_install()`, and loads ML models from external sources, this is a meaningful gap.
  - Recommended action: Add `bandit -r opencut/ -ll` to CI (catches high-confidence security issues). Enable GitHub Dependabot for dependency alerts (zero-effort, just add `dependabot.yml`). Add CodeQL for deeper analysis.
- Temp file accumulation under load. 93 modules create temp files via `tempfile.mkstemp()` or `NamedTemporaryFile()`. The deferred cleanup mechanism (`_schedule_temp_cleanup()` in helpers.py) uses a 5-second delay with 3 retries. Under concurrent load (10 video processing jobs), this means hundreds of multi-GB temp files (intermediate FFmpeg outputs, extracted frames, model outputs) can accumulate before cleanup fires. No disk quota, no max-temp-size check, no cleanup-on-startup sweep.
  - Recommended action: Add a startup sweep of `tempfile.gettempdir()` for stale `opencut_*` temp files. Add a periodic (60s) background cleanup for files older than 10 minutes. Log temp disk usage in `/system/status`.
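The startup sweep is only a few lines; the name prefix and age threshold below are illustrative.

```python
# Sketch: remove stale opencut_* files from the temp directory at startup.
import tempfile
import time
from pathlib import Path

def sweep_stale_temp(prefix: str = "opencut_", max_age_s: int = 600) -> int:
    removed = 0
    now = time.time()
    for p in Path(tempfile.gettempdir()).glob(prefix + "*"):
        try:
            if p.is_file() and now - p.stat().st_mtime > max_age_s:
                p.unlink()
                removed += 1
        except OSError:
            pass   # vanished or locked by another process
    return removed
```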
- 25+ tests use `time.sleep()`, creating flaky CI. Tests in `test_batch_executor.py`, `test_batch_parallel.py`, `test_boolean_coercion.py`, `test_integration_ffmpeg.py`, and `test_preview_realtime.py` contain sleeps ranging from 10ms to 500ms. These are timing-dependent and will intermittently fail on slow CI runners, Windows VMs, or under load. Additionally, `test_solver_agent.py` uses `random.seed(42)` but other tests don't seed, introducing non-determinism.
  - Recommended action: Replace `time.sleep()` in tests with event-based synchronization (`threading.Event`, condition variables). For async result tests, poll with a timeout rather than a fixed sleep. Audit and seed all random usage.
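A sketch of the event-based replacement: the worker signals completion and the test waits with a timeout instead of sleeping for a fixed interval.

```python
# Sketch: threading.Event instead of time.sleep() in tests.
import threading

def test_job_completes_without_sleep():
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = 42        # stand-in for the real work under test
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    assert done.wait(timeout=5.0), "job did not finish within 5s"
    assert result["value"] == 42
```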
- No auto-generated API documentation. With 1,088 routes across 83 blueprints, there is no OpenAPI/Swagger spec, no auto-generated endpoint catalog, and no machine-readable API schema. Plugin developers and external integrators must read route source code. The roadmap's Wave 3C notes FastAPI migration (which brings auto-generated OpenAPI) but defers it. The original trigger — "if >300 routes" — was passed long ago.
  - Recommended action: Generate an OpenAPI spec from Flask routes using `flask-smorest` or `apispec` without migrating to FastAPI. Serve Swagger UI at `/api/docs` for development mode only. This is a 1-day effort that unlocks plugin ecosystem development.
- Blueprint registration is sequential and eager. `register_blueprints()` in `routes/__init__.py` performs 83 sequential `import` statements at app startup. Each import may trigger module-level initialization (cache setup, constant computation, availability checks). Measured impact is 2-5 seconds on startup — not a production issue but noticeable during development when the server auto-restarts on file changes.
  - Recommended action: No immediate action needed. If dev-cycle time becomes a complaint, implement lazy blueprint registration (register on first request to URL prefix).
- No performance regression detection. No benchmarks, no load tests, no response-time tracking in CI. With 1,088 routes and 408 modules, a single change to `helpers.py` or `jobs.py` could degrade performance across hundreds of endpoints with no visibility.
  - Recommended action: Add a simple benchmark suite (10 representative endpoints, measure p50/p95 response time) that runs in CI and fails on >20% regression. Use `pytest-benchmark` or custom timing.
- Missing production governance files. No `SECURITY.md` (vulnerability disclosure process), no `CODE_OF_CONDUCT.md`, no `CONTRIBUTING.md` with an architecture guide, no SBOM (software bill of materials). For an open-source project with 408 modules and ML model downloads, these are expected by enterprise adopters.
  - Recommended action: Add `SECURITY.md` with a disclosure process and supported-versions table. Generate an SBOM from `pyproject.toml` deps.
- FastAPI migration trigger has been reached. The roadmap defers FastAPI migration until ">300 routes" with the rationale that validation boilerplate would become unmanageable. Current state: 1,088 routes, 879 mutation endpoints, manual `safe_float()`/`safe_int()`/`safe_bool()` validation in every handler. Pydantic models would eliminate ~60% of per-route validation boilerplate and provide automatic request/response schema generation.
  - Recommended action: This remains low priority because Flask works and migration risk is high with 83 blueprints. However, the original deferral rationale no longer holds. If a major refactor is planned (e.g., helpers.py decomposition), consider migrating 1-2 blueprints to FastAPI as a proof-of-concept to measure the cost/benefit.
| Finding | Priority | Effort | Impact | Status |
|---|---|---|---|---|
| GPU process isolation (Wave 3A) | HIGH | XL | Eliminates OOM crashes | Not started |
| Rate limiting expansion | HIGH | M | Prevents DoS / resource exhaustion | 4% coverage |
| Test depth & coverage threshold | HIGH | L | Catches regressions before release | 50% threshold |
| Roadmap rebase to v1.14.0 | HIGH | S | Accurate planning | Stale since v1.9.26 |
| helpers.py decomposition | MEDIUM | M | Reduces coupling, merge conflicts | 350 imports |
| UXP full parity (Wave 3B) | MEDIUM | L | CEP EOL Sept 2026 | ~85% parity |
| Type checking in CI | MEDIUM | M | Catches type bugs statically | Not started |
| Subprocess cancellation | MEDIUM | M | Clean job cancel behavior | 158 untracked calls |
| Security scanning in CI | MEDIUM | S | Catches vulnerabilities | Not started |
| Temp file disk management | MEDIUM | S | Prevents disk exhaustion | No quota |
| Flaky test elimination | MEDIUM | S | Reliable CI | 25+ sleep-based tests |
| Auto-generated API docs | LOW | S | Enables plugin ecosystem | No spec exists |
| Performance benchmarks in CI | LOW | M | Detects regressions | Not started |
| Production governance files | LOW | S | Enterprise readiness | Missing |
| FastAPI migration evaluation | LOW | XL | Reduces boilerplate at scale | Deferred |
- hetpatel-11/Adobe_Premiere_Pro_MCP — https://github.com/hetpatel-11/Adobe_Premiere_Pro_MCP — MCP server bridging AI assistants (Claude/Codex) to Premiere Pro's scripting engine; CEP panel + experimental UXP plugin; covers project, ingest, sequence, timeline, transitions, effects, keyframes, metadata, export
- sebinside/PremiereRemote — https://github.com/sebinside/PremiereRemote — local HTTP/WebSocket server inside Premiere that lets external tools trigger ExtendScript; foundation for AutoHotkey and AI integrations
- cameron-astor/jumpcut — https://github.com/cameron-astor/jumpcut — jumpcut plugin that uses an external pyinstaller binary for waveform analysis (because CEP can't read audio); exact architectural model OpenCut uses
- Adobe-CEP/Samples — https://github.com/Adobe-CEP/Samples — official reference implementations of CEP extensions across Adobe apps
- adobe-cep organization — https://github.com/adobe-cep — all official samples, CEP debug and packaging tooling
- SLNimesh/pro-console — https://github.com/SLNimesh/pro-console — fx-console-style quick effect browser for Premiere; good UX reference
- MCP server bridge so OpenCut can be driven by any MCP-speaking AI client (Premiere Pro MCP)
- Local HTTP + WebSocket trigger surface so AutoHotkey, Stream Deck, shell scripts can call OpenCut actions (PremiereRemote)
- External-binary pattern for CPU-heavy work (silence detection, OCR, transcription) — keep CEP panel thin, ship signed helper EXE (jumpcut)
- UXP migration plan: track which operations work in UXP today, feature-flag them in the extension, keep CEP fallback until parity (Premiere MCP tracks both)
- Signed-bundle distribution with ZXP/ZIP + installer, and auto-update via version manifest (Adobe CEP packaging)
- "Effect console" quick palette: Ctrl-Space to fuzzy-search effects and presets (pro-console)
- Natural-language command-line inside the panel that calls the MCP server locally ("trim last 5 seconds of selected", "add 30-frame dissolve") (Premiere MCP)
- ExtendScript library of composable helpers: importClips, insertAtPlayhead, applyEffect with sane defaults (PremiereRemote has a foundation)
- Debug mode detector that warns the user once and writes PlayerDebugMode reg key only on user confirm (CEP standard)
- Ingest workflow: watch folder, auto-import, tag by filename pattern, route to sequence bins (community extensions)
- CEP panel as UI + local HTTP server for out-of-process work — keeps ExtendScript minimal, pushes heavy lifting to Python/Node (PremiereRemote, jumpcut)
- MCP protocol as the AI ↔ app boundary, avoiding custom chat UI and letting users choose their client (Premiere MCP)
- External-helper-binary pattern: ship a PyInstaller EXE alongside the .zxp and call it from ExtendScript via file.exec; remember the Windows fork-bomb requires multiprocessing.freeze_support() (project rule)
- UXP-CEP dual-ship: feature detection on load, choose the best API per operation (Premiere MCP)
- Audio-waveform out-of-band fetch: call ffmpeg to decode, analyze in helper, return JSON of silence/markers (jumpcut)
- Adobe-CEP/Samples/PProPanel — https://github.com/Adobe-CEP/Samples/tree/master/PProPanel — the only Adobe-maintained reference covering the full Premiere Pro ExtendScript surface (sequence, clip, marker, metadata, import); every new panel action should start from the matching method here
- Adobe-CEP/CEP-Resources CEP 10.x Cookbook — https://github.com/Adobe-CEP/CEP-Resources/blob/master/CEP_10.x/Documentation/CEP%2010.0%20HTML%20Extension%20Cookbook.md — manifest.xml schema, Node enablement flag, debug `.debug` file format, CSXS event round-trip
- Breakthrough/PySceneDetect — https://github.com/Breakthrough/PySceneDetect — BSD-3 scene-cut library; `ContentDetector`, `AdaptiveDetector`, `ThresholdDetector`; timecode output directly compatible with Premiere sequence markers
- geerlingguy/final-cut-it-out — https://github.com/geerlingguy/final-cut-it-out — silencedetect -> XML edit pattern; adapt the parse-silence-ranges step for Premiere .prproj / sequence-marker output
- mifi/lossless-cut — https://github.com/mifi/lossless-cut — GPLv2 reference for FFmpeg-driven lossless cut planning + silence/black-scene detection UX; great source for silence threshold/default values to preseed
- hyperbrew/bolt-cep — https://github.com/hyperbrew/bolt-cep — modern React/Vue/Svelte CEP scaffold with TypeScript and hot reload; consider adopting for future panel iterations to shed jQuery
- alphacep/vosk-api — https://github.com/alphacep/vosk-api — offline STT; small (50MB) and large (1.8GB) English models; use for "cut silence by filler-word" beyond raw silencedetect
- openai/whisper (or faster-whisper ctranslate2) — https://github.com/SYSTRAN/faster-whisper — GPU-optional transcript-driven cut planning; produces word-level timestamps needed for caption-accurate trims
- evalScript is async with a string-only result — https://github.com/Adobe-CEP/CEP-Resources/blob/master/CEP_10.x/Documentation/CEP%2010.0%20HTML%20Extension%20Cookbook.md — results come back as a string later, not synchronously; JSON-encode all payloads both directions and time out at 30s
- Node integration disabled by default — Cookbook above — add `<CEFCommandLine><Parameter>--enable-nodejs</Parameter></CEFCommandLine>` to manifest.xml or `require('child_process')` throws; also must sign/self-sign the extension, or set PlayerDebugMode=1 to load unsigned extensions
- ExtendScript is ES3 — https://community.adobe.com/t5/after-effects-discussions/how-extendscript-cep-actually-works-on-the-back-end/m-p/14354578 — no `const`, no arrow functions, no Promise; transpile from TS with @types/extendscript or ship ES3 hand-written
- CEP is deprecated, UXP is the future — CEP Cookbook — Premiere hasn't shipped UXP panels yet but is on the roadmap; keep the ExtendScript layer thin so a UXP port is feasible
- PySceneDetect memory with HEVC — https://github.com/Breakthrough/PySceneDetect/issues?q=hevc — HEVC decode via OpenCV may balloon RAM on long files; use
--backend pyavand stream rather than load - silencedetect dB threshold false positives — https://dev.to/dak425/automatically-trim-silence-from-video-with-ffmpeg-and-python-2kol —
-35dBis too aggressive for quiet rooms; start at-30dB, d=400msand tune per-clip - Flask single-threaded by default — https://flask.palletsprojects.com/en/stable/deploying/ — Python backend must run under
waitressorgunicornon Windows or long renders block the panel's next request
- CEP 12 target (manifest
<RequiredRuntime>CSXS 12.0) — covers Premiere 2024/2025/2026; older panels stay on CSXS 10; gotcha: signed extensions required unless user setsPlayerDebugMode=1(HKCUSoftware\Adobe\CSXS.12) - CSInterface.js (pinned version from
Adobe-CEP/CEP-Resources/CEP_12.x/CSInterface.js) — vendor rather than npm to guarantee compatibility; gotcha: don't mix CSInterface versions between panels in the same extension - Flask 3.0+ + waitress 3.0 for Windows service mode; listen on
127.0.0.1:0(OS-assigned port) and pass the port to the panel viaCSEvent— hardcoded ports conflict with other users' panels - PySceneDetect 0.6.4+ —
pip install scenedetect[opencv]; usedetect('in.mp4', AdaptiveDetector(adaptive_threshold=3.0), show_progress=False)and emit(start, end)asFrameTimecode.get_timecode()strings; gotcha: Windows path backslashes pass through OK but PremiereimportFilewants forward slashes - FFmpeg 7.1 static builds — https://www.gyan.dev/ffmpeg/builds/ — bundle in
server/bin/ffmpeg.exe; setFFMPEG_BINARYenv var for PySceneDetect and moviepy; gotcha: HEVC hardware decode via-hwaccel d3d11vaonly works when Premiere isn't currently holding the decoder - moviepy 2.1+ — for silence-based trimming with audio; gotcha: moviepy 2.x broke the 1.x API (
VideoFileClip.subclip->subclipped); pin exactly - Vosk 0.3.45 + English small model (50MB) — offline STT; gotcha: streaming API buffers must be 16kHz mono PCM — transcode via
ffmpeg -ar 16000 -ac 1 -f s16lebefore feeding - faster-whisper 1.0+ with CTranslate2 — for higher-accuracy transcript cuts; gotcha: CPU int8 is ~3x slower than GPU fp16 but doesn't require CUDA DLLs — default to CPU, let power users enable CUDA
- PyInstaller 6.10 with the
multiprocessing.freeze_support()guard (see user global instructions) — every Python subprocess spawned by the panel must short-circuit onsys.frozenor you get the Windows fork-bomb
Updated: 2026-04-17
Baseline: v1.32.0 (1,335 routes, ~360 core modules, 7,551+ tests)
Research pass: April 2026 competitive audit + fresh GitHub OSS survey (see §L-OSS below)
This wave synthesises:
- Gaps identified in `research.md` (§1–§8, April 2026)
- Newly discovered OSS tools not yet referenced in any prior OpenCut roadmap document
- Promotion of five Wave K stubs to full implementation
All guiding principles from ROADMAP-NEXT.md apply: never break what works, one new required dep per feature maximum, permissive licences only, match existing patterns (check_X_available(), @async_job, queue allowlist).
Goal: Make OpenCut scriptable, agent-native, and catch the remaining UX gaps that competing tools exploit in daily use.
New required deps: mcp[cli] (MCP server), transformers already present (caption translate reuses it), flask-sse or stdlib queue.Queue (SSE)
New routes: ~22
| # | Feature | Route(s) | Module | Dep | Effort | Licence |
|---|---|---|---|---|---|---|
| L1.1 | SSE job progress streaming — server-sent events endpoint for any async job; replace polling GET /jobs/{id}/status with a streaming text/event-stream feed. Emits {pct, msg, eta_s}. Improves perceived performance for long encodes / AI inference. | GET /jobs/{id}/progress/stream | core/job_sse.py | stdlib queue.Queue + Flask Response(stream_with_context) | S | — |
| L1.2 | Caption translation — translate existing .srt/.vtt captions (from any pipeline) to any of 200 languages via NLLB-200 (1.3B, Apache-2). Closes the Captions.ai and SubMagic multilingual gap. POST accepts { path, src_lang, tgt_lang[] }, returns translated SRT paths. | POST /captions/translate, GET /captions/translate/languages | core/caption_translate.py | transformers (already installed, NLLB-200 weights ~5 GB; lazy-load) | M | Apache-2 (NLLB-200), MIT (sentencepiece) |
| L1.3 | MCP server interface — expose all 1,335 OpenCut routes as MCP tools so any MCP-speaking AI client (Claude Desktop, Cline, Cursor, Codex) can drive OpenCut without a custom chat UI. Auto-generates tool schemas from Flask route docstrings. opencut/mcp_server.py is the entry point; python -m opencut.mcp_server launches it. Closes the Crayotter / Premiere-MCP agent-native gap. | opencut/mcp_server.py (standalone process, exposes stdio MCP transport) | opencut/mcp_server.py | mcp[cli] ≥1.0 (MIT) | M | MIT |
| L1.4 | Upscaling hub dispatcher — single smart route POST /video/upscale/smart picks the best available upscaler (RealESRGAN → BSRGAN fallback → ffmpeg lanczos emergency) based on content type (face, natural, animation) detected from the first 8 frames via CLIP-IQA+. UX consolidation; no new model downloads required. | POST /video/upscale/smart, GET /video/upscale/smart/info | core/upscale_hub.py | Pipeline (existing: RealESRGAN, BSRGAN, CLIP-IQA+, FFmpeg) | S | — |
| L1.5 | ElevenLabs TTS cloud backend — cloud fallback for users whose GPU can't run local TTS (MaskGCT/CosyVoice2/F5-TTS). POST /audio/tts/elevenlabs proxies the v1 /text-to-speech API. Requires user-supplied API key stored in opencut/config.json (never in source). Surfaces key prompt in /audio/tts/backends 503 hint. | POST /audio/tts/elevenlabs, GET /audio/tts/elevenlabs/voices | core/tts_elevenlabs.py | elevenlabs SDK ≥1.0 (Apache-2) | S | Apache-2 |
| L1.6 | AI face reshaper — apply facial geometry corrections (jaw slim, eye enlarge, nose reduce, chin lift) using MediaPipe face mesh + thin-plate-spline (TPS) warp. Processes each frame independently; Cutie temporal propagation keeps the mask consistent. DaVinci Resolve 21 ships this as a premium AI feature. | POST /video/face/reshape, GET /video/face/reshape/info |
core/face_reshape.py |
mediapipe ≥0.10 (Apache-2) — new dep |
L | Apache-2 |
| L1.7 | AI blemish / skin retouching — GFPGAN-guided skin inpainting (bilateral filter + frequency separation) to suppress blemishes, even skin tone, reduce under-eye circles. Works per-frame with Cutie mask; no face geometry deformation. Optional strength slider 0.0–1.0. | POST /video/face/retouch, GET /video/face/retouch/info |
core/skin_retouch.py |
gfpgan (already in OpenCut for face restore) — no new dep |
M | Apache-2 |
| L1.8 | Job history panel — persist completed/failed async jobs to SQLite (opencut/data/jobs.db), queryable by time range, route, status, media path. Powers a "recent operations" panel in the UI. Routes return {jobs: [...], total, pages}. |
GET /jobs/history, GET /jobs/history/{job_id}, DELETE /jobs/history/{job_id} |
core/job_history.py |
sqlalchemy (already present) — no new dep |
S | — |
| L1.9 | Bulk clip operations — apply a route (silence removal, stabilise, denoise, upscale, caption) to a folder of clips in one API call. Job fan-out with per-clip SSE progress and a summary result. Closes the CapCut Batch Export gap. | POST /clips/bulk/process, GET /clips/bulk/status/{batch_id} |
core/bulk_processor.py |
Pipeline + L1.1 SSE — no new dep | M | — |
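A minimal sketch of the L1.1 endpoint referenced above, assuming one in-process `queue.Queue` per job id (the real module would wire this into the existing job registry):

```python
import json
import queue
from flask import Flask, Response, stream_with_context

app = Flask(__name__)                      # placeholder for the existing OpenCut app
JOB_QUEUES: dict[str, queue.Queue] = {}    # each worker pushes {pct, msg, eta_s} dicts

@app.get("/jobs/<job_id>/progress/stream")
def job_progress_stream(job_id: str):
    q = JOB_QUEUES.get(job_id)
    if q is None:
        return {"error": "unknown job"}, 404

    def events():
        while True:
            update = q.get()                         # blocks until the worker reports
            yield f"data: {json.dumps(update)}\n\n"  # SSE framing: "data: ...\n\n"
            if update.get("pct", 0) >= 100:
                break

    resp = Response(stream_with_context(events()), mimetype="text/event-stream")
    resp.headers["Cache-Control"] = "no-cache"
    resp.headers["X-Accel-Buffering"] = "no"         # stop reverse proxies from buffering
    return resp
```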
Goal: Integrate five OSS models discovered in the April 2026 research pass that are not referenced in any previous OpenCut roadmap document. All ship as 503 stubs behind check_X_available() with clear install instructions; promoted to Tier 1 once stabilised.
New required deps: framepack, acestep, sparktts (one each, all Apache-2)
New routes: ~18
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| L2.1 | FramePack image-to-video (NeurIPS 2025, lllyasviel) — next-frame-prediction video diffusion that reframes the T2V problem as sequential frame conditioning. Generates up to 60-second video from a single image + prompt on 6 GB VRAM — far below LTX-Video (12 GB) or Open-Sora (24 GB). Apache-2, 13B parameter model, ships an optimised inference server. Adds a fourth T2V backend to the existing dispatcher (/generate/backends). Different architecture → different aesthetic; creator toolbox benefit. | `POST /generate/framepack`, `GET /generate/framepack/info` | `core/gen_video_framepack.py` | framepack (Apache-2) — new dep | L | Apache-2 | First VRAM-frugal T2V backend; democratises video gen on consumer GPUs |
| L2.2 | ACE-Step full-song music generation (ACE Studio + StepFun, January 2026) — Apache-2 foundation model for full-length music generation (up to 4 min) with lyric alignment, voice cloning, stem separation, lyric editing, and repainting. 3.5B model; generates 1 min of music in 4.7 s on RTX 3090. Supersedes the Wave K AudioGen SFX-only route for music-creation workflows. Ships POST /audio/music/acestep (text+lyrics→song) and POST /audio/music/acestep/edit (lyric editing via flow-edit). | `POST /audio/music/acestep`, `POST /audio/music/acestep/edit`, `POST /audio/music/acestep/extend`, `GET /audio/music/acestep/info` | `core/music_acestep.py` | acestep ≥0.1 (Apache-2) — new dep | L | Apache-2 | First full-song-with-lyrics generator in OpenCut; closes Suno/Udio gap locally |
| L2.3 | Spark-TTS voice synthesis (SparkAudio, March 2026, Apache-2) — SOTA zero-shot TTS with natural prosody and voice cloning from a 3-second reference clip. ONNX-compatible, runs on CPU without CUDA. Benchmarks show MOS comparable to or better than ElevenLabs v2 on English speech. Sixth TTS backend; positioned as the default for users without GPU. | `POST /audio/tts/spark`, `GET /audio/tts/spark/voices`, `GET /audio/tts/spark/info` | `core/tts_sparktts.py` | sparktts ≥0.1 (Apache-2) — new dep | M | Apache-2 | Best CPU-native zero-shot TTS now available; replaces ElevenLabs cloud dependency for offline users |
| L2.4 | VidMuse video-to-music (CVPR 2025, MIT) — generate background music that is semantically and rhythmically synchronised to video content (scene cuts, motion energy, visual tempo). Unlike MusicGen (text-only) and ACE-Step (lyric-focused), VidMuse reads the video directly to compose a matching score. POST /audio/music/vidmuse returns a stereo audio file matched to the input clip duration. | `POST /audio/music/vidmuse`, `GET /audio/music/vidmuse/info` | `core/music_vidmuse.py` | VidMuse (MIT) — new dep | L | MIT | Only local tool that composes music from the video itself rather than text prompts |
| L2.5 | Chat-based editing agent — natural-language edit interface inside the panel: "remove all silences", "add a zoom punch on the word 'launch'", "translate captions to Spanish and French". Routes each instruction to the correct OpenCut API via a lightweight intent router (function-calling LLM against the 1,335 route schemas). Supports local (Ollama + llama-3.1) and cloud (OpenAI / Anthropic) backends, user-configurable. MCP server (L1.3) provides the tool definitions; a route-validation sketch follows this table. | `POST /agent/chat`, `GET /agent/chat/history`, `DELETE /agent/chat/history` | `core/agent_chat.py` | openai ≥1.0 SDK (MIT) for cloud mode; Ollama REST (no pip dep) for local — no new required dep if user has Ollama | L | MIT / Apache-2 (user-configurable backend) | Wave L flagship: positions OpenCut as agent-native rather than just API-rich |
| L2.6 | Moonshine real-time ASR (Useful Sensors / Moonshine AI, 2025 — MIT for English models) — Whisper-compatible API with 10× faster inference on CPU, specifically optimised for streaming and edge devices. Adds moonshine as a fourth STT backend in /captions/backends. English-only MIT licence; multilingual models use a community licence (non-commercial) and are gated separately. Use case: live caption preview without GPU. | `POST /audio/transcribe/moonshine`, `GET /audio/transcribe/moonshine/info` | `core/asr_moonshine.py` | moonshine ≥0.1 (MIT for English) — new dep | M | MIT (English model) | Fastest CPU ASR available; enables real-time caption preview on low-spec machines |
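A hedged sketch of the L2.5 guard against hallucinated routes (see also the implementation notes further down); the `/routes/list` response shape shown here is an assumption, and `requests` is used only for brevity:

```python
import requests

BASE = "http://127.0.0.1:5000"   # local OpenCut backend (illustrative)

def resolve_route(candidate: str) -> str | None:
    """Return the candidate route only if the live server actually exposes it."""
    live = requests.get(f"{BASE}/routes/list", timeout=5).json()
    known = {r["rule"] for r in live.get("routes", [])}   # assumed field names
    return candidate if candidate in known else None

def dispatch_tool_call(route: str, payload: dict) -> dict:
    resolved = resolve_route(route)
    if resolved is None:
        # Never let the LLM construct an endpoint from memory.
        return {"error": f"unknown route: {route}"}
    return requests.post(f"{BASE}{resolved}", json=payload, timeout=600).json()
```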
Goal: Promote five Wave K Tier-3 stubs to full working implementations. All supporting backends (EchoMimic V3, CosyVoice2, SEA-RAFT, Depth Pro, LTX-2) shipped in Wave K — the only gap is the orchestration layer.
New required deps: None (all backends already installed via Wave K)
New routes: ~14
| # | Feature | Route(s) | Module | Builds on | Effort | Notes |
|---|---|---|---|---|---|---|
| L3.1 | Full local video dubbing pipeline (K3.1 promotion) — end-to-end: WhisperX STT → NLLB-200 translate (L1.2 reuse) → CosyVoice2 zero-shot voice clone → EchoMimic V3 lip sync → FFmpeg composite with original audio replace. Private, free, local. HeyGen charges per-minute; this is the OSS alternative. | `POST /dub/pipeline`, `GET /dub/pipeline/backends`, `GET /dub/pipeline/status/{job_id}` | `core/dub_pipeline.py` | K2.4 CosyVoice2 + K2.5 EchoMimic V3 + L1.2 caption translate | L | Priority: unblocks avatar pipeline and multilingual content creation |
| L3.2 | Auto trailer / promo generator (K3.2 promotion) — LLM-scored moment extraction (emotion arc + virality score + face presence) → top-N clip selection → MusicGen ramp + title card (declarative_compose) + CTA overlay + auto-paced cut rhythm. All component pieces in OpenCut from prior waves; the conductor (pipeline module) is the only gap. Descript Underlord charges for this. | `POST /generate/trailer`, `POST /generate/promo`, `GET /generate/trailer/presets` | `core/trailer_gen.py` | K1.3 virality + A2.5 emotion arc + Wave I declarative_compose + K2.18 audio-reactive | M | Wire existing outputs; conductor logic is the effort |
| L3.3 | Sports / genre-agnostic highlights (K3.8 promotion) — optical flow velocity peak (SEA-RAFT) + YAMNet crowd energy + laughter/cheer detection + face-count burst combine into a per-frame excitement score. Top-N segments extracted with 2-second padding. Works for sports, concerts, events, gaming clips — not just talking-head. OpusClip ClipAnything charges per clip. A scoring sketch follows this table. | `POST /analyze/highlights/sports`, `GET /analyze/highlights/genres`, `GET /analyze/highlights/info` | `core/highlights_sports.py` | K2.9 SEA-RAFT + H1.4 YAMNet (or AudioGen CLAP) + A2.5 emotion arc | M | --genre param gates which scoring functions activate |
| L3.4 | EchoMimic V3 talking head (K2.5 promotion from "stub" to "tested + recommended") — full integration test suite, half-body mode, reference-audio conditioning, and promotion to recommended: true in /lipsync/backends. Unblocks L3.1 dub pipeline. | `POST /lipsync/echomimic`, `GET /lipsync/echomimic/presets`, `GET /lipsync/echomimic/info` | `core/lipsync_echomimic.py` (extend existing stub) | K2.5 EchoMimic V3 stub | M | Integration tests required before recommended: true flag |
| L3.5 | AI CineFocus rack focus (K2.19 promotion) — depth-of-field bokeh using Depth Pro metric depth: keyframeable focal point, aperture shape, f-number slider, rack-focus animation (focus-pull from background to foreground over N frames). DaVinci Resolve 21 CineFocus requires a paid licence. OpenCut ships free. | `POST /video/cinefocus/render`, `POST /video/cinefocus/preview`, `GET /video/cinefocus/presets` | `core/cinefocus.py` | K2.13 Depth Pro + FFmpeg boxblur | M | Expose focal_x, focal_y as keyframeable float params |
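A minimal sketch of the L3.3 scoring combine referenced above; the per-frame input signals, weights, and helper names are illustrative, and a real implementation would also suppress overlapping peaks:

```python
import numpy as np

def excitement_score(flow_vel, crowd_energy, face_count,
                     w=(0.5, 0.3, 0.2)) -> np.ndarray:
    """Weighted sum of normalised per-frame signals → per-frame excitement in [0, 1]."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return w[0] * norm(flow_vel) + w[1] * norm(crowd_energy) + w[2] * norm(face_count)

def top_segments(score: np.ndarray, fps: float, n: int = 5, pad_s: float = 2.0):
    """Pick the n highest-scoring frames and expand each into a 2-second-padded segment."""
    pad = int(round(pad_s * fps))
    peaks = np.argsort(score)[::-1][:n]
    return sorted((max(0, int(p) - pad), min(len(score) - 1, int(p) + pad)) for p in peaks)
```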
Tools discovered in the April 2026 research pass that are not mentioned in any prior roadmap wave. Listed with licence, stars tier, and the OpenCut feature each enables.
| Tool | Org | Licence | Status | Relevance |
|---|---|---|---|---|
| FramePack | lllyasviel | Apache-2 | NeurIPS 2025 — production weights on HuggingFace | L2.1 — fourth T2V backend, 6 GB GPU viable |
| FLUX.1 Kontext (dev variant) | Black Forest Labs | Apache-2 (dev only) | June 2025 — instruction-guided image/frame editing | Evaluate for Wave M: POST /video/frame-edit/kontext — edit a single frame then propagate via TokenFlow |
| Wan 2.2 | Alibaba Wan-Video | Apache-2 | Q2 2026 — follow-up to Wan 2.1 with improved motion quality | Evaluate: upgrade existing C4 Wan2.1 backend when stable |
| Tool | Org | Licence | Status | Relevance |
|---|---|---|---|---|
| ACE-Step (v1.5, Jan 2026) | ACE Studio + StepFun | Apache-2 | Production — 3.5B, RTX 3090 in 4.7 s/min | L2.2 — full-song music with lyrics, voice clone, lyric edit |
| Spark-TTS | SparkAudio | Apache-2 | March 2026 — ONNX-ready, CPU-native | L2.3 — best CPU-native zero-shot TTS; replaces cloud ElevenLabs for offline use |
| VidMuse | — (CVPR 2025) | MIT | Research release | L2.4 — video-conditioned music composition |
| Tool | Org | Licence | Status | Relevance |
|---|---|---|---|---|
| Moonshine (English) | Useful Sensors / Moonshine AI | MIT (English models) | Stable — C++/Python/Android/iOS | L2.6 — 10× faster than Whisper on CPU; real-time caption preview |
| Tool | Licence | Key Differentiator | Gap vs OpenCut |
|---|---|---|---|
| mifi/lossless-cut | GPLv2 | FFmpeg-first lossless cut planning + smart UX | No AI features; OpenCut laps it on AI but its silence-threshold UX defaults are worth borrowing |
| WyattBlue/auto-editor | MIT | CLI silence removal with sub-second previews | OpenCut bulk processor (L1.9) closes the gap; auto-editor's --motion mode is a reference for sports highlights (L3.3) |
| Crayotter | MIT | Multimodal agentic video editor (GPT-4o + FFMPEG) | OpenCut chat agent (L2.5) closes this gap with local model support |
| hetpatel-11/Adobe_Premiere_Pro_MCP | MIT | MCP server bridging Claude/Codex → Premiere scripting | OpenCut MCP server (L1.3) closes this gap but targets the backend layer (1,335 routes) not just Premiere's scripting layer |
| FireRed/OpenStoryline | Apache-2 | Style Skills — LLM-generated motion/FX presets | Evaluate for Wave M: "Style Skills" tab in the effects panel |
| sebinside/PremiereRemote | MIT | WebSocket trigger surface for Premiere from external tools | OpenCut MCP server (L1.3) + chat agent (L2.5) cover the use case more broadly |
| Competitor Feature | OpenCut Gap (pre-L) | Wave L closes it |
|---|---|---|
| Captions.ai — multilingual auto-captions | No translation route | L1.2 caption translate (NLLB-200, 200 languages) |
| Captions.ai — AI face retouch | No skin retouching | L1.7 blemish removal + L1.6 face reshaper |
| OpusClip — sports highlights | K3.8 stub unshipped | L3.3 promotion |
| HeyGen — video dubbing pipeline | K3.1 stub unshipped | L3.1 promotion |
| Descript Underlord — trailer generator | K3.2 stub unshipped | L3.2 promotion |
| DaVinci Resolve 21 — CineFocus | K2.19 stub unshipped | L3.5 promotion |
| Suno v5.5 / Udio — full song gen | No local equivalent | L2.2 ACE-Step (Apache-2, local) |
| ElevenLabs — SOTA TTS | Cloud-gated | L2.3 Spark-TTS (CPU-native Apache-2) |
| Runway Gen-4.5 — agent / API surface | 1,335 routes, no MCP | L1.3 MCP server (any AI client can drive OpenCut) |
| Premiere Pro 26 — AI chat editing | No chat interface | L2.5 chat agent (local + cloud backend) |
| Topaz Video AI — batch processing | No batch API | L1.9 bulk clip operations |
- SSE + Flask dev server (L1.1) — Flask's built-in server buffers responses; SSE requires `threaded=True` and `Response(stream_with_context(...))`. Under waitress/gunicorn use `text/event-stream` with an explicit `X-Accel-Buffering: no` header. Test with `curl -N` before wiring the panel.
- NLLB-200 1.3B cold-load time (L1.2) — ~8 s on first request. Lazy-load and keep resident in a module-level singleton guarded by a `threading.Lock` (see the sketch after this list). Do not reload per request. Memory footprint: ~3 GB. Warn users with < 8 GB RAM in the 503 install hint.
- MCP server port conflict (L1.3) — MCP over stdio is the safe default (no port). HTTP transport is optional but requires a free port; document in the install hint. Never auto-start the MCP server at extension load; the user must explicitly launch `python -m opencut.mcp_server`.
- FramePack sequential inference (L2.1) — FramePack generates frames one-at-a-time, not in batch. The progress callback should emit per-frame percentage. Cap output at 240 frames (10 s at 24fps) by default to bound inference time; expose `--max_frames` as an advanced param.
- ACE-Step Windows torch.compile (L2.2) — `--torch_compile true` requires `triton-windows` on Windows; surface this in the 503 install hint rather than hard-requiring it. Default to `--torch_compile false` on Windows. `--cpu_offload true --overlapped_decode true` should be the Windows default to fit within 8 GB VRAM.
- Spark-TTS ONNX path (L2.3) — the PyPI `sparktts` package is the Python reference; the ONNX runtime path (via `onnxruntime`) is 3–5× faster on CPU. Detect `onnxruntime` at runtime and prefer it; surface conversion instructions if missing.
- VidMuse GPU memory (L2.4) — VidMuse requires 24 GB VRAM at full resolution. Add `--resize_input 512` as a default to bring it within 12 GB; document the quality trade-off in the 503 hint.
- Chat agent hallucinated routes (L2.5) — the intent router must validate every candidate route against the live `/routes/list` endpoint before calling it. Never let the LLM construct routes from memory; always look them up. Rate-limit at 5 concurrent chat requests to prevent runaway job queues.
- Moonshine multilingual gate (L2.6) — English models are MIT; non-English models are Moonshine Community Licence (non-commercial, registration required). Gate them separately: `check_moonshine_en_available()` vs `check_moonshine_multilingual_available()`. Never auto-download non-English models silently.
- EchoMimic V3 backend priority (L3.4) — when promoting to `recommended: true`, do not silently redirect existing MuseTalk/LatentSync requests. Keep the old backends available; only set the new `recommended` flag so the panel defaults to it for new sessions.
- Bulk processor job cancellation (L1.9) — individual jobs within a batch can fail without cancelling the whole batch. Return `{completed: N, failed: M, skipped: K}` in the summary. Expose `POST /clips/bulk/cancel/{batch_id}` to stop in-flight batches.
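A minimal sketch of the L1.2 lazy-load singleton from the NLLB-200 note above; the model id and language codes are assumptions, and the exact `pipeline` kwargs may vary by transformers version:

```python
import threading
from transformers import pipeline

_translator = None
_translator_lock = threading.Lock()

def get_translator():
    """Load the NLLB-200 translation pipeline once per process and keep it resident."""
    global _translator
    with _translator_lock:
        if _translator is None:
            _translator = pipeline("translation",
                                   model="facebook/nllb-200-distilled-1.3B")
    return _translator

def translate_lines(lines, src_lang="eng_Latn", tgt_lang="spa_Latn"):
    """Translate a list of caption lines; language codes follow the NLLB FLORES scheme."""
    results = get_translator()(lines, src_lang=src_lang, tgt_lang=tgt_lang)
    return [item["translation_text"] for item in results]
```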
| Release | Target | Key deliverables |
|---|---|---|
| v1.33.0 | 2026-Q3 | L1.1 SSE streaming, L1.2 caption translate, L1.3 MCP server, L1.4 upscaling hub, L1.5 ElevenLabs TTS, L1.6 face reshaper, L1.7 skin retouch, L1.8 job history, L1.9 bulk ops |
| v1.34.0 | 2026-Q3 | L2.1 FramePack, L2.2 ACE-Step, L2.3 Spark-TTS, L2.4 VidMuse, L2.5 chat agent, L2.6 Moonshine ASR |
| v1.35.0 | 2026-Q4 | L3.1 dub pipeline, L3.2 trailer gen, L3.3 sports highlights, L3.4 EchoMimic V3, L3.5 CineFocus |
- FLUX.1 Kontext pro variant — commercial licence; dev variant (Apache-2) is viable but requires testing on single-frame edit + TokenFlow propagation workflow. Defer to Wave M once that pipeline is validated.
- Moonshine multilingual models — Moonshine Community Licence requires registration for revenue >$1M; safe for OpenCut but requires explicit user opt-in and a separate `check_moonshine_multilingual_available()` gate. Defer multilingual support to Wave M.
- ReEzSynth Ebsynth style propagation — research.md §8 Tier 4. Licence unclear at time of survey; revisit when clarified.
- FlashVSR real-time VSR — CVPR 2026 weights pending public release. Add to Wave M watch list.
- STAR temporal coherence post-processor — research-only weights at time of survey. Revisit for Wave M.
- Sammie-Roto-2 rotoscoping — research-only weights. Cutie (K2.7) covers the tracking use case under MIT. Revisit if licence clarifies.
- HappyHorse 1.0 T2V — licence TBD at time of survey. Revisit for Wave M.
- GaussianHeadTalk / FantasyTalking2 — early research releases; EchoMimic V3 (K2.5) is more mature and production-ready. Revisit after L3.4 ships.
- Digital twin / AI avatar pipeline — requires stable L3.1 dub + L3.4 EchoMimic + brand-kit identity data. Planned for Wave M as a first-class pipeline once all component pieces are hardened in Wave L.
- Plugin marketplace / hub — architectural dependency on stable MCP server (L1.3) and job history (L1.8). Defer to Wave M once those foundations are proven.
- FramePack — lllyasviel/FramePack (NeurIPS 2025, Apache-2)
- ACE-Step — ace-step/ACE-Step v1.5 (Jan 2026, Apache-2); Technical Report arXiv:2506.00045
- Spark-TTS — SparkAudio/Spark-TTS (Mar 2026, Apache-2)
- VidMuse — CVPR 2025, MIT licence (weights on HuggingFace)
- Moonshine — moonshine-ai/moonshine (MIT for English, Moonshine Community Licence for multilingual)
- FLUX.1 Kontext — Black Forest Labs (June 2025, Apache-2 dev variant, commercial pro variant)
- MCP SDK — modelcontextprotocol/python-sdk (MIT)
- NLLB-200 — Meta AI (Apache-2); research paper
- Crayotter — idwts/Crayotter (MIT, agentic video editor reference)
- Premiere Pro MCP — hetpatel-11/Adobe_Premiere_Pro_MCP (MIT, MCP architecture reference)
- OpenCut `research.md` — April 2026 competitive audit, §1–§9, Wave L priorities §8
- OpenCut `ROADMAP-NEXT.md` — Wave K stubs (K2.5, K2.19, K3.1, K3.2, K3.8) promoted in L3
Updated: 2026-06-15
Baseline: v1.35.0 (post-Wave L; MCP server, job history, FramePack, ACE-Step, Spark-TTS, Moonshine English, dub pipeline, trailer gen, EchoMimic V3 all shipped)
Research pass: June 2026 GitHub OSS survey — Chatterbox, Kokoro, DiffRhythm, Wan2.2 family
This wave synthesises:
- Four new OSS tools confirmed in this research pass not yet referenced in any prior roadmap document (Chatterbox TTS, Kokoro, DiffRhythm v1.2, Wan2.2 family)
- Features deferred from Wave L's "Not Adopted" list that now have their required dependencies hardened (Moonshine multilingual, FLUX.1 Kontext, plugin marketplace, digital twin pipeline)
- A video model upgrade path: Wan2.1 K3.7 VACE stub → Wan2.2 full family (T2V, S2V, Animate)
All guiding principles from prior waves apply: never break what works, one new required dep per feature maximum, permissive licences only, check_X_available() guard + @async_job + queue allowlist.
Goal: Dramatically improve the TTS and music generation ecosystem with three newly confirmed permissive-licence models, and promote the deferred Moonshine multilingual gate.
New required deps: chatterbox-tts (MIT), kokoro ≥0.9.4 (Apache-2), DiffRhythm inference script + requirements.txt (Apache-2), espeak-ng system dep (shared by Kokoro + DiffRhythm)
New routes: ~16
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| M1.1 | Chatterbox TTS (Resemble AI, MIT) — Chatterbox-Turbo (350M) is the fastest emotional open-source TTS available: paralinguistic tags [laugh], [chuckle], [cough] baked into the model architecture, zero-shot voice cloning from a 10-second clip, and a multilingual variant (500M, 23 languages) that benchmarks ahead of ElevenLabs Turbo v2.5 and Cartesia Sonic 3 in independent Podonos evaluation. Built-in Perth perceptual watermarking. Ships as two backends: chatterbox-turbo (English, GPU preferred) and chatterbox-multilingual (23 langs, gated behind a second check_chatterbox_multilingual_available() guard). Supersedes the ElevenLabs cloud fallback (L1.5) for users who can run local. | `POST /audio/tts/chatterbox`, `POST /audio/tts/chatterbox/multilingual`, `GET /audio/tts/chatterbox/voices`, `GET /audio/tts/chatterbox/info` | `core/tts_chatterbox.py` | chatterbox-tts ≥0.1 (MIT) — new dep | M | MIT | Only open-source TTS with native paralinguistic emotion tags; beats ElevenLabs in naturalness benchmarks; zero-shot cloning |
| M1.2 | Kokoro ultralight TTS (hexgrad, Apache-2) — 82M parameter TTS, CPU-only (pip install kokoro), 24 kHz output, 9 languages (US/UK English, Spanish, French, Hindi, Italian, Japanese, Portuguese, Mandarin). No CUDA required whatsoever. Fastest possible TTS for low-spec machines and CI-pipeline preview renders. Ships as the new last-resort TTS fallback below Spark-TTS in the priority chain: Chatterbox Turbo → Spark-TTS → Kokoro → (error); a dispatcher sketch follows this table. Requires espeak-ng system package for G2P OOD fallback. | `POST /audio/tts/kokoro`, `GET /audio/tts/kokoro/voices`, `GET /audio/tts/kokoro/info` | `core/tts_kokoro.py` | kokoro ≥0.9.4 + misaki + espeak-ng system dep (Apache-2) — new dep | S | Apache-2 | 82M params, CPU-only, pip install kokoro; adds a "works on any machine" fallback tier for TTS |
| M1.3 | DiffRhythm full-song generation (ASLP-lab, Apache-2) — first diffusion-based full-length song generator; base model outputs 1m35s, full model outputs up to 4m45s. Accepts LRC lyrics + optional audio style reference OR text style prompt (e.g., "Jazzy Nightclub Vibe", "Pop Emotional Piano"). DiffRhythm v1.2 adds song editing, continuation, and instrumental-only mode. Complements ACE-Step (L2.2): ACE-Step excels at lyric editing and stem separation; DiffRhythm excels at composing complete songs end-to-end from a text brief. 8 GB VRAM minimum; --chunked flag reduces peak to 8 GB. Requires espeak-ng for lyrics phonemisation (same system dep as M1.2). Windows: must set PHONEMIZER_ESPEAK_LIBRARY and PHONEMIZER_ESPEAK_PATH env vars. | `POST /audio/music/diffrhythm`, `POST /audio/music/diffrhythm/extend`, `GET /audio/music/diffrhythm/styles`, `GET /audio/music/diffrhythm/info` | `core/music_diffrhythm.py` | DiffRhythm git clone + requirements.txt (Apache-2) — new dep | L | Apache-2 | End-to-end song composer from lyrics + style text; up to 285 s; complements ACE-Step |
| M1.4 | Moonshine multilingual ASR — deferred from Wave L "Not Adopted" list. Moonshine Community Licence is non-commercial; acceptable for OpenCut (personal/hobbyist use) with explicit user acknowledgment. Ships behind a check_moonshine_multilingual_available() guard that also checks for a stored acknowledgment timestamp in opencut/config.json (moonshine_multilingual_ack). Presents a one-time "Non-commercial use only" consent prompt on first activation. Adds 99-language transcription to the ASR backend chain. | `POST /audio/transcribe/moonshine/multilingual`, `GET /audio/transcribe/moonshine/multilingual/info` | `core/asr_moonshine_multilingual.py` | moonshine multilingual weights (Community Licence) — new dep with user opt-in | M | Moonshine Community Licence (non-commercial; gated) | Closes 99-language caption gap for non-English creators on CPU hardware; no GPU required |
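A hedged sketch of the M1.2 priority chain; the `check_*`/`synth_*` callables stand in for the real backend modules and are passed in rather than imported:

```python
from typing import Callable

Backend = tuple[str, Callable[[], dict], Callable[..., str]]

def synthesise(text: str, chain: list[Backend], **opts) -> dict:
    """Walk Chatterbox Turbo → Spark-TTS → Kokoro and use the first available backend."""
    hints = {}
    for name, check_available, synth in chain:
        status = check_available()
        if status.get("available"):
            return {"backend": name, "path": synth(text, **opts)}
        hints[name] = status.get("hint")            # collect install hints for the 503 body
    return {"error": "no TTS backend available", "hints": hints}
```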
Goal: Upgrade the Wan2.1 VACE stub (K3.7) to the full Wan2.2 family, add speech-driven talking-head video (S2V) and character animation/replacement (Animate), and integrate FLUX.1 Kontext for AI image editing across video frames.
New required deps: Wan2.2 package (Apache-2, upgrades existing Wan2.1 dep), diffusers ≥0.29 already present (FLUX Kontext uses diffusers pipeline)
New routes: ~14
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| M2.1 | Wan2.2 T2V/I2V/TI2V upgrade (Wan-Video, Apache-2) — replaces the Wan2.1 K3.7 VACE stub with the full Wan2.2 model family. Key improvements over Wan2.1: MoE architecture (larger effective capacity at same compute), +65.6% image training data / +83.2% video training data, cinematic aesthetics labels (lighting, composition, colour tone), and the TI2V-5B consumer model (720P@24fps on a 4090). Integrated into ComfyUI and Diffusers. Routes: create a new /generate/wan2.2 family (T2V, I2V, TI2V) with a backwards-compat alias from /generate/wan2.1. TI2V-5B is the default for consumer deployments; A14B models surface a VRAM check and require the --offload_model True flag. | `POST /generate/wan2.2/t2v`, `POST /generate/wan2.2/i2v`, `POST /generate/wan2.2/ti2v`, `GET /generate/wan2.2/info` | `core/gen_video_wan22.py` | Wan2.2 (git clone Wan-Video/Wan2.2 + requirements.txt, Apache-2) — upgrades existing Wan2.1 dep | L | Apache-2 | Best open-source T2V; MoE architecture, cinematic aesthetics, 720P@24fps on consumer GPUs (TI2V-5B) |
| M2.2 | Wan2.2-S2V speech-to-video (Wan-Video, Apache-2, Aug 2025) — 14B model that generates a talking-head video from an audio clip + reference portrait image. Given any audio recording (no TTS needed — accepts real voice files), the model generates synchronized lip movements, upper-body motion, and natural facial expression. Optional CosyVoice2 integration (already in OpenCut) enables full text→speech→video mode. Primary use case: narration video without filming — write a script, clone a voice (M1.1 Chatterbox), generate the video. Requires 80 GB VRAM for single-GPU, multi-GPU via FSDP, or --offload_model True for consumer cards. | `POST /generate/wan2.2/s2v`, `GET /generate/wan2.2/s2v/info` | `core/gen_video_wan22_s2v.py` | Wan2.2-S2V-14B weights + requirements_s2v.txt (Apache-2) — CosyVoice2 optional | L | Apache-2 | Script + voice clone → talking-head video; closes HeyGen/Synthesia gap locally |
| M2.3 | Wan2.2-Animate character animation and replacement (Wan-Video, Apache-2, Sep 2025) — unified model for two workflows: (a) animate a still character photo to match motions from a reference video, (b) replace the character in a video with a different appearance while preserving all movements and expressions. Replicates holistic body movement and facial expression simultaneously. Ships as a complement to EchoMimic V3 (L3.4): EchoMimic excels at portrait-only lip-sync from audio; Animate excels at full-body motion transfer and character swap. | `POST /generate/wan2.2/animate`, `GET /generate/wan2.2/animate/info` | `core/gen_video_wan22_animate.py` | Wan2.2-Animate-14B weights (Apache-2) — additional Wan2.2 variant | L | Apache-2 | Full-body character animation + replacement; closes Adobe Character Animator gap locally |
| M2.4 | FLUX.1 Kontext-dev image editing (Black Forest Labs, Apache-2 dev variant) — deferred from Wave L "Not Adopted" list. Context-aware image-to-image editing: accepts an image + natural language instruction and returns the edited result. Primary workflow: apply per-frame AI edits (object removal, style transfer, subject replacement, background swap) and propagate changes across a video clip using TokenFlow (already in OpenCut as a Wave K dep). The dev variant is Apache-2; the pro variant is commercial and is NOT used. Pre-flight check: FLUX Kontext-dev weights are ~24 GB — GET /image/edit/kontext/info returns {size_gb: 24, downloaded: bool} and the install route confirms before downloading (a pre-flight sketch follows this table). | `POST /image/edit/kontext`, `POST /video/edit/kontext`, `GET /image/edit/kontext/info` | `core/image_edit_kontext.py` | diffusers ≥0.29 (already present) + FLUX Kontext-dev weights via HuggingFace (~24 GB) | M | Apache-2 (dev variant only) | Per-frame AI editing propagated to video via TokenFlow; closes Runway/Pika AI object-edit gap |
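A hedged sketch of the M2.4 pre-flight check; the cache location and repo id are assumptions — the real check should inspect wherever the module actually stores the weights:

```python
from pathlib import Path

KONTEXT_SIZE_GB = 24
# Assumed HuggingFace cache layout; adjust to the module's real download location.
KONTEXT_SNAPSHOTS = (Path.home() / ".cache" / "huggingface" / "hub"
                     / "models--black-forest-labs--FLUX.1-Kontext-dev" / "snapshots")

def check_kontext_available() -> dict:
    """Report downloaded/missing weights so the 503 can ask for download confirmation."""
    if not KONTEXT_SNAPSHOTS.exists() or not any(KONTEXT_SNAPSHOTS.iterdir()):
        return {"available": False,
                "reason": "weights_not_downloaded",
                "size_gb": KONTEXT_SIZE_GB,
                "hint": "Confirm the ~24 GB download via the install route first"}
    return {"available": True, "size_gb": KONTEXT_SIZE_GB}
```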
Goal: Ship two major platform-level features that were deferred from Wave L because they depended on components that were not yet stable. Both dependencies are now hardened in Waves L and M.
New required deps: None — chains existing (L1.3 MCP, L1.8 job history, L3.1 dub, L3.4 EchoMimic, M1.1 Chatterbox, M2.2 S2V)
New routes: ~12
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| M3.1 | Plugin marketplace / hub — deferred from Wave L. Now that the MCP server (L1.3) and job history (L1.8) are production-hardened, build a plugin registry: GET /plugins/list returns a manifest of installed + available community plugins; POST /plugins/install downloads, validates the SHA-256 manifest, and registers the plugin; each plugin exposes new routes via MCP tool schemas validated against the /routes/list contract. Ships a built-in plugin SDK (opencut/plugin_sdk.py) with a plugin template, route validator, and schema generator. Initial curated set: 5 first-party plugins (batch caption translate, bulk denoise, VidMuse auto-score, Chatterbox narration, export preset packs). Community plugins require explicit user consent. | `GET /plugins/list`, `POST /plugins/install`, `DELETE /plugins/{id}`, `GET /plugins/{id}/schema`, `GET /plugins/{id}/routes` | `core/plugin_manager.py`, `opencut/plugin_sdk.py` | Chains L1.3 MCP + L1.8 job history — no new dep | L | — | First truly extensible OpenCut architecture; community-driven feature growth post-Wave M |
| M3.2 | Digital twin / AI avatar pipeline — deferred from Wave L. Full end-to-end localisation pipeline: (1) clone voice from a 10-second reference clip using Chatterbox TTS (M1.1), (2) generate narration audio from the script, (3) generate lip-sync talking-head video using Wan2.2-S2V (M2.2) or EchoMimic V3 (L3.4), (4) translate and dub the output into target languages using the dub pipeline (L3.1), (5) composite the avatar onto original footage. Exposed as a single POST /pipeline/digital_twin endpoint accepting {script, voice_ref_path, face_ref_path, target_langs[]} and returning a completed per-language dubbed video package. Each stage is independently skippable if pre-existing assets are provided (a stage-skipping sketch follows this table). | `POST /pipeline/digital_twin`, `GET /pipeline/digital_twin/info`, `GET /pipeline/digital_twin/stages` | `core/pipeline_digital_twin.py` | Chains M1.1 Chatterbox, M2.2 S2V, L3.1 dub, L3.4 EchoMimic — no new dep | XL | Combined (MIT + Apache-2) | Complete localisation pipeline — script-in, multilingual-dubbed-video-out; closes CapCut AI dubbing gap |
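A hedged sketch of only the M3.2 stage-skipping rule — the `clone_and_narrate` / `speech_to_video` / `dub_video` callables are placeholders for the real M1.1 / M2.2 / L3.1 modules:

```python
from typing import Callable

def run_digital_twin(req: dict,
                     clone_and_narrate: Callable[[str, str], str],
                     speech_to_video: Callable[[str, str], str],
                     dub_video: Callable[[str, str], str]) -> dict:
    """A stage runs only when its output is not already supplied in the request."""
    out = dict(req)
    if not out.get("narration_path"):            # stages 1-2: voice clone + narration
        out["narration_path"] = clone_and_narrate(out["script"], out["voice_ref_path"])
    if not out.get("avatar_video_path"):         # stage 3: talking-head video
        out["avatar_video_path"] = speech_to_video(out["narration_path"], out["face_ref_path"])
    out["dubbed"] = {lang: dub_video(out["avatar_video_path"], lang)   # stage 4: per-language dub
                     for lang in out.get("target_langs", [])}
    return out
```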
| Tool | Org | Licence | Category | Wave M Role |
|---|---|---|---|---|
| Chatterbox TTS | Resemble AI | MIT | TTS | M1.1 — Emotional TTS + voice clone |
| Chatterbox-Multilingual | Resemble AI | MIT | TTS | M1.1 — 23-language variant |
| Kokoro-82M | hexgrad | Apache-2 | TTS | M1.2 — 82M CPU-only fallback TTS |
| DiffRhythm v1.2 | ASLP-lab | Apache-2 | Music Gen | M1.3 — End-to-end song generation |
| DiffRhythm+ / DiffRhythm 2 | ASLP-lab | Apache-2 | Music Gen | Watch list — not yet stable; Wave N |
| Moonshine multilingual | Moonshine AI | Community | ASR | M1.4 — Gated multilingual ASR |
| Wan2.2 T2V/I2V/TI2V | Wan-Video | Apache-2 | T2V | M2.1 — Upgrades Wan2.1 K3.7 stub |
| Wan2.2-S2V-14B | Wan-Video | Apache-2 | Talking Head | M2.2 — Speech-to-video generation |
| Wan2.2-Animate-14B | Wan-Video | Apache-2 | Char Anim | M2.3 — Character animation/replacement |
| FLUX.1 Kontext-dev | Black Forest Labs | Apache-2 | Image Edit | M2.4 — Per-frame AI editing |
| Fish-Speech v1.5 | Fish Audio | FARL | TTS | NOT ADOPTED — non-commercial licence |
| LightX2V | ModelTC | Apache-2 | Inference Acc | Watch list — Wan2.2 acceleration; Wave N |
| FastVideo | hao-ai-lab | Apache-2 | Inference Acc | Watch list — distilled Wan2.2; Wave N |
| Gap | Competitor | Wave M Feature | Closes? |
|---|---|---|---|
| Emotional/paralinguistic TTS local | ElevenLabs | M1.1 Chatterbox | Y — free, local, MIT; beats ElevenLabs in benchmarks |
| Full-length song from lyrics text | Suno, Udio | M1.3 DiffRhythm | Y — up to 4m45s, text or audio reference |
| Audio-driven talking-head video | HeyGen, Synthesia | M2.2 Wan2.2-S2V | Y — local, any audio recording as input |
| Character replacement + animation | Adobe Character Animator | M2.3 Wan2.2-Animate | Y — full-body motion + expression transfer |
| AI object/subject editing in video | Runway Gen-3, Pika 2.0 | M2.4 FLUX Kontext | Y — per-frame editing propagated via TokenFlow |
| Plugin / extension ecosystem | Premiere Pro extensions | M3.1 Plugin marketplace | Y — MCP-native SDK, community plugins |
| End-to-end script-to-dubbed-video | CapCut AI dubbing, HeyGen | M3.2 Digital twin | Y — script + face ref + voice ref → multilingual video |
- Chatterbox on Windows: Resemble AI developed on Debian 11. Windows users will need `espeak-ng` for phonemisation fallback (same as DiffRhythm). Document `PHONEMIZER_ESPEAK_LIBRARY` and `PHONEMIZER_ESPEAK_PATH` in the M1.1 503 install hint.
- DiffRhythm espeak-ng env vars: `check_diffrhythm_available()` must validate that both env vars are set on Windows before allowing the model to load. Fail with a clear 503 hint that includes the MSI download URL and the two env var names.
- Wan2.2 A14B models: both T2V-A14B and I2V-A14B require 80 GB VRAM for single-GPU inference. Gate these variants with `check_wan22_highvram_available()` (checks `torch.cuda.mem_get_info()` for 76+ GB free; see the sketch after this list). Surface a 503 that recommends TI2V-5B for consumer cards.
- Wan2.2-S2V CosyVoice2 dep: S2V has an optional CosyVoice2 dependency (`requirements_s2v.txt`) for the text→speech→video mode. Keep this optional — S2V works perfectly well with a pre-recorded audio file. The install hint should clarify the two modes.
- FLUX Kontext-dev weight size: ~24 GB download. Add a size check in `check_kontext_available()` that returns `{available: false, reason: "weights_not_downloaded", size_gb: 24}` when absent. The `POST /image/edit/kontext` 503 should include an explicit download confirmation step, not a silent auto-download.
- Moonshine multilingual licence gate: never auto-start the multilingual model without a stored acknowledgment. The `check_moonshine_multilingual_available()` guard must verify that `opencut/config.json["moonshine_multilingual_ack"]` exists and is a valid ISO 8601 timestamp before returning `available: true`.
- Fish-speech is NOT viable: the Fish Audio Research License explicitly requires a separate commercial license for any deployment beyond personal research. Do not adopt regardless of model quality. Chatterbox (MIT) or Kokoro (Apache-2) are the correct alternatives.
- DiffRhythm 2 / DiffRhythm+: both are newer variants with separate papers (arXiv:2507.12890 for DiffRhythm+) but were still maturing at the time of this research pass. Integrate DiffRhythm v1.2-base and v1.2-full in M1.3; revisit newer variants for Wave N.
- Digital twin pipeline (M3.2) VRAM budget: the pipeline chains Chatterbox (350M, ~1 GB VRAM) → S2V (14B, ~28 GB offloaded) → dub (CosyVoice2, ~4 GB) → EchoMimic (if chosen, ~6 GB). Peaks are per-stage, never concurrent. Coordinate offloading via the existing `ModelRegistry` to avoid OOM.
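A minimal sketch of the `check_wan22_highvram_available()` gate from the Wan2.2 A14B note above; the 76 GB threshold comes from that note, the response shape from the existing guard pattern:

```python
import torch

def check_wan22_highvram_available(min_free_gb: float = 76.0) -> dict:
    """Gate the A14B variants on free VRAM; recommend TI2V-5B otherwise."""
    if not torch.cuda.is_available():
        return {"available": False, "reason": "no_cuda",
                "hint": "A14B variants need a CUDA GPU; use TI2V-5B instead"}
    free_bytes, _total = torch.cuda.mem_get_info()      # (free, total) in bytes
    free_gb = free_bytes / (1024 ** 3)
    if free_gb < min_free_gb:
        return {"available": False, "reason": "insufficient_vram",
                "free_gb": round(free_gb, 1),
                "hint": "Use TI2V-5B or --offload_model True on consumer cards"}
    return {"available": True, "free_gb": round(free_gb, 1)}
```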
| Release | ETA | Features |
|---|---|---|
| v1.36.0 | 2026-Q4 | M1.1 Chatterbox TTS, M1.2 Kokoro, M1.3 DiffRhythm, M1.4 Moonshine multilingual |
| v1.37.0 | 2026-Q4 | M2.1 Wan2.2 upgrade, M2.2 S2V, M2.3 Animate, M2.4 FLUX Kontext |
| v1.38.0 | 2027-Q1 | M3.1 Plugin marketplace, M3.2 Digital twin pipeline |
- Fish-speech (Fish Audio Research License) — non-commercial only; Fish Audio requires a separate commercial license for any production deployment. NOT viable. Use Chatterbox (MIT) or Kokoro (Apache-2) instead.
- DiffRhythm+ / DiffRhythm 2 — Apache-2 but newer variants still maturing at time of survey. DiffRhythm v1.2 is the stable target. Monitor for Wave N.
- Wan2.2-T2V-A14B / I2V-A14B consumer — 80 GB VRAM requirement makes these impractical on consumer hardware. TI2V-5B (M2.1) covers the consumer use case. Defer A14B consumer offload optimisation to Wave N.
- LightX2V acceleration (ModelTC, Apache-2) — step-distillation and sparse-attention wrappers for Wan2.2; not yet production-stable. Watch list for Wave N once upstream inference stabilises.
- FastVideo distilled Wan2.2 (hao-ai-lab, Apache-2) — distilled Wan models with sparse attention; same stability concern as LightX2V. Wave N watch list.
- FLUX.1 Kontext-pro — commercial licence. Dev variant (Apache-2) is the only viable path. Do not use pro variant.
- Kokoro.js — web-only JavaScript variant; not relevant to the Python/Flask backend.
- HunyuanVideo (Tencent) — Tencent Community License prohibits commercial use for >100M MAU and is geo-restricted (excludes EU, UK, South Korea). Already excluded in Wave K. Confirmed NOT viable.
- Chatterbox — resemble-ai/chatterbox (MIT, May 2025); Podonos evaluations: vs ElevenLabs Turbo v2.5, vs Cartesia Sonic 3
- Kokoro — hexgrad/kokoro (Apache-2); Kokoro-82M on HuggingFace; 82M params, CPU-only
- DiffRhythm — ASLP-lab/DiffRhythm (Apache-2, March 2025); arXiv:2503.01183; DiffRhythm+ arXiv:2507.12890; v1.2 released May 2025
- Wan2.2 — Wan-Video/Wan2.2 (Apache-2, July 2025); arXiv:2503.20314; S2V added Aug 2025; Animate added Sep 2025
- FLUX.1 Kontext — black-forest-labs/flux (Apache-2 dev variant, June 2025); `model_licenses/LICENSE-FLUX1-dev` confirmed Apache-2
- Fish-speech — fishaudio/fish-speech — confirmed Fish Audio Research License (non-commercial); NOT adopted
- Wave L Not Adopted list — Moonshine multilingual, FLUX.1 Kontext, plugin marketplace, digital twin pipeline all promoted to Wave M
- Wave K stubs — K3.7 Wan2.1 VACE upgraded to Wan2.2 full family in M2.1
Updated: 2026-04-24
Baseline: v1.38.0 (post-Wave M; FastVideo/LightX2V not yet integrated; Wan2.2 S2V, Animate, DiffRhythm, Chatterbox, Kokoro, FLUX Kontext, plugin marketplace, digital twin pipeline all shipped)
Research pass: April 2026 GitHub OSS survey — FastVideo, LightX2V, SAM2, Depth-Anything-V2, CogVideoX, Qwen2.5-VL, CSM
This wave synthesises:
- Two inference-acceleration frameworks (FastVideo, LightX2V) that make the Wave M Wan2.2 models viable on mid-range consumer hardware through largely quality-neutral speed gains
- Two scene-intelligence primitives (SAM2 video segmentation, Depth-Anything-V2 depth maps) that unlock a new class of compositor effects not possible without per-frame spatial understanding
- Three content-intelligence and model-expansion features (CogVideoX-5B, Qwen2.5-VL smart timeline, DiffRhythm+ upgrade) that close remaining gaps in the creative toolbox
- Deferred items from Wave M watch list that have now stabilised (DiffRhythm 2, LightX2V Wan2.2 I2V A14B 4-step)
Goal: Make all Wan2.2 endpoints from Wave M usable on mid-range consumer hardware without quality degradation. FastVideo adds sparse distillation to the T2V/TI2V path; LightX2V adds quantization + 4-step step-distillation to the I2V A14B path.
New required deps: fastvideo ≥0.1 (Apache-2), lightx2v (Apache-2)
New routes: ~10
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| N1.1 | FastVideo sparse distillation (hao-ai-lab, Apache-2, April 2025) — unified post-training and real-time inference framework for accelerated video generation. Ships distilled FastWan2.1-T2V-1.3B and FastWan2.2-TI2V-5B models that achieve >50× denoising speedup over baseline Wan2.2 via sparse-attention distillation. Demonstrated 5-second 1080P video in 4.5 s on a single 4090. pip install fastvideo; supports Windows, Linux, macOS. Integration: add an ?engine=fast query flag on POST /generate/wan2.2/t2v and POST /generate/wan2.2/ti2v that swaps the VideoGenerator backend to FastVideo/FastWan2.2-TI2V-5B-Diffusers. Quality parity for most prompts; falls back to baseline Wan2.2 if the fast model is not downloaded. | `POST /generate/wan2.2/t2v?engine=fast`, `POST /generate/wan2.2/ti2v?engine=fast`, `GET /generate/wan2.2/fast/info` | `core/gen_video_fastvideo.py` | fastvideo ≥0.1 (Apache-2) — new dep | M | Apache-2 | >50× speedup on existing Wan2.2 routes; enables real-time 1080P generation on a single 4090 |
| N1.2 | LightX2V quantization + step distillation (ModelTC, Apache-2, April 2026) — lightweight video generation inference framework offering FP8/INT8 quantization, 4-step step-distilled models (Wan2.2-I2V-A14B in 4 steps instead of 50), sparse attention (≈1.5× additional speedup vs FP8 alone), and disaggregated deployment for multi-GPU. The latest Wan2.2-I2V-A14B-4step-720p weights (April 20, 2026) are trained on a high-quality 720P dataset with a low-noise algorithm for better fine-grained detail. Primary impact: Wan2.2 I2V A14B, which previously required 80 GB VRAM at full precision, becomes usable on 24 GB cards at 4-step FP8. Integration: POST /generate/wan2.2/i2v gains ?quant=fp8&steps=4 params; check_lightx2v_available() validates the LightX2V package and distilled weights. | `POST /generate/wan2.2/i2v?quant=fp8&steps=4`, `GET /generate/wan2.2/i2v/quantization/info`, `GET /generate/wan2.2/i2v/backends` | `core/gen_video_lightx2v.py` | lightx2v (Apache-2) — new dep; Wan2.2-I2V-A14B-4step-720p weights via HuggingFace | M | Apache-2 | A14B I2V on 24 GB cards via 4-step FP8; up to 42× acceleration combined with CFG distillation |
Goal: Add per-frame spatial understanding to OpenCut. SAM2 enables pixel-perfect video masking and object tracking from user prompts; Depth-Anything-V2 generates per-frame depth maps that feed into depth-of-field, parallax, and smart-reframe effects.
New required deps: sam2 ≥0.4 (Apache-2), depth_anything_v2 (Apache-2)
New routes: ~14
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| N2.1 | SAM 2.1 video object segmentation (Meta FAIR, Apache-2, ECCV 2024) — foundation model for promptable visual segmentation in images and videos. Accepts click, box, or mask prompts on any frame and propagates the segmentation mask throughout the entire video in real time. SAM 2.1 (Sep 2024) is the improved checkpoint suite: 4 sizes (Tiny 38M@91FPS → Large 224M@39FPS on A100). Practical workflows: (a) AI background removal / green-screen replacement without a studio, (b) per-object colour grade, (c) selective blur/mosaic, (d) AI rotoscoping for motion graphics compositing. Renders exported masks as alpha-channel video, matted video, or COCO-format JSON for downstream compositing. Installation: git clone facebookresearch/sam2 && pip install -e .; Windows: WSL recommended for CUDA kernel compilation. | `POST /video/segment/sam2`, `POST /video/segment/sam2/propagate`, `GET /video/segment/sam2/info` | `core/segment_sam2.py` | sam2 ≥0.4 (Apache-2) — new dep; checkpoints: sam2.1_hiera_small (46M, recommended default) | L | Apache-2 | Per-frame object tracking + mask propagation; closes Adobe After Effects Roto Brush gap locally |
| N2.2 | Depth-Anything-V2 depth maps (DepthAnything, Apache-2, NeurIPS 2024) — monocular depth estimation foundation model; accepts a single frame (or video), produces a per-pixel depth map. Available in 4 sizes (Small 24M → Large 335M); all run on GPU or CPU. Primary workflows in OpenCut: (a) upgrades AI CineFocus rack-focus (L3.5) with a second depth engine alongside Depth Pro — Depth-Anything-V2 is faster (real-time on GPU) while Depth Pro is more metric-accurate; (b) parallax video effect: separate foreground/background layers by depth and apply independent motion to each, creating a simulated camera-movement 2.5D effect; (c) smart vertical-to-horizontal reframe: depth-guided subject isolation for platform-aware cropping. POST /video/depth/estimate returns a float32 depth map video; POST /video/depth/parallax renders the 2.5D effect directly. | `POST /video/depth/estimate`, `POST /video/depth/parallax`, `POST /video/cinefocus/render?engine=depth_anything`, `GET /video/depth/info` | `core/depth_anything_v2.py` | depth_anything_v2 (Apache-2) — new dep; Small model 24M params, CPU-capable | M | Apache-2 | Real-time depth maps; enables parallax 2.5D, smart reframe, depth-guided compositing |
| N2.3 | SAM2 + depth compositor pipeline — wires N2.1 and N2.2 together into a single POST /video/compose/depth_segment endpoint: (1) SAM2 segments the subject(s), (2) Depth-Anything-V2 estimates depth, (3) the compositor combines masks + depth to produce a layered composite with configurable per-layer effects (colour grade, blur, motion parallax, replace background); a single-frame combine sketch follows this table. This closes the biggest remaining gap between OpenCut and professional compositing tools (Adobe After Effects, DaVinci Fusion) for single-subject video editing. No new deps beyond N2.1 + N2.2. | `POST /video/compose/depth_segment`, `GET /video/compose/depth_segment/info` | `core/compose_depth_segment.py` | Chains N2.1 SAM2 + N2.2 Depth-Anything-V2 — no new dep | M | Combined (Apache-2) | End-to-end compositing pipeline; closes After Effects Roto + Depth gap locally |
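A minimal single-frame sketch of the N2.3 combine step referenced above (keep the SAM2 subject sharp, blur what sits behind a depth threshold); array shapes and value ranges are assumptions:

```python
import cv2
import numpy as np

def composite_frame(frame: np.ndarray, mask: np.ndarray, depth: np.ndarray,
                    depth_cut: float = 0.5, blur_ksize: int = 31) -> np.ndarray:
    """frame: HxWx3 uint8; mask: HxW in {0, 1} from SAM2; depth: HxW in [0, 1] (far = 1)."""
    background = (mask == 0) & (depth > depth_cut)          # pixels outside the subject and far away
    blurred = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)
    out = frame.copy()
    out[background] = blurred[background]                    # per-layer effect: blur only the background
    return out
```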
Goal: Add a second T2V model family for wider hardware coverage (CogVideoX-5B on RTX 3060), add VLM-powered smart timeline analysis (Qwen2.5-VL), and promote DiffRhythm+ (now stable) as an upgrade to the Wave M DiffRhythm v1.2 backend.
New required deps: diffusers ≥0.32 (already present for FLUX Kontext), qwen_vl_utils (Apache-2)
New routes: ~16
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| N3.1 | CogVideoX-5B T2V + I2V (THUDM/Zhipu AI, Apache-2, Aug 2024 / updated Mar 2025) — second T2V family alongside Wan2.2. Key differentiators: (a) runs on RTX 3060 (12 GB VRAM) — far more accessible than Wan2.2 TI2V-5B (16 GB) or A14B (80 GB); CogVideoX-2B even runs on a GTX 1080TI; (b) CogVideoX1.5-5B generates 10-second videos at higher resolution with I2V support at any resolution; (c) DDIM Inverse support enables non-destructive video editing (invert → edit → re-generate); (d) LoRA fine-tuning on a single 4090 via cogvideox-factory; (e) available via diffusers (no new framework). Ships as a second backend in /generate/backends alongside Wan2.2; POST /generate/cogvideox for T2V, POST /generate/cogvideox/i2v for I2V, POST /generate/cogvideox/invert for DDIM inversion. Aesthetic: more cinematic/stylised vs Wan2.2's photorealistic. | `POST /generate/cogvideox`, `POST /generate/cogvideox/i2v`, `POST /generate/cogvideox/invert`, `GET /generate/cogvideox/info` | `core/gen_video_cogvideox.py` | diffusers ≥0.32 (already present) + CogVideoX-5B weights via HuggingFace (~18 GB) | L | Apache-2 | Second T2V aesthetic; accessible on 12 GB GPUs; 10s videos; DDIM inversion for video editing |
| N3.2 | Qwen2.5-VL smart timeline analysis (Alibaba/QwenLM, Apache-2, Sep 2024) — vision-language model optimised for visual understanding; accepts video clips and natural-language questions, returns structured analysis. In OpenCut: powers a POST /analyze/video/vl endpoint that answers questions like "describe each scene in this video", "identify all text visible on screen", "list all products shown", "rate the visual quality of each clip". Returns structured JSON with timestamped answers. Feeds two downstream features: (a) AI-assisted search across a clip library by natural-language query (POST /library/search/vl); (b) auto-chapter generation from semantic scene descriptions (complements L3.2 trailer gen). Models: Qwen2.5-VL-7B recommended; Qwen2.5-VL-3B for low-spec machines. Available via pip install transformers qwen_vl_utils. | `POST /analyze/video/vl`, `POST /library/search/vl`, `POST /analyze/video/chapters`, `GET /analyze/video/vl/info` | `core/analyze_vl_qwen.py` | qwen_vl_utils (Apache-2) + transformers (already present) + Qwen2.5-VL-7B weights (~16 GB) | M | Apache-2 | VLM-powered content understanding; closes Descript AI Scenes gap; enables natural-language clip search |
| N3.3 | DiffRhythm+ music upgrade (ASLP-lab, Apache-2, July 2025) — improved version of Wave M's DiffRhythm v1.2 backend. Key improvements in DiffRhythm+ (arXiv:2507.12890): better style control fidelity, improved voice/instrument separation in generated songs, stronger adherence to LRC lyric timing, and support for longer compositions. Ships as a drop-in upgrade to core/music_diffrhythm.py: add DiffRhythm+ as a second variant in GET /audio/music/diffrhythm/info alongside v1.2; route POST /audio/music/diffrhythm accepts ?model=v1.2 or ?model=plus (default: whichever is downloaded, preferring plus). Backwards compatible — no route changes. | `POST /audio/music/diffrhythm?model=plus`, `GET /audio/music/diffrhythm/models` | `core/music_diffrhythm.py` (extend existing) | DiffRhythm+ weights via HuggingFace (Apache-2) — no new pip dep | M | Apache-2 | Drop-in music quality upgrade for Wave M users; better lyric timing and style fidelity |
| N3.4 | Sesame CSM-1B conversational speech (Sesame AI Labs, Apache-2, March 2025) — 1B-parameter speech generation model that produces contextually-aware conversation audio. Unlike Chatterbox (single utterance + emotion) and Kokoro (TTS from text), CSM accepts a conversation context (prior speaker audio + text segments) and generates the next utterance in a natural dialogue style with consistent speaker identity. Primary use case: generating realistic multi-speaker dialogue for explainer videos, podcasts, and AI avatar conversations. Native Transformers support as of v4.52.1 (from transformers import CsmForConditionalGeneration). Gated: requires accepting the Meta Llama Community License for the Llama-3.2-1B backbone — add a csm_llama_ack timestamp to config.json; consent shown on first activation. English-only. | `POST /audio/speech/csm`, `GET /audio/speech/csm/context`, `GET /audio/speech/csm/info` | `core/tts_csm.py` | transformers ≥4.52.1 (already present) + CSM-1B weights + Llama-3.2-1B weights (both gated via HuggingFace, requires Meta Community License accept) | M | Apache-2 (code + CSM weights); Meta Llama Community License (Llama-3.2-1B backbone, gated) | Only local model producing contextual multi-speaker dialogue audio; closes ElevenLabs Conversations gap |
| Tool | Org | Licence | Category | Wave N Role |
|---|---|---|---|---|
| SAM 2.1 | Meta FAIR | Apache-2 | Video Seg | N2.1 — Video object segmentation + tracking |
| Depth-Anything-V2 | DepthAnything | Apache-2 | Depth Est | N2.2 — Monocular depth maps for compositing |
| FastVideo | hao-ai-lab | Apache-2 | Inference Acc | N1.1 — >50× speedup for Wan2.2 T2V/TI2V |
| LightX2V | ModelTC | Apache-2 | Inference Acc | N1.2 — 4-step FP8 Wan2.2 I2V A14B |
| CogVideoX-5B | THUDM/Zhipu AI | Apache-2 | T2V | N3.1 — Second T2V family; 12 GB GPU; 10s videos |
| CogVideoX-2B | THUDM/Zhipu AI | Apache-2 | T2V | N3.1 — Ultra-low-VRAM fallback for older GPUs |
| Qwen2.5-VL | QwenLM/Alibaba | Apache-2 | VLM | N3.2 — Smart timeline analysis + clip search |
| DiffRhythm+ | ASLP-lab | Apache-2 | Music Gen | N3.3 — Quality upgrade for Wave M DiffRhythm |
| CSM-1B | Sesame AI Labs | Apache-2 + Meta Llama CL | Speech | N3.4 — Contextual multi-speaker dialogue TTS |
| AudioCraft (MusicGen) | Meta/facebookresearch | Code MIT, weights CC-BY-NC | Music Gen | NOT ADOPTED — weights non-commercial |
| SUPIR | SupPixel Pty Ltd | Custom non-commercial | Upscaling | NOT ADOPTED — non-commercial weights |
| CodeFormer | S-Lab NTU | S-Lab License (non-commercial) | Face Restore | NOT ADOPTED — non-commercial only |
| Moshi | Kyutai Labs | Multiple licence subdirs | Speech | Watch list — licence structure requires per-component audit; Wave O |
| Gap | Competitor | Wave N Feature | Closes? |
|---|---|---|---|
| Video generation on mid-range GPUs (12 GB) | Runway Gen-4, Pika 2.0 (cloud-only) | N3.1 CogVideoX-5B | Y — local T2V on RTX 3060 |
| Video generation acceleration (consumer) | Sora, Luma Dream Machine (cloud) | N1.1 FastVideo | Y — >50× speedup on 4090 |
| A14B I2V on consumer 24 GB cards | — | N1.2 LightX2V 4-step FP8 | Y — 80 GB → 24 GB via quantization |
| Per-frame video masking (rotoscoping) | After Effects Roto Brush | N2.1 SAM2 | Y — prompt-driven, propagates entire video |
| Depth-of-field + parallax compositor | DaVinci Fusion, AE Camera Lens Blur | N2.2 + N2.3 | Y — real-time depth maps + layered composite |
| Natural-language clip search | Frame.io AI, Descript AI Scenes | N3.2 Qwen2.5-VL | Y — semantic video search across clip library |
| Contextual multi-speaker dialogue audio | ElevenLabs Conversations, Synthesia | N3.4 CSM-1B | Y — local, context-aware, free |
- SAM2 CUDA kernel on Windows: SAM2 requires compiling a custom CUDA extension (`pip install -e .` in the cloned repo). On Windows, WSL2 with Ubuntu is the path of least resistance. Without WSL, some post-processing features (connected-component labels) may fail; the mask prediction itself still works. Document this in the N2.1 503 install hint with a fallback message.
- SAM2 model size: `sam2.1_hiera_small` (46M, 85 FPS) is the recommended default — a good accuracy/speed balance. `sam2.1_hiera_large` (224M, 40 FPS) is better for complex occlusion cases. Gate size selection via a `?model=small|base|large` param; default `small`.
- FastVideo VSA kernel: the Video Sparse Attention kernel requires a build step (`uv pip install fastvideo` handles it on supported platforms). If VSA fails to compile, FastVideo falls back to FlashAttention, which is still faster than baseline but not 50×. Document the `FASTVIDEO_ATTENTION_BACKEND` env var.
- LightX2V Windows support: LightX2V was developed primarily for Linux and server deployments. On Windows, the quantization kernels may require manual CUDA toolkit alignment. Note WSL2 as the recommended path for N1.2 on Windows hosts; flag this in the `check_lightx2v_available()` output with a `{windows_note: true}` field.
- CogVideoX weight size: CogVideoX-5B is ~18 GB; CogVideoX1.5-5B is ~20 GB; CogVideoX-2B is ~6 GB. Follow the same pre-flight size check pattern as FLUX Kontext (M2.4): return `{available: false, reason: "weights_not_downloaded", size_gb: N}` and require explicit download confirmation.
- CogVideoX prompt length: the model is trained on long, detailed prompts. Add a `POST /generate/cogvideox/enhance_prompt` step (using GLM-4 or a local LLM via Ollama) analogous to CogVideoX's own `convert_demo.py` before generating. Short prompts produce noticeably weaker results.
- Qwen2.5-VL VRAM vs CPU: Qwen2.5-VL-7B is the quality-recommended model (~16 GB weights) but Qwen2.5-VL-3B (~8 GB) runs comfortably on a 12 GB card. Add a `?model=7b|3b` param; default `3b` to ensure it works on the widest hardware range.
- CSM Llama dependency gate: CSM requires Llama-3.2-1B weights, which are gated on HuggingFace (requires acceptance of the Meta Llama Community License). `check_csm_available()` must verify both the CSM weights AND the Llama-3.2-1B weights are present locally before returning `available: true`. The consent prompt should display the Meta Llama Community License URL. Store a `csm_llama_ack` timestamp in `config.json` alongside `moonshine_multilingual_ack` (see the sketch after this list).
- AudioCraft / MusicGen weights are CC-BY-NC 4.0: the code is MIT but the model weights are non-commercial. Do not adopt regardless of audio quality. ACE-Step (L2.2) and DiffRhythm (M1.3) are the correct alternatives for music generation.
- DiffRhythm+ maturity: the DiffRhythm+ paper (arXiv:2507.12890) was published July 2025. Confirm production-stable weights are available on HuggingFace before shipping N3.3; if not, fall back to v1.2 and leave plus as a `?model=plus` stub returning 503 with an install hint.
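A minimal sketch of the consent-timestamp gate shared by the CSM and Moonshine-multilingual notes above; the config path and key names come from those notes:

```python
import json
from datetime import datetime
from pathlib import Path

CONFIG_PATH = Path("opencut/config.json")

def has_licence_ack(key: str = "csm_llama_ack") -> bool:
    """True only if the config stores a parseable ISO 8601 acknowledgment timestamp."""
    try:
        value = json.loads(CONFIG_PATH.read_text(encoding="utf-8")).get(key)
        datetime.fromisoformat(value)        # raises if missing or not a valid timestamp
        return True
    except (FileNotFoundError, json.JSONDecodeError, TypeError, ValueError):
        return False
```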
| Release | ETA | Features |
|---|---|---|
| v1.39.0 | 2027-Q1 | N1.1 FastVideo, N1.2 LightX2V |
| v1.40.0 | 2027-Q2 | N2.1 SAM2, N2.2 Depth-Anything-V2, N2.3 depth compositor pipeline |
| v1.41.0 | 2027-Q2 | N3.1 CogVideoX-5B, N3.2 Qwen2.5-VL, N3.3 DiffRhythm+, N3.4 CSM-1B |
- AudioCraft / MusicGen / AudioGen (Meta, facebookresearch/audiocraft) — Code: MIT, but model weights: CC-BY-NC 4.0 (non-commercial). OpenCut cannot ship CC-BY-NC default-on features for commercial workflows. ACE-Step (L2.2) is the correct alternative for music generation; no gap remains.
- SUPIR image restoration (SupPixel Pty Ltd) — Custom SUPIR licence, non-commercial only; commercial use requires written permission from SupPixel. NOT viable. The Wave L1.4 upscaling hub (RealESRGAN + BSRGAN) covers the upscaling use case under permissive licences.
- CodeFormer face restoration (S-Lab NTU) — S-Lab License 1.0, explicitly non-commercial. NOT viable. GFPGAN (already in OpenCut, L1.7) covers the face restoration use case.
- Moshi (Kyutai Labs) — Multiple licence files across subdirectories; per-component licence audit required before adoption. Watch list for Wave O pending audit result.
- Sesame CSM multilingual — CSM-1B is English-only; the model has some non-English capacity from training data contamination but is not officially supported. Multilingual speech generation covered by Chatterbox-Multilingual (M1.1).
- CogVideoX-2B GTX 1080TI ultra-low-spec mode — GTX 1080TI (11 GB) can technically run CogVideoX-2B but with very slow inference (~20+ minutes per clip). Adding a dedicated ultra-low-spec mode adds complexity for marginal value. Default to CogVideoX-5B (12 GB / RTX 3060+) and document 2B as a community-supported variant.
- Wan2.2 A14B single-card full-precision — 80 GB VRAM requirement makes single-card consumer deployment infeasible. LightX2V N1.2 FP8 4-step is the practical path for A14B on 24 GB. Wave O may revisit with INT4 GGUF quantization.
- LTX-2 audio-video generation (Lightricks) — LightX2V supports LTX-2 (Jan 2026). LTX-2 licence TBD at time of this survey. Monitor for Wave O.
- SAM 2 / SAM 2.1 — facebookresearch/sam2 (Apache-2, July 2024; SAM 2.1 Sep 2024); arXiv:2408.00714; 4 model sizes 38M–224M
- Depth-Anything-V2 — DepthAnything/Depth-Anything-V2 (Apache-2, June 2024); NeurIPS 2024; Small 24M → Large 335M
- FastVideo — hao-ai-lab/FastVideo (Apache-2, April 2025); `FastWan2.2-TI2V-5B-Diffusers` distilled model; `pip install fastvideo`; Windows supported; real-time 1080P demo March 2026
- LightX2V — ModelTC/lightx2v (Apache-2); `Wan2.2-I2V-A14B-4step-720p` weights April 20, 2026; FP8 + NVFP4 quantization; combined 42× acceleration
- CogVideo / CogVideoX — THUDM/CogVideo (Apache-2, Aug 2024); CogVideoX-5B on RTX 3060; CogVideoX1.5-5B for 10s videos; diffusers native; arXiv:2408.06072
- Qwen2.5-VL — QwenLM/Qwen2.5-VL (Apache-2, Sep 2024); 3B and 7B consumer models; `pip install transformers qwen_vl_utils`
- DiffRhythm+ — ASLP-lab/DiffRhythm (Apache-2); DiffRhythm+ arXiv:2507.12890 (July 2025)
- Sesame CSM-1B — SesameAILabs/csm (Apache-2, March 2025); HuggingFace native as of `transformers` v4.52.1; requires Llama-3.2-1B backbone (Meta Community License, gated)
- AudioCraft — facebookresearch/audiocraft (code MIT, weights CC-BY-NC 4.0); NOT adopted — weights non-commercial
- SUPIR — Fanghua-Yu/SUPIR — Custom SUPIR licence, non-commercial; NOT adopted
- CodeFormer — sczhou/CodeFormer — S-Lab License 1.0, non-commercial; NOT adopted
Updated: 2026-04-24
Baseline: v1.41.0 (post-Wave N; CogVideoX-5B, SAM2, Depth-Anything-V2, Qwen2.5-VL, FastVideo, LightX2V, DiffRhythm+, CSM-1B all shipped)
Research pass: April 2026 GitHub OSS survey — Dia, Parler-TTS, LTX-Video, YuE, LTX-2, HunyuanVideo, F5-TTS, Mochi-1, Stable Audio Open
This wave covers three distinct domains:
- Two new TTS models that fill gaps in the voice synthesis suite: Dia delivers production-quality multi-speaker scripted dialogue with nonverbal sounds; Parler-TTS adds natural-language voice description as a new interaction paradigm distinct from voice-cloning (Chatterbox) and preset voices (Kokoro).
- LTX-Video (LTXV 0.9.8) as a fourth T2V engine — the fastest DiT-based model available under a fully permissive Apache-2 licence, with up to 60-second video, multi-keyframe conditioning, and I2V support.
- YuE — the first fully open-source (Apache-2) lyrics-to-full-song model capable of generating complete songs with vocals and accompaniment, filling the only remaining major audio creation gap in OpenCut.
Goal: Add two TTS models that cover usage patterns not addressed by Chatterbox (audio-prompt voice clone), Kokoro (style-preset TTS), or CSM-1B (dialogue context): (a) scripted multi-speaker dialogue with nonverbal sounds for podcast/explainer workflows; (b) natural-language voice description for any-voice TTS without a reference audio clip.
New required deps: dia (Apache-2), parler_tts ≥0.1 (Apache-2)
New routes: ~12
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| O1.1 | Dia 1.6B dialogue TTS (Nari Labs, Apache-2, April 2025) — 1.6B-parameter model that generates fully-scripted multi-speaker dialogue from a transcript using [S1]/[S2] speaker tags. Unique capabilities: (a) nonverbal generation (laughs), (coughs), (sighs), (applause), etc. — first local model to do this convincingly; (b) voice cloning from 5–10s reference audio; (c) native HuggingFace Transformers support (DiaForConditionalGeneration); (d) 4.4 GB VRAM on bfloat16. English only. Dia2 (the successor) also released Nov 2025 on GitHub. Use case: scripted podcast creation, explainer video narration with two presenters, training video VO. POST /audio/speech/dia accepts an array of `{speaker: "S1" \| "S2", text: "...", nonverbals: ["(laughs)"]}` entries and returns a single rendered audio file (request sketch below this table). | POST /audio/speech/dia, POST /audio/speech/dia/preview, GET /audio/speech/dia/info | core/tts_dia.py | dia (Apache-2) — new dep; Dia-1.6B-0626 weights via HuggingFace (~3.2 GB) | M | Apache-2 | Two-speaker scripted dialogue with nonverbal sounds and voice cloning, locally |
| O1.2 | Parler-TTS natural language voice description (HuggingFace, Apache-2, Aug 2024) — TTS model that generates speech matching a free-text voice description: "A female speaker delivers animated speech at a moderate pace with very clear audio." Both code and weights are fully permissive Apache-2, trained on 45k hours of audiobook data. Ships Mini (880M, ~2 GB) and Large (2.3B, ~5 GB) variants. 34 named speaker presets for consistent cross-generation voice identity. SDPA + Flash Attention 2 for fast inference; torch.compile compatible. Use case: creating a branded AI narrator by describing the desired voice in plain English, without needing a reference audio clip. POST /audio/speech/parler accepts {text: "...", description: "..."} and returns audio. GET /audio/speech/parler/speakers returns the 34 named preset speakers. | POST /audio/speech/parler, GET /audio/speech/parler/speakers, GET /audio/speech/parler/info | core/tts_parler.py | parler_tts ≥0.1 (Apache-2) — new dep; parler-tts-mini-v1 weights (~2 GB) | S | Apache-2 | Natural language voice description TTS; closes ElevenLabs voice design gap without API keys |
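For orientation, a hypothetical request body for `POST /audio/speech/dia` and a helper that flattens it into Dia's `[S1]`/`[S2]` transcript format. The field names (`entries`, `reference_audio_b64`) and the helper are illustrative assumptions, not the final API:

```python
# Illustrative only; the shipped schema may differ.
dia_request = {
    "entries": [
        {"speaker": "S1", "text": "Welcome back to the show.", "nonverbals": ["(laughs)"]},
        {"speaker": "S2", "text": "Glad to be here.", "nonverbals": []},
    ],
    "reference_audio_b64": None,  # optional 5-10s clip for voice cloning
}

def to_dia_script(entries: list[dict]) -> str:
    """Flatten request entries into a [S1]/[S2]-tagged transcript string."""
    parts = []
    for entry in entries:
        nonverbals = " ".join(entry.get("nonverbals", []))
        parts.append(f"[{entry['speaker']}] {entry['text']} {nonverbals}".strip())
    return " ".join(parts)
```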
Goal: Add LTX-Video (LTXV 0.9.8) as a fourth T2V/I2V/V2V engine. LTX-Video is the fastest production-ready DiT-based video model under a fully permissive Apache-2 licence, with unique multi-keyframe conditioning that enables storyboard-to-video workflows not possible with Wan2.2 or CogVideoX.
New required deps: None — LTX-Video uses diffusers (already present)
New routes: ~14
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| O2.1 | LTX-Video LTXV 0.9.8 T2V/I2V (Lightricks, Apache-2, July 2025 for 0.9.8) — fastest Apache-2-licenced DiT video generation model. Key capabilities: (a) up to 60 seconds of video with LTXV-13B (the distilled long-form variant); (b) very low latency — "real-time" for short clips on a 4090; (c) native diffusers support (pip install -e .); (d) T2V, I2V, video extension forward/backward, video-to-video. Ships as /generate/ltxv/t2v and /generate/ltxv/i2v following the same pattern as existing Wan2.2 routes. Backend toggleable via GET /generate/backends. | POST /generate/ltxv/t2v, POST /generate/ltxv/i2v, GET /generate/ltxv/info | core/gen_video_ltxv.py | diffusers ≥0.32 (already present) + LTXV-2B or LTXV-13B weights (~6 GB / ~25 GB) | M | Apache-2 | Fastest Apache-2 video model; low latency previews before committing to Wan2.2 generation |
| O2.2 | LTX-Video multi-keyframe storyboard-to-video — unique LTXV capability not present in any other Wave model: multi-keyframe conditioning accepts 2+ reference images as anchor frames and generates the video that flows between/through them. Use case: the user uploads a storyboard (sequence of key images) and LTX-Video generates a coherent video that matches each frame at the designated timecode. POST /generate/ltxv/keyframes accepts [{time_sec: N, image_b64: "..."}] and returns a video whose frames match each keyframe at the specified timestamp (validation sketch below this table). Builds on LTX-Video's IC-LoRA control model support. | POST /generate/ltxv/keyframes, GET /generate/ltxv/keyframes/info | core/gen_video_ltxv.py (extend) | diffusers + IC-LoRA control model weights via HuggingFace (Apache-2); no new pip dep | M | Apache-2 | Storyboard-to-video workflow; closes Runway Act-One / Kling keyframe gap |
| O2.3 | LTX-Video LoRA fine-tuning pipeline — integrates with the LTX-Video-Trainer to expose a POST /train/ltxv/lora endpoint that fine-tunes LTXV on a user-provided set of reference videos (brand style, character consistency, motion style). Same pattern as CogVideoX LoRA (N3.1) but for LTXV's aesthetics. Requires the user to provide 10–30 reference clips; training runs locally in the background as an async job. | POST /train/ltxv/lora, GET /train/ltxv/lora/status, GET /train/ltxv/lora/list | core/train_ltxv_lora.py | ltxv_trainer via pip install git+https://github.com/Lightricks/LTX-Video-Trainer (Apache-2) — new dep | L | Apache-2 | Brand-style video LoRA training on consumer hardware; closes Runway Explore style fine-tuning gap |
| O2.4 | LTX-Video video extension — LTXV supports forward and backward temporal extension: given an existing video clip, generate N more seconds in the forward direction (extend the story) or backward direction (generate a prequel). Exposed as POST /generate/ltxv/extend with `{video_b64: "...", direction: "forward" \| "backward", duration_sec: N}`. Use case: a short social clip extended into a long-form edit, or a generated clip extended at both ends for B-roll material. | POST /generate/ltxv/extend, GET /generate/ltxv/extend/info | core/gen_video_ltxv.py (extend) | diffusers (already present) | S | Apache-2 | Forward and backward clip extension locally; closes Runway Extend Clip gap |
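A small validation sketch for the O2.2 keyframe payload described above; the helper name and error messages are assumptions, and the 60-second ceiling follows the LTXV-13B long-form figure rather than shipped behaviour:

```python
# Sketch: validate and order the [{time_sec, image_b64}] keyframe list
# before handing it to the LTXV backend. Names are illustrative.
def validate_keyframes(keyframes: list[dict], max_duration_sec: float = 60.0) -> list[dict]:
    if len(keyframes) < 2:
        raise ValueError("multi-keyframe conditioning needs at least 2 anchor frames")
    ordered = sorted(keyframes, key=lambda k: k["time_sec"])
    if ordered[0]["time_sec"] < 0 or ordered[-1]["time_sec"] > max_duration_sec:
        raise ValueError(f"keyframe timestamps must lie within 0-{max_duration_sec}s")
    for frame in ordered:
        if not frame.get("image_b64"):
            raise ValueError("every keyframe needs an image_b64 anchor image")
    return ordered
```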
Goal: Add YuE lyrics-to-full-song generation (the only missing music creation primitive now that ACE-Step covers instrumental and DiffRhythm covers sync-to-video); add GGUF quantization support as a cross-cutting platform improvement to make ultra-large models accessible on sub-16 GB hardware.
New required deps: yue-inference (Apache-2, optional heavy dep), llama-cpp-python ≥0.3 (MIT)
New routes: ~10
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| O3.1 | YuE lyrics-to-full-song (HKUST/M-A-P, Apache-2, January 2025) — open-source foundation model for lyrics2song: takes a genre-tag prompt + structured lyrics (with [verse]/[chorus]/[bridge] section labels) and generates a complete multi-minute song with a vocal track and a backing accompaniment track. Key capabilities: (a) multilingual — English, Mandarin, Cantonese, Japanese, Korean; (b) style transfer via ICL (dual-track mode: provide reference vocal + instrumental; model writes a new song in the same style); (c) LoRA fine-tuning for custom artist styles (June 2025); (d) incremental generation — complete a song session-by-session (verse first, then chorus). Hardware: 2 sessions (verse + chorus) fits in 24 GB VRAM. S1 model: YuE-s1-7B-anneal-en-cot (7B LM); S2 model: YuE-s2-1B-general (1B decoder). RT: 30s audio takes ~360s on RTX 4090. Ships as POST /audio/music/yue; integrates with OpenCut's job queue for long-running generation. | POST /audio/music/yue, POST /audio/music/yue/icl, GET /audio/music/yue/info | core/music_yue.py | YuE inference requirements: transformers, flash-attn, xcodec-mini; weights auto-downloaded from HuggingFace (~28 GB total for s1+s2) | L | Apache-2 | Only local model generating complete vocal+accompaniment songs from lyrics; closes Suno/Udio gap entirely |
| O3.2 | GGUF quantization engine — integrate llama-cpp-python as a CPU/low-VRAM fallback inference engine for any model that has a published GGUF checkpoint on HuggingFace. Primary targets: (a) Qwen2.5-VL (N3.2) — GGUF-Q4_K_M reduces from 16 GB to ~5 GB; (b) YuE-s1-7B GGUF Q4_K_M (community released, when available); (c) future LLM components. Adds ?quant=gguf_q4 param to routes that have GGUF-compatible backends; check_gguf_available() validates the llama-cpp-python package and the specific model GGUF file. Enables CPU-only operation for text-based AI features on machines without discrete GPU. | GET /system/quantization/gguf/info, GET /system/quantization/gguf/models, POST /system/quantization/gguf/download | core/gguf_backend.py | llama-cpp-python ≥0.3 (MIT) — new dep; GGUF model files via HuggingFace | M | MIT | CPU/low-VRAM fallback for all LLM/VLM components; enables OpenCut on laptops and integrated-graphics machines |
| O3.3 | Multi-GPU task scheduler — route large model inference tasks (Wan2.2 A14B, YuE 7B+1B, Mochi-1, CogVideoX1.5-5B 20 GB) to a multi-GPU pool automatically. Detects available GPU topology via torch.cuda.device_count() and tensor-parallel hints, maps models to device groups, and serialises requests through the existing @async_job queue to prevent GPU contention. Exposes GET /system/gpu/topology and GET /system/gpu/allocation for the UXP panel's resource monitor (topology sketch below this table). Required prerequisite for future 70B+ model support and for running YuE full-song (4+ sessions requiring 80 GB) on dual A100 workstations. | GET /system/gpu/topology, GET /system/gpu/allocation, POST /system/gpu/prefer | core/gpu_scheduler.py | torch (already present); no new dep | M | N/A (platform code) | Enables large model use on multi-GPU workstations; prerequisite for 70B+ future models |
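A minimal sketch of the topology probe behind `GET /system/gpu/topology` (O3.3), using only `torch.cuda` introspection; the response field names are assumptions:

```python
import torch

def gpu_topology() -> dict:
    """Enumerate visible CUDA devices for the UXP resource monitor."""
    if not torch.cuda.is_available():
        return {"gpus": [], "total_vram_gb": 0.0}
    gpus = []
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        gpus.append({
            "index": idx,
            "name": props.name,
            "vram_gb": round(props.total_memory / 1024**3, 1),
        })
    return {"gpus": gpus, "total_vram_gb": round(sum(g["vram_gb"] for g in gpus), 1)}
```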
| Tool | Org | Licence | Category | Wave O Role |
|---|---|---|---|---|
| Dia 1.6B | Nari Labs | Apache-2 | Dialogue TTS | O1.1 — Two-speaker scripted dialogue with nonverbals |
| Parler-TTS | HuggingFace | Apache-2 | TTS | O1.2 — Natural language voice description synthesis |
| LTX-Video 0.9.8 | Lightricks | Apache-2 | T2V/I2V | O2.1–O2.4 — Fast multi-keyframe; up to 60s; LoRA |
| YuE | HKUST/M-A-P | Apache-2 | Music Gen | O3.1 — Lyrics-to-full-song; vocal + backing track |
| LTX-2 | Lightricks | LTX-2 Community Licence | T2V (audio+video) | NOT ADOPTED — $10M revenue threshold for commercial |
| HunyuanVideo | Tencent-Hunyuan | Tencent Hunyuan Community Licence | T2V | NOT ADOPTED — EU/UK/South Korea excluded; geographic restrictions |
| F5-TTS | SWivid/X-LANCE | MIT code + CC-BY-NC weights | TTS | NOT ADOPTED — weights non-commercial (Emilia dataset) |
| Mochi-1 | Genmo | Apache-2 | T2V | Watch list — Apache-2 ✓ but single-GPU requires 60 GB VRAM; Wave P |
| Stable Audio Open | StabilityAI | Code MIT; model weights SA-CL (non-commercial) | Music Gen | NOT ADOPTED — weights non-commercial |
| Dia2 | Nari Labs | TBD | Dialogue TTS | Watch list — Dia2 released Nov 2025; licence audit required for Wave P |
| Gap | Competitor | Wave O Feature | Closes? |
|---|---|---|---|
| Two-speaker scripted dialogue audio | Descript Overdub, ElevenLabs Studio | O1.1 Dia 1.6B | Y — nonverbal sounds + voice clone locally |
| Natural language voice design (no reference audio) | ElevenLabs Voice Design | O1.2 Parler-TTS | Y — free-text describes any voice |
| Fast preview T2V before committing to full render | — | O2.1 LTX-Video T2V | Y — near-real-time previews on 4090 |
| Storyboard-to-video (multi keyframe) | Runway Act-One, Kling Keyframe | O2.2 LTX-Video keyframes | Y — image sequence → coherent video |
| Video temporal extension | Runway Extend Clip | O2.4 LTX-Video extend | Y — forward + backward extension locally |
| Lyrics-to-full-song with vocals | Suno AI, Udio | O3.1 YuE | Y — open-weights, commercial OK, Apache-2 |
| CPU/laptop GPU model inference | Cloud-only TTS/VLM services | O3.2 GGUF engine | Y — Q4_K_M reduces 16 GB → 5 GB for VLM features |
- Dia English-only: Dia 1.6B and Dia2 are trained primarily on English. Other languages may produce incorrect pronunciation or degraded quality. Document this in the `GET /audio/speech/dia/info` response as `languages: ["en"]` and link to Chatterbox-Multilingual (M1.1) for non-English use.
- Dia nonverbal vocabulary: Only the listed nonverbals produce reliable output; using arbitrary descriptors causes audio artifacts. Hard-code the validated list in `tts_dia.py` and expose it via `GET /audio/speech/dia/nonverbals`. Reject or warn on unlisted tags.
- Parler-TTS voice consistency: Random-prompt mode produces a different voice every generation; use a named speaker from the 34 presets to get a consistent cross-generation identity. Document this in the UI with a speaker picker.
- Parler-TTS prompt engineering: "very clear audio" produces the highest quality output; "very noisy audio" produces noise. Expose a quality slider in the UI that appends the appropriate phrase automatically rather than expecting users to know this.
- LTX-Video LTXV-13B weight size: 25 GB weight file. Follow the same pre-flight pattern as FLUX Kontext: `{available: false, reason: "weights_not_downloaded", size_gb: 25}`. Default to LTXV-2B (~6 GB) for standard clips; LTXV-13B only for 60s long-form jobs.
- LTX-2 NOT adopted: LTX-2 (January 2026) is the successor to LTX-Video and adds synchronized audio+video generation, but ships under a custom community licence that requires a paid commercial agreement for entities with ≥$10M revenue. OpenCut cannot bundle LTX-2 for commercial use without guaranteeing every user meets the licence terms. Monitor for an Apache-2 re-release or equivalent permissive re-licensing before Wave P.
- HunyuanVideo geographic restriction: The Tencent Hunyuan Community Licence explicitly excludes EU, UK, and South Korea from the Territory. OpenCut ships globally including to EU/EEA users; bundling HunyuanVideo would mean European users are unintentionally unlicensed. Do not adopt under any circumstances until Tencent changes the licence.
- F5-TTS CC-BY-NC weights: The MIT code licence covers the inference framework, but the pre-trained weights were trained on the Emilia dataset under CC-BY-NC 4.0. The weights themselves are therefore non-commercial. Do not adopt. Note: if a user brings their own F5-TTS weights trained on permissively-licensed data, OpenCut can support the F5-TTS inference engine as a community plugin (via the Wave M plugin marketplace) without bundling the non-commercial weights.
- YuE VRAM: Full 4-session song (verse + pre-chorus + chorus + outro) requires 80 GB+ VRAM for parallel session generation. `check_yue_available()` must detect VRAM and cap `--run_n_segments` to 2 for cards with <24 GB, with a user notification. For multi-GPU setups (O3.3), YuE can be tensor-paralleled across 2× 40 GB cards.
- YuE generation latency: 30s of audio takes ~360s on an RTX 4090; for a full 4-session song that's ~24 minutes. Always run `POST /audio/music/yue` as a fully async background job with an SSE status stream. Disable the "Cancel" button mid-session to avoid partial-write corruption in the YuE xcodec_mini decoder.
- GGUF weight availability: Not all model architectures have community GGUF checkpoints yet. `core/gguf_backend.py` must handle a missing GGUF gracefully: return `{gguf_available: false, reason: "no_gguf_checkpoint", alternatives: ["download_full_weights"]}` rather than a 503 (see the sketch after this list). Track GGUF checkpoint status per model in a `gguf_manifest.json`.
- Stable Audio Open weights: The Stability AI `stable-audio-open-1.0` model code is MIT but the weights are under the Stability AI Community Licence, which prohibits commercial use. Do not adopt. ACE-Step (L2.2) covers the ambient/sfx music generation use case under Apache-2.
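A sketch of the graceful GGUF fallback described in the GGUF weight availability note above; the manifest location and entry fields are assumptions:

```python
import json
from pathlib import Path

def check_gguf_available(model_id: str,
                         manifest_path: Path = Path("gguf_manifest.json")) -> dict:
    """Return GGUF status for a model without raising a 503."""
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    entry = manifest.get(model_id)
    if entry is None:
        # No community GGUF checkpoint tracked for this architecture yet.
        return {"gguf_available": False,
                "reason": "no_gguf_checkpoint",
                "alternatives": ["download_full_weights"]}
    gguf_file = Path(entry["path"])
    if not gguf_file.exists():
        return {"gguf_available": False,
                "reason": "gguf_not_downloaded",
                "download_url": entry.get("url")}
    return {"gguf_available": True,
            "path": str(gguf_file),
            "quant": entry.get("quant", "Q4_K_M")}
```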
| Release | ETA | Features |
|---|---|---|
| v1.42.0 | 2027-Q3 | O1.1 Dia, O1.2 Parler-TTS |
| v1.43.0 | 2027-Q3 | O2.1 LTX-Video T2V/I2V, O2.2 multi-keyframe, O2.3 LoRA, O2.4 video extend |
| v1.44.0 | 2027-Q4 | O3.1 YuE, O3.2 GGUF engine, O3.3 multi-GPU scheduler |
- LTX-2 (Lightricks, Jan 2026) — Custom LTX-2 Community Licence; entities with annual revenues ≥$10M must obtain a separate paid commercial licence from Lightricks. OpenCut cannot verify revenue thresholds for individual users; this creates an unmanageable compliance burden. Monitor for Apache-2 re-release or equivalent. LTX-Video (0.9.8, Apache-2) adopted in its place.
- HunyuanVideo (Tencent-Hunyuan, Dec 2024) — Tencent Hunyuan Community Licence explicitly excludes the EU, UK, and South Korea. OpenCut cannot ship a model restricted by user geography. Hard pass unless Tencent releases under Apache-2.
- F5-TTS (SWivid/X-LANCE, 2024) — MIT inference code, CC-BY-NC 4.0 model weights (Emilia training data). The pre-trained weights cannot be used commercially. The F5-TTS architecture can be supported as a plugin inference engine for user-supplied custom weights via the Wave M plugin marketplace — document this in the plugin SDK.
- Mochi-1 (Genmo, Nov 2024) — Apache-2 ✓ but requires approximately 60 GB VRAM for single-GPU inference in this repository. ComfyUI can reduce this to ~20 GB but that requires a different integration path. Watch list for Wave P pending a consumer-VRAM-optimised inference implementation (target: <16 GB with quantization).
- Stable Audio Open (StabilityAI, 2024) — Stability AI Community Licence (non-commercial for model weights). NOT viable. ACE-Step (L2.2) + DiffRhythm (M1.3) + YuE (O3.1) collectively cover all music generation use cases.
- Dia2 (Nari Labs, Nov 2025) — Successor to Dia with unspecified improvements. Licence not yet audited at time of this survey. Dia 1.6B adopted; Dia2 added to watch list for Wave P after licence confirmation.
- YuE full-song on consumer hardware — a 4+ session full song requires 80 GB VRAM. Deferred pending a GGUF Q4 community release for YuE-s1-7B (GitHub issue open: `multimodal-art-projection/YuE#467`). Once GGUF is available, the O3.2 GGUF engine can serve as the fallback.
- Dia 1.6B — nari-labs/dia (Apache-2, April 2025); Dia2: nari-labs/dia2 (Nov 2025); HF: `nari-labs/Dia-1.6B-0626`; HF Transformers native (DiaForConditionalGeneration)
- Parler-TTS — huggingface/parler-tts (Apache-2, Aug 2024); Mini 880M + Large 2.3B; arXiv:2402.01912 (Lyth & King, Stability AI); `pip install git+https://github.com/huggingface/parler-tts.git`
- LTX-Video — Lightricks/LTX-Video (Apache-2); LTXV 0.9.8 (July 2025): up to 60s, LTXV-13B; arXiv:2501.00103; diffusers native; LTX-Video-Trainer for LoRA
- LTX-2 — Lightricks/LTX-2 (LTX-2 Community Licence, Jan 2026); audio+video; 4K/50FPS; NOT adopted — commercial revenue threshold restriction
- YuE — multimodal-art-projection/YuE (Apache-2, Jan 2025; Apache-2 relicensing confirmed Jan 30 2025); arXiv:2503.08638; YuE-s1-7B + YuE-s2-1B; multilingual; LoRA fine-tuning (June 2025)
- HunyuanVideo — Tencent-Hunyuan/HunyuanVideo (Tencent Hunyuan Community Licence, Dec 2024); geographic restrictions — NOT adopted
- F5-TTS — SWivid/F5-TTS (MIT code, CC-BY-NC 4.0 weights); arXiv:2410.06885; weights non-commercial — NOT adopted
- Mochi-1 — genmoai/mochi (Apache-2, Nov 2024); 10B AsymmDiT; 60 GB VRAM single-GPU; watch list
- Stable Audio Open — (Stability AI, 2024); code MIT, weights Stability AI Community Licence (non-commercial) — NOT adopted
Updated: 2026-04-24
Baseline: v1.44.0 (post-Wave O; Dia, Parler-TTS, LTX-Video, YuE, GGUF engine, multi-GPU scheduler all shipped)
Research pass: April 2026 GitHub OSS survey — ConsisID, Allegro, HiDream-I1/E1, CogView4, Open-Sora 2.0, Qwen2.5-Omni, DepthCrafter, VoiceCraft, Open-Sora-Plan
This wave covers three distinct capabilities:
- Video generation with identity consistency (a person's face held constant across the entire video from a single reference photo) and a new lowest-VRAM T2V option (9.3 GB) that unlocks consumer-grade GPUs for the first time.
- A SOTA text-to-image upgrade — HiDream-I1 (17B Sparse DiT) and its companion instruction-based editor HiDream-E1, plus CogView4-6B as the first bilingual (English + Chinese) T2I model in the stack.
- Multimodal video intelligence — Qwen2.5-Omni processes video frames + audio simultaneously and generates spoken narration audio as output, enabling automated video commentary and analysis workflows; and Open-Sora 2.0 as an 11B high-quality T2V completing the video generation tier.
Goal: Add two video generation models addressing gaps not covered by any prior wave: (a) identity-preserving T2V where a specific person (from a reference face photo) appears consistently throughout the generated video — critical for brand ambassador, tutorial, and social content workflows; (b) a lightweight T2V at 9.3 GB VRAM, the lowest of any model in the stack, making T2V accessible to users with 12 GB cards for the first time.
New required deps: None — both models use diffusers (already present) ≥0.33
New routes: ~14
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| P1.1 | ConsisID identity-preserving T2V (PKU-YuanGroup, Apache-2, CVPR 2025 Highlight) — tuning-free DiT-based identity-preserving text-to-video model built on CogVideoX-5B infrastructure. Accepts a face reference image (a photo of any person) and generates a video in which that person's face, identity, and appearance remain consistent across every frame throughout the clip. Uses a frequency decomposition approach to decouple identity features from motion, so the model generates natural dynamic motion while locking the face. Ships as POST /generate/consisid/t2v with {prompt: "...", face_image_b64: "..."}. Hardware: ~18 GB VRAM (same as CogVideoX-5B); diffusers 0.33.0+ native. Use case: create a video where a specific person is the subject (brand spokesperson, presenter, social avatar, training character) without any actor or green screen. | POST /generate/consisid/t2v, POST /generate/consisid/preview, GET /generate/consisid/info | core/gen_video_consisid.py | diffusers ≥0.33 (already present) + ConsisID weights (~18 GB) via HuggingFace | M | Apache-2 | Identity-preserving T2V; closes Runway "reference person" gap locally |
| P1.2 | Allegro lightweight T2V + TI2V (rhymes-ai, Apache-2, Oct 2024) — 2.8B DiT T2V model; the lowest-VRAM production T2V in the stack at 9.3 GB with --enable_cpu_offload (vs ~18 GB for Wan2.2/CogVideoX and ~6 GB for LTX-Video). Generates 6-second, 720×1280 video at 15 FPS. Allegro-TI2V variant adds first-frame + optional last-frame image conditioning for first-and-last-frame interpolation workflows. Both variants are in diffusers 0.32.0+. A Presto fine-tune (rhymes-ai) extends clips to longer durations. Use case: users with 12 GB VRAM cards can run T2V for the first time; first-and-last frame interpolation is unique in the stack. Ships as /generate/allegro/t2v and /generate/allegro/ti2v (see the backend-selection sketch below this table). | POST /generate/allegro/t2v, POST /generate/allegro/ti2v, GET /generate/allegro/info | core/gen_video_allegro.py | diffusers ≥0.32 (already present) + Allegro-T2V and Allegro-TI2V weights (~5 GB each) via HuggingFace | M | Apache-2 | Lowest-VRAM T2V (9.3 GB); first-and-last-frame interpolation; unlocks 12 GB GPU users |
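As an illustration of how the 9.3 GB figure fits into the wider T2V tier, a hedged backend-suggestion sketch; the helper name, backend labels, and thresholds are assumptions based on the VRAM numbers quoted in this document (~18 GB for CogVideoX-5B/Wan2.2, 9.3 GB for Allegro with CPU offload, ~6 GB for LTXV-2B), not shipped routing logic:

```python
import torch

def suggest_t2v_backend() -> str:
    """Pick a default T2V backend from detected VRAM (sketch, not shipped logic)."""
    if not torch.cuda.is_available():
        return "none"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 18:
        return "cogvideox-5b"   # full-quality tier
    if vram_gb >= 10:
        return "allegro"        # 9.3 GB with CPU offload (P1.2)
    if vram_gb >= 6:
        return "ltxv-2b"        # lightweight Wave O option
    return "none"
```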
Goal: Elevate the text-to-image tier with models that exceed FLUX.1-dev quality. HiDream-I1 is the new SOTA T2I at 17B parameters; HiDream-E1 adds natural-language instruction-based editing ("make the car red", "remove the person"). CogView4-6B adds the first Chinese-language T2I capability in the stack.
New required deps: HiDream-I1 requires meta-llama/Meta-Llama-3.1-8B-Instruct (same Meta Community Licence handling as CSM-1B — user opt-in gate)
New routes: ~16
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| P2.1 | HiDream-I1 SOTA text-to-image (HiDream.ai, MIT, April 2025) — 17B Sparse Diffusion Transformer text-to-image model that achieves the highest scores on DPG-Bench (85.89 overall), GenEval (0.83), and HPSv2.1 (33.82) among all open models — surpassing FLUX.1-dev, SD3-Medium, DALL-E 3, and Janus-Pro-7B. Ships in three inference-speed variants: Full (50 steps), Dev (28 steps), Fast (16 steps). Diffusers HiDreamImagePipeline. Requires meta-llama/Meta-Llama-3.1-8B-Instruct (~16 GB) as the text encoder backbone — same gate as CSM-1B; expose via the existing llama_ack user consent mechanism in opencut/config.json. Combined weight footprint: ~33 GB (17B DiT + Llama-3.1-8B). Use the Fast variant as the default; Full for highest quality (gate and variant sketch below this table). | POST /image/generate/hidream, GET /image/generate/hidream/info, GET /image/generate/hidream/variants | core/t2i_hidream.py | diffusers ≥0.32 (already present) + HiDream-I1 weights (~17 GB) + meta-llama/Meta-Llama-3.1-8B-Instruct (Meta Community Licence, gated HF; user opt-in) | M | MIT (code + HiDream weights); Meta Community Licence (Llama backbone, gated) | SOTA T2I quality; closes Midjourney generation quality gap; surpasses FLUX.1-dev on all benchmarks |
| P2.2 | HiDream-E1 instruction image editing (HiDream.ai, MIT, July 2025) — instruction-based image editing companion to HiDream-I1: accepts an image + a natural language edit instruction and returns the edited image. Capabilities: style transfer, object addition/removal, color change, attribute modification, background swap. Reuses the HiDream-I1 weights + Llama backbone; no additional weight download once I1 is installed. Ships as POST /image/edit/hidream with {image_b64: "...", instruction: "make the car bright red"}. Use case: per-frame AI editing on video stills (combined with TokenFlow for video propagation, same pipeline as FLUX Kontext P2 from Wave M). | POST /image/edit/hidream, POST /image/edit/hidream/batch, GET /image/edit/hidream/info | core/img_edit_hidream.py | Same as P2.1 (no additional deps) + HiDream-E1 weights (~17 GB, separate model from I1) | S | MIT | Natural language image editing; complements FLUX Kontext (Wave M2.4); enables instruction-based per-frame editing workflow |
| P2.3 | CogView4-6B bilingual text-to-image (ZhipuAI/THUDM, Apache-2, March 2025) — 6B parameter text-to-image DiT with native bilingual support (English + simplified Chinese input). At 13 GB VRAM with CPU offload and int4 text encoder, it is the lightest full-quality T2I in the stack. Ships CogView4Pipeline in diffusers. Competitive with FLUX.1-dev on DPG-Bench (85.13 vs 83.79) and strong Chinese-language text accuracy. Unique value over HiDream-I1: (a) no gated Llama dependency, (b) 13 GB VRAM instead of 33 GB, (c) Chinese-language input. Fine-tunable with CogKit or finetrainers on a single 4090. | POST /image/generate/cogview4, GET /image/generate/cogview4/info | core/t2i_cogview4.py | diffusers ≥0.32 (already present) + CogView4-6B weights (~12 GB) | S | Apache-2 | Bilingual T2I (English + Chinese); 13 GB VRAM; no gated deps; competes with FLUX.1-dev quality |
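A hedged sketch of the P2.1 gate and variant handling: the step counts come from the Full/Dev/Fast description above, while the `llama_ack` lookup, config path, and helper name are assumptions about the existing consent mechanism rather than its actual implementation:

```python
import json
from pathlib import Path

HIDREAM_VARIANTS = {"full": 50, "dev": 28, "fast": 16}  # inference steps per variant

def check_hidream_available(config_path: Path = Path("opencut/config.json"),
                            variant: str = "fast") -> dict:
    """Gate HiDream-I1 on the Meta Llama licence acknowledgement (sketch)."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    if not config.get("llama_ack"):
        # Meta Llama 3.1 Community Licence not yet accepted by the user.
        return {"available": False,
                "reason": "llama_gate_required",
                "gate_url": "https://hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct"}
    return {"available": True, "variant": variant, "steps": HIDREAM_VARIANTS[variant]}
```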
Goal: Add Qwen2.5-Omni for the first model that can both understand video (watching + listening) and generate audio narration as output simultaneously; and add Open-Sora 2.0 as the highest-quality Apache-2 T2V for creators who prioritise quality over hardware constraints.
New required deps: transformers update for Qwen2.5-Omni Qwen2_5OmniModel support
New routes: ~10
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| P3.1 | Qwen2.5-Omni multimodal video narrator (Alibaba Cloud, Apache-2, March 2025) — end-to-end multimodal model that accepts any combination of text, audio, images, and video clips as input and generates both text and natural speech audio as output. Unique OpenCut integration: POST /analyze/video/narrate accepts a video clip and a narration style (e.g. "documentary", "sports commentary", "educational explainer") and returns (a) a written commentary script, (b) a generated speech audio file voiced by Qwen2.5-Omni's Talker module. Also enables POST /analyze/video/qa for free-form questions about video content ("what is happening in this clip?" → spoken + written answer). Makes OpenCut the first local video editor with an AI that can watch a clip and narrate it (payload sketch below this table). | POST /analyze/video/narrate, POST /analyze/video/qa, GET /analyze/video/omni/info | core/multimodal_omni.py | transformers update for Qwen2_5OmniModel + Qwen2.5-Omni-7B weights (~14 GB, Apache-2) via HuggingFace | M | Apache-2 | First model that watches+listens AND narrates video; unique in video editor ecosystem; closes Adobe Firefly audio-to-video description gap |
| P3.2 | Open-Sora 2.0 high-quality T2V (hpcaitech, Apache-2, March 2025) — 11B T2V model (the largest Apache-2 T2V in the stack) that benchmarks on-par with HunyuanVideo 11B and Step-Video 30B on VBench + Human Preference. Generates 5-second 720×1280 video. Training code and all checkpoints are fully open-source ($200K training cost documented). Ships as /generate/opensora2/t2v — the quality-tier T2V for users with >18 GB VRAM who want the best possible output. Also includes Open-Sora 1.3 (1B) as a lightweight variant for users with <12 GB VRAM; both models share the same backend. | POST /generate/opensora2/t2v, POST /generate/opensora2/t2v/1b, GET /generate/opensora2/info | core/gen_video_opensora2.py | diffusers or Open-Sora native inference code (Apache-2); 11B weights (~22 GB); 1B weights (~3 GB) | L | Apache-2 | Highest-quality Apache-2 T2V available; equals HunyuanVideo quality without geographic licence restrictions |
| P3.3 | UXP panel v1.0 milestone — complete the CEP → UXP transition targeting full parity on all Wave L through P features in the UXP panel. CEP reaches end-of-life in September 2026 with Adobe Premiere Pro; all users will need the UXP panel before that date. Milestone checklist: (a) all async job routes render progress in the UXP job queue panel; (b) GPU topology + GGUF model status exposed in the UXP resource monitor; (c) all video preview/export buttons functional; (d) all audio/TTS playback controls in UXP. The CEP panel moves to "legacy support" status (security fixes only) after this release. | — (panel-only work, no new backend routes) | UXP panel codebase (panel-uxp/) | No new Python deps | L | N/A (platform code) | CEP EOL Sept 2026; all users must migrate; this is the last major UXP gap-fill before EOL |
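For orientation, an illustrative request/response shape for `POST /analyze/video/narrate` (P3.1); all field names are assumptions drawn from the row above, not a finalised schema:

```python
# Hypothetical payloads only; the shipped schema may differ.
narrate_request = {
    "video_b64": "<base64-encoded clip>",
    "style": "documentary",   # or "sports commentary", "educational explainer"
    "stream": True,           # stream Talker audio chunks over SSE
}

narrate_response = {
    "script": "Written commentary produced by the Thinker module.",
    "audio_b64": "<base64 speech rendered by the Talker module>",
    "language": "en",
}
```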
| Tool | Org | Licence | Category | Wave P Role |
|---|---|---|---|---|
| ConsisID | PKU-YuanGroup | Apache-2 | Identity T2V | P1.1 — Face reference → consistent-identity T2V |
| Allegro | rhymes-ai | Apache-2 | T2V + TI2V | P1.2 — 9.3 GB VRAM T2V; first+last frame conditioning |
| HiDream-I1 | HiDream.ai | MIT | T2I | P2.1 — SOTA 17B T2I; surpasses FLUX.1-dev on all benchmarks |
| HiDream-E1 | HiDream.ai | MIT | Image Editing | P2.2 — Instruction-based image editing companion to I1 |
| CogView4-6B | ZhipuAI/THUDM | Apache-2 | T2I | P2.3 — Bilingual (EN+ZH) T2I; 13 GB VRAM; no gated deps |
| Qwen2.5-Omni | Alibaba Cloud | Apache-2 | Multimodal | P3.1 — Video understanding + audio narration generation |
| Open-Sora 2.0 | hpcaitech | Apache-2 | T2V | P3.2 — 11B SOTA quality T2V; equals HunyuanVideo quality |
| DepthCrafter | Tencent | Custom (non-commercial) | Depth Estimation | NOT ADOPTED — academic/research/education only |
| VoiceCraft | jasonppy | CC BY-NC-SA 4.0 code + Coqui model | TTS/Speech Edit | NOT ADOPTED — both code and weights non-commercial |
| Open-Sora-Plan | PKU-YuanGroup | MIT | T2V | NOT ADOPTED — overlapping with Open-Sora 2.0 + ConsisID |
| Mochi-1 | Genmo | Apache-2 | T2V | Watch list (carry-over from Wave O) — 60 GB VRAM |
| Gap | Competitor | Wave P Feature | Closes? |
|---|---|---|---|
| Person-consistent video from reference photo | Runway Act-One, Kling AI Face Reference | P1.1 ConsisID | Y — face photo → identity-locked T2V locally |
| First-and-last-frame video interpolation | Runway, Kling Interpolation | P1.2 Allegro TI2V | Y — first + last frame → coherent transition video |
| T2V for 12 GB VRAM consumer GPUs | Cloud-only services | P1.2 Allegro (9.3 GB) | Y — RTX 3060 Ti / 4060 Ti now capable of T2V |
| SOTA T2I quality beyond FLUX.1 | Midjourney V7, Ideogram 3 | P2.1 HiDream-I1 | Y — best GenEval + DPG scores of any open model |
| Natural language image editing | Adobe Firefly, Photoshop Generative Fill | P2.2 HiDream-E1 | Y — instruction-based editing of any image locally |
| Chinese-language text-to-image | Cloud services (Tongyi Wanxiang) | P2.3 CogView4-6B | Y — native Chinese T2I; first in OpenCut stack |
| Video QA + spoken narration generation | OpenAI o1 Vision API | P3.1 Qwen2.5-Omni | Y — watch video + generate narration audio locally |
| SOTA T2V quality without HunyuanVideo licence | HunyuanVideo (restricted) | P3.2 Open-Sora 2.0 | Y — equal quality; Apache-2; no geographic restrictions |
- ConsisID face quality dependency: ConsisID performs best with a high-quality, front-facing face photograph at 512×512 or larger. Low-resolution or side-profile inputs reduce identity consistency significantly. Add a pre-flight face detection check using the existing SAM2 (N2.1) face detection or a lightweight `face_recognition` check; reject inputs with no detected face rather than silently producing poor results.
- ConsisID single-person only: The current model is trained for single-identity generation. Multi-person identity preservation is not yet reliable. Document this clearly in the `GET /generate/consisid/info` response as `max_identities: 1`. Monitor ConsisID for multi-identity updates in Wave Q.
- Allegro 15 FPS output: Allegro generates at 15 FPS natively. For 30 FPS delivery, use EMA-VFI (a frame interpolation model) as a post-processing step. Add an `--interpolate_fps: 30` option to `/generate/allegro/t2v` that pipelines EMA-VFI automatically. Note: the EMA-VFI licence must be checked before bundling — it is MIT, so this is safe.
- HiDream-I1 Llama-3.1 gate: Same situation as CSM-1B (Wave N3.4). The Meta Llama 3.1 Community Licence requires a HuggingFace access token and `meta-llama/Meta-Llama-3.1-8B-Instruct` model access. Use the same `llama_ack` mechanism already in `opencut/config.json` — if the user has already accepted for CSM-1B, reuse that acceptance for HiDream-I1. If not accepted, `GET /image/generate/hidream/info` returns `{available: false, reason: "llama_gate_required", gate_url: "https://hf.co/meta-llama/Meta-Llama-3.1-8B-Instruct"}`.
- HiDream-I1 weight size: 17B Sparse DiT (~17 GB) + Llama-3.1-8B (~16 GB) = ~33 GB combined. If the user already has Llama-3.1-8B installed for CSM-1B or YuE, avoid re-downloading: detect via an `HF_HOME` scan for `meta-llama/Meta-Llama-3.1-8B-Instruct`. In the pre-flight check, state `{llama_already_cached: true, total_new_download_gb: 17}`.
- CogView4 prompt style: CogView4 was trained on long synthetic descriptions; short prompts produce mediocre results. Add an automatic prompt expansion pass using Qwen2.5-VL (N3.2, already in the stack) before passing to CogView4; expose as an optional `?expand_prompt=true` parameter.
- Qwen2.5-Omni audio output rate: The Talker module generates audio in a streaming fashion — expose it via the existing SSE streaming infrastructure so the first audio chunk plays in the UXP panel before generation is complete. This gives a "real-time narration" feel for short clips.
- Open-Sora 2.0 11B VRAM: The 11B T2V requires approximately 22 GB VRAM — the highest hardware requirement of any Wave P model. Gate behind `check_opensora2_available()`, which returns `{available: false, reason: "insufficient_vram", required_gb: 22, detected_gb: N}` for cards below 20 GB (see the sketch after this list). Open-Sora 1.3 (1B) automatically becomes the default fallback.
- DepthCrafter non-commercial: DepthCrafter (Tencent) provides temporal depth estimation for video sequences — highly useful for depth-based video compositing. However, its licence explicitly restricts use to academic, research, and education purposes. Do not adopt. For depth estimation on static frames, Depth-Anything-V2 (Wave N2.1, Apache-2) already covers the use case. Monitor for a permissive re-release.
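A minimal sketch of the `check_opensora2_available()` gate described in the Open-Sora 2.0 note above; the 20 GB threshold and response fields follow that note, everything else is illustrative:

```python
import torch

def check_opensora2_available(required_gb: int = 22, min_gb: int = 20) -> dict:
    """VRAM gate for the 11B Open-Sora 2.0 backend (sketch)."""
    detected_gb = 0.0
    if torch.cuda.is_available():
        detected_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if detected_gb < min_gb:
        # Caller falls back to the Open-Sora 1.3 (1B) variant automatically.
        return {"available": False,
                "reason": "insufficient_vram",
                "required_gb": required_gb,
                "detected_gb": round(detected_gb, 1)}
    return {"available": True, "detected_gb": round(detected_gb, 1)}
```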
| Release | ETA | Features |
|---|---|---|
| v1.45.0 | 2027-Q4 | P1.1 ConsisID, P1.2 Allegro T2V+TI2V |
| v1.46.0 | 2028-Q1 | P2.1 HiDream-I1, P2.2 HiDream-E1, P2.3 CogView4-6B |
| v1.47.0 | 2028-Q1 | P3.1 Qwen2.5-Omni, P3.2 Open-Sora 2.0, P3.3 UXP panel v1.0 |
- DepthCrafter (Tencent, 2024) — custom Tencent licence; explicitly non-commercial ("only for academic, research and education purposes"). NOT viable for OpenCut. The depth-estimation use case is covered by Depth-Anything-V2 (Wave N2.1, Apache-2) for static frames. Monitor for a permissive re-licence.
- VoiceCraft (jasonppy, 2024) — CC BY-NC-SA 4.0 code AND Coqui Public Model Licence weights; both code and model are non-commercial. NOT adopted. The zero-shot speech editing use case (edit a spoken word in an existing audio clip without re-recording) has no current Apache-2 equivalent in the stack; add to watch list for Wave Q.
- Open-Sora-Plan (PKU-YuanGroup, 2024) — MIT ✓ for code, but the feature set is now superseded by Open-Sora 2.0 (better quality, same licence category) and ConsisID (identity-preserving variant, same lab). Not adopted to avoid a redundant video generation backend.
- Mochi-1 (Genmo, Nov 2024) — Apache-2 ✓ (carry-over from Wave O watch list). 60 GB single-GPU VRAM requirement remains a hard blocker. With the O3.3 multi-GPU scheduler in place, evaluate for Wave Q using 2× A100 or 4× 3090 multi-GPU inference.
- Step-Video (Kuaishou, 2025) — Apache-2 for code; 30B model. Could not confirm a clean Apache-2 checkpoint for the full model weights. Monitor for HuggingFace release with confirmed permissive weights licence.
- SkyReels (ByteDance, 2025) — repository not found at anticipated locations under the `bytedance/` org. Monitor; ByteDance has a pattern of releasing strong T2V models with restrictive commercial licences.
- Zero-shot speech editing (in-audio word replacement) — VoiceCraft is the best-known OSS approach (non-commercial ❌). Examine `audiocraft/audiogen` (already excluded for weights) and community re-training efforts. The use case is high-value for podcast editing but has no Apache-2/MIT-viable model in 2026. Reserve as a Wave Q P0 if a permissive model emerges.
- ConsisID — PKU-YuanGroup/ConsisID (Apache-2, Nov 2024); CVPR 2025 Highlight; arXiv:2411.17440; diffusers 0.33.0+; Windows one-click installer by community
- Allegro — rhymes-ai/Allegro (Apache-2, Oct 2024); 2.8B DiT; Allegro-TI2V for first/last-frame conditioning; arXiv:2410.15458; diffusers 0.32.0+
- HiDream-I1 — HiDream-ai/HiDream-I1 (MIT, April 2025); 17B Sparse DiT; arXiv:2505.22705; Full/Dev/Fast variants; requires `meta-llama/Meta-Llama-3.1-8B-Instruct` (gated)
- HiDream-E1 — HiDream-ai/HiDream-E1 (MIT, July 2025); instruction-based image editing; companion to HiDream-I1; HuggingFace Space demo available
- CogView4-6B — THUDM/CogView4 (Apache-2, March 2025); 6B parameters; bilingual English + Chinese; `CogView4Pipeline` in diffusers; 13 GB VRAM with int4 text encoder; CogKit fine-tuning support
- Qwen2.5-Omni — QwenLM/Qwen2.5-Omni (Apache-2, March 2025); Thinker + Talker dual-module architecture; video + audio + text in; text + audio out; `Qwen2_5OmniModel` in transformers
- Open-Sora 2.0 — hpcaitech/Open-Sora (Apache-2); v2.0 (11B, March 2025); arXiv:2503.09642; 5s 720×1280; on-par with HunyuanVideo 11B + Step-Video 30B; Open-Sora 1.3 (1B) for consumer hardware
- DepthCrafter — Tencent/DepthCrafter (Custom non-commercial, 2024); temporal video depth estimation; NOT adopted — academic/research/education only
- VoiceCraft — jasonppy/VoiceCraft (CC BY-NC-SA 4.0 code + Coqui model licence, 2024); zero-shot speech editing and TTS; NOT adopted — non-commercial
Updated: 2026-04-24
Baseline: v1.47.0 (post-Wave P; UXP panel v1.0, Open-Sora 2.0, Qwen2.5-Omni, CogView4-6B, HiDream-I1/E1, ConsisID, Allegro all shipped)
Research pass: April 2026 GitHub OSS survey — VACE, CosyVoice 2.0, MaskGCT (Amphion), Vevo (Amphion), OmniGen2, SkyReels V2, SkyReels V3, IndexTTS2, FireRedTTS, Stable Virtual Camera
This wave introduces three capability clusters:
- VACE all-in-one video compositing — the single most impactful video editing feature missing from OpenCut: compose, move, swap, reference, expand, and animate elements inside an existing video using a mask + prompt. Uses Wan2.1-VACE backend. Closes Adobe Firefly's video compositing feature gap.
- Voice generation upgrade — replaces and supplements the Wave L/M/N TTS tier: CosyVoice 2.0 adds 9-language + 18 Chinese dialect zero-shot voice cloning at 150 ms streaming latency; MaskGCT from the Amphion toolkit adds a parallel (non-autoregressive) TTS path for the fastest inference speed of any model in the stack.
- Multi-reference image synthesis + infinite-length video — OmniGen2 closes the "combine multiple reference images into one coherent output" gap (Kling/Runway's "actor swap" workflow); SkyReels V2 closes the infinite-length temporal video generation gap using Diffusion Forcing autoregressive architecture.
Goal: Expose the full Wan2.1-VACE compositing capability as a first-class feature in OpenCut: mask-based inpainting, subject replacement, scene expansion, reference-object insertion, depth-based re-animation, and background swap. One model handles all six compositing workflows.
New required deps: wan (Wan2.1 native inference — git-installable, Apache-2; or diffusers if VACE is integrated there)
New routes: ~18
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| Q1.1 | VACE all-in-one video compositing (ali-vilab/VACE, Apache-2, ICCV 2025) — a single model (Wan2.1-VACE-14B or 1.3B) that performs all of: (a) V2V — style + appearance editing of an existing video via text prompt; (b) MV2V (masked V2V) — mask + prompt to edit a specific region while keeping the rest intact; (c) R2V (reference-to-video) — insert a specific object or person from a reference image into a video. Named editing workflows all built on these three modes: Move-Anything (mask subject → new position), Swap-Anything (mask subject → new subject from text/image), Reference-Anything (insert a reference image object into a video), Expand-Anything (extend video spatial canvas outward), Animate-Anything (give motion to a still object in-place). Ship as a compositing tool in the OpenCut timeline: the user paints a mask on a clip, picks a workflow, enters a prompt or reference image, and VACE renders a new video segment (mask helper sketch below this table). VACE-Wan2.1-1.3B (480×832) as default; Wan2.1-VACE-14B (720×1280) as quality mode. | POST /compose/vace/v2v, POST /compose/vace/mv2v, POST /compose/vace/r2v, GET /compose/vace/preprocess/{task}, GET /compose/vace/info | core/video_compose_vace.py + core/vace_preprocess.py | pip install wan@git+https://github.com/Wan-Video/Wan2.1 (Apache-2) + VACE-Wan2.1-1.3B weights (~5 GB) + optional Wan2.1-VACE-14B (~28 GB) | L | Apache-2 | Closes entire Adobe Firefly video compositing gap locally; most compositing workflows require no extra model beyond one VACE checkpoint |
| Q1.2 | VACE preprocessor toolkit — VACE's V2V/MV2V tasks require mask + depth/flow/edge preprocessors. Wrap VACE-Annotators (ali-vilab, Apache-2) as a pre-flight step: auto-detect the task type and run the appropriate annotator (depth map for depth task, optical flow for motion task, edge map for edge task) before calling the VACE inference backend. Expose as a preprocess step in the compositing UI so the user sees a preview of the extracted map before sending to VACE. | POST /compose/vace/preprocess/depth, POST /compose/vace/preprocess/flow, POST /compose/vace/preprocess/edge | core/vace_preprocess.py | VACE-Annotators (~2 GB) — auto-downloaded from HuggingFace ali-vilab/VACE-Annotators | S | Apache-2 | Required preprocessing for VACE depth/flow/edge modes; without this, those modes require manual map generation |
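A small helper sketch for the masked (MV2V) workflow above: VACE expects a mask video matching the clip's frame count, so a single painted mask frame is replicated across frames. The function name is illustrative, and the mask convention (white = edit, black = preserve) follows the integration notes later in this wave:

```python
import numpy as np

def replicate_mask(mask_frame: np.ndarray, n_frames: int) -> np.ndarray:
    """Turn one (H, W) uint8 mask frame into an (n_frames, H, W) static mask video."""
    if mask_frame.ndim != 2:
        raise ValueError("expected a single-channel (H, W) mask image")
    binary = (mask_frame > 127).astype(np.uint8) * 255  # 255 = edit region, 0 = preserve
    return np.repeat(binary[None, ...], n_frames, axis=0)
```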
Goal: Upgrade the TTS/voice-cloning tier with two high-quality models that together cover 9 languages + 18 Chinese dialects (CosyVoice 2.0) and the fastest parallel TTS inference path of any model in the stack (MaskGCT).
New required deps: CosyVoice 2.0 requires WeTextProcessing; MaskGCT requires phonemizer + espeak-ng
New routes: ~12
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| Q2.1 | CosyVoice 2.0 / Fun-CosyVoice 3.0 multilingual TTS (FunAudioLLM/CosyVoice, Apache-2, December 2024 / December 2025) — 0.5B multilingual TTS model covering 9 languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Cantonese, Sichuan, Shanghainese, Minnan, and more) with zero-shot voice cloning. Key features over existing Wave TTS models: (a) bi-streaming: text-in streaming + audio-out streaming with 150 ms end-to-end latency — ship as an SSE streaming endpoint so the first audio chunk plays before generation completes; (b) instruction-based generation: {text: "Hello.", instruction: "speak with a warm, friendly smile"} controls prosody, speed, volume, emotion and dialect; (c) zero-shot voice cloning with a 3-second reference audio sample. Fun-CosyVoice 3.0 (December 2025) extends with GRPO reinforcement learning training and improved content accuracy and speaker similarity scores (streaming sketch below this table). | POST /tts/cosyvoice, POST /tts/cosyvoice/stream (SSE), POST /tts/cosyvoice/clone, GET /tts/cosyvoice/info | core/tts_cosyvoice.py | CosyVoice2 repo (pip install -r requirements.txt) + CosyVoice2-0.5B weights (~1 GB) via HuggingFace FunAudioLLM/CosyVoice2-0.5B | M | Apache-2 | 9-language + 18-dialect coverage; fastest TTS streaming (150ms); unique Chinese dialect control; closes multilingual dubbing gap |
| Q2.2 | MaskGCT zero-shot parallel TTS (open-mmlab/Amphion, MIT, October 2024) — fully non-autoregressive TTS model from the Amphion toolkit. Trained on 100K hours of in-the-wild speech (the Emilia dataset). In benchmarks it outperforms SOTA autoregressive TTS systems (including VALL-E 2, VoiceLM) on naturalness, speaker similarity, and intelligibility at significantly faster inference speed — parallel generation means inference time scales with sequence length but not linearly. Zero-shot: provide a 3-second reference audio and synthesize new text in the same voice + style. Ships as POST /tts/maskgct with {text: "...", reference_audio_b64: "..."}. Complements CosyVoice 2.0: CosyVoice is used for multilingual + streaming; MaskGCT is used for the fastest single-request batch TTS. | POST /tts/maskgct, POST /tts/maskgct/clone, GET /tts/maskgct/info | core/tts_maskgct.py | Amphion repo (pip install amphion or clone + install) + MaskGCT weights (~5 GB) via HuggingFace amphion/maskgct | M | MIT (Amphion toolkit) | Fastest parallel TTS; 100K hours training; outperforms autoregressive systems on all three key metrics |
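A hedged sketch of the Q2.1 SSE streaming endpoint; `synthesize_stream()` stands in for the real CosyVoice bi-streaming API (an assumption), and the route/field names mirror the row above rather than a finalised implementation:

```python
import base64
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class CosyVoiceRequest(BaseModel):
    text: str
    instruction: str | None = None   # e.g. "speak with a warm, friendly smile"

async def synthesize_stream(req: CosyVoiceRequest):
    """Placeholder async generator; the real implementation wraps CosyVoice 2.0."""
    yield b""  # stub audio chunk

@app.post("/tts/cosyvoice/stream")
async def cosyvoice_stream(req: CosyVoiceRequest):
    async def sse():
        # Emit each audio chunk as a base64-encoded SSE data event.
        async for chunk in synthesize_stream(req):
            yield f"data: {base64.b64encode(chunk).decode()}\n\n"
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(sse(), media_type="text/event-stream")
```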
Goal: Add OmniGen2 for multi-reference in-context generation (combine 2–4 reference images of different people/objects into one coherent output — the "actor swap" / "product placement" workflow); and SkyReels V2 for true infinite-length video generation using Diffusion Forcing architecture (generate 30-second+ videos without temporal drift).
New required deps: OmniGen2 requires qwen-vl-utils; SkyReels V2 diffusers integration
New routes: ~14
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| Q3.1 | OmniGen2 multi-reference in-context image generation (VectorSpaceLab/OmniGen2, Apache-2, June 2025) — multimodal generation model built on Qwen-VL-2.5 with two decoding pathways for text and image. Key capability for OpenCut: in-context generation — accepts 2–4 reference images (different people, different objects, different scenes) plus a text description, and generates a single coherent output that places all referenced subjects in the described scene. Example: [photo of person A] + [photo of person B] + "Person A and Person B shaking hands in an office" → one realistic image with both people. Also provides SOTA instruction-based image editing ("instruction-guided editing" mode). Fine-tunable on custom data. GGUF/TeaCache/TaylorSeer acceleration supported. | POST /image/generate/omnigen2, POST /image/edit/omnigen2, GET /image/generate/omnigen2/info | core/t2i_omnigen2.py | pip install omnigen2 (or clone + install) + OmniGen2 weights (~13 GB via HuggingFace OmniGen2/OmniGen2) + qwen-vl-utils | M | Apache-2 | SOTA multi-reference in-context generation; closes Kling/Runway "two actors in one scene" workflow gap entirely locally |
| Q3.2 | SkyReels V2 infinite-length T2V + I2V (SkyworkAI/SkyReels-V2, Skywork Community Licence — commercial allowed, April 2025) — 14B DiT video generation model implementing the Diffusion Forcing architecture for autoregressive infinite-length video generation. Unlike fixed-clip models (5–10s), SkyReels V2 generates clips of arbitrary length by overlapping temporal windows (17-frame overlap, ar_step=5 mode). Generates 720P at 24 FPS. Ships as both T2V and I2V (image-to-video). Built on the Wan2.1 VAE; diffusers SkyReelsV2DiffusionForcingPipeline available. Use case: generate 30-second+ B-roll sequences from a single text prompt; cinematic long-form video. Consumer option: 1.3B variant for 540P generation (duration-cap sketch below this table). | POST /generate/skyreels2/t2v, POST /generate/skyreels2/i2v, POST /generate/skyreels2/t2v/infinite, GET /generate/skyreels2/info | core/gen_video_skyreels2.py | diffusers (SkyReelsV2DiffusionForcingPipeline) + SkyReels-V2-DF-14B-720P weights (~28 GB) or 1.3B-540P (~5 GB) | L | Skywork Community Licence (commercial use allowed) | First infinite-length T2V in the stack; closes Adobe Stock long B-roll generation gap; 30-second+ coherent clips |
| Q3.3 | SkyReels V3 talking avatar (SkyworkAI/SkyReels-V3, Skywork Community Licence — commercial allowed, January 2026) — 19B A2V (audio-to-video) model generating a lifelike talking avatar: input is a portrait image + audio track (up to 200 seconds) + optional text prompt describing expression/scene; output is a video of that person speaking with accurate lip sync and natural head movement. Built for long-form content: news reports, training videos, dubbing, virtual spokespersons. Supports Chinese, English, Korean, singing, and fast dialogue. Also includes V3-R2V (Reference-to-Video): 1–4 reference images → video preserving all subjects. | POST /generate/skyreels3/avatar, POST /generate/skyreels3/r2v, GET /generate/skyreels3/info | core/gen_video_skyreels3.py | SkyReels V3 repo + SkyReels-V3-A2V-19B weights (~38 GB) + SkyReels-V3-R2V-14B (~28 GB); --offload required for consumer hardware | L | Skywork Community Licence (commercial use allowed; gated HF download) | First high-quality talking avatar model in the stack; closes Adobe Podcast video production gap; 200-second audio support |
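A sketch of the VRAM-based duration cap for SkyReels V2 infinite generation (Q3.2); the 30s/60s thresholds follow the integration note later in this wave, and the helper name is illustrative:

```python
import torch

def skyreels2_max_duration_seconds() -> int:
    """Cap infinite-generation length by detected VRAM (sketch, not shipped logic)."""
    if not torch.cuda.is_available():
        return 0
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 40:
        return 60
    if vram_gb >= 24:
        return 30
    return 0   # below 24 GB: route to the 1.3B-540P variant or reject the request
```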
| Tool | Org | Licence | Category | Wave Q Role |
|---|---|---|---|---|
| VACE (Wan2.1-VACE) | ali-vilab / Wan-AI | Apache-2 | Video Compositing | Q1.1 — all-in-one V2V/MV2V/R2V compositing suite |
| VACE-Annotators | ali-vilab | Apache-2 | Preprocessors | Q1.2 — depth/flow/edge preprocessors for VACE tasks |
| CosyVoice 2.0 / Fun-CosyVoice 3.0 | FunAudioLLM | Apache-2 | TTS | Q2.1 — 9-language + 18-dialect multilingual streaming TTS |
| MaskGCT | open-mmlab/Amphion | MIT | TTS | Q2.2 — zero-shot parallel (non-AR) TTS; fastest inference |
| OmniGen2 | VectorSpaceLab | Apache-2 | T2I + Editing | Q3.1 — multi-reference in-context image generation; SOTA instruction editing |
| SkyReels V2 | SkyworkAI | Skywork CL (commercial ✓) | T2V + I2V | Q3.2 — infinite-length T2V/I2V via Diffusion Forcing; 720P 24 FPS |
| SkyReels V3 | SkyworkAI | Skywork CL (commercial ✓, gated) | A2V + R2V | Q3.3 — talking avatar (200s audio); multi-reference video synthesis |
| IndexTTS2 | bilibili / IndexTeam | Custom (contact for commercial) | TTS | NOT ADOPTED — commercial contact required; no clear Apache-2/MIT grant |
| FireRedTTS | FireRedTeam | MPL 2.0 | TTS | NOT ADOPTED — MPL 2.0 file-level copyleft; complex licence situation |
| Stable Virtual Camera | Stability AI | Gated (custom, HF auth required) | Novel View Synthesis | NOT ADOPTED — gated HF access; Stability AI custom licence (likely non-commercial) |
| Vevo / Vevo2 | open-mmlab/Amphion | MIT | Voice Conversion | WATCH LIST — MIT ✓; voice timbre + style + emotion conversion; defer to Wave R if voice conversion demand grows |
| Gap | Competitor | Wave Q Feature | Closes? |
|---|---|---|---|
| Video region compositing (mask + edit) | Adobe Firefly (Generative Fill for video), RunwayML Gen-3 inpainting | Q1.1 VACE MV2V | Y — mask any region, edit with text prompt, preserve rest of video |
| Move a specific object to a new position in video | Runway "Move Brush" | Q1.1 VACE Move-Anything | Y — mask subject, prompt new position |
| Replace object/person in existing video | Runway "Swap Brush", Pika Effects | Q1.1 VACE Swap-Anything | Y — mask region, swap with any text/image |
| Insert a reference-image object into a video | Adobe After Effects compositing | Q1.1 VACE Reference-Anything | Y — reference image + mask zone → coherent insertion |
| Chinese / multilingual TTS with dialect control | ElevenLabs multilingual, 11Labs, Kling | Q2.1 CosyVoice 2.0 | Y — 9 languages + 18 Chinese dialects; locally |
| Sub-200ms TTS streaming latency | ElevenLabs streaming | Q2.1 CosyVoice SSE streaming | Y — 150 ms first-chunk SSE streaming |
| Non-autoregressive parallel TTS (fastest batch) | Commercial batch TTS APIs | Q2.2 MaskGCT | Y — parallel non-AR; fastest per-request inference |
| Multi-person image composition (2–4 reference subjects) | Midjourney "omni reference", Kling "multi-subject", Runway | Q3.1 OmniGen2 in-context | Y — 2–4 reference images → coherent composite scene |
| Infinite-length video (>10s, no drift) | Runway Gen-3 Extend, Kling Continue | Q3.2 SkyReels V2 Diffusion Forcing | Y — 30–60s+ video with temporal coherence, 720P 24 FPS |
| Talking avatar from portrait + audio (long-form) | HeyGen, Synthesia, Runway Act-Two | Q3.3 SkyReels V3 A2V | Y — 200-second audio → lifelike portrait animation; local inference |
- VACE mask format: VACE's MV2V tasks require a binary mask video (same temporal length as the input video, white = edit region, black = preserve). Add a `POST /compose/vace/mask/from_frame` helper that accepts a single-frame mask image and replicates it across N frames (see the mask-replication sketch after this list); expose a mask-painter layer in the UXP compositing panel that writes to this format.
- VACE task vs. mode: VACE's pipeline distinguishes "task" (what preprocessing to run: `depth`, `inpainting`, `flow`, `edge`, `reference`, etc.) from the inference "base" (`wan` vs. `ltx`). Always use `wan` as the base for production quality. VACE-LTX-Video-0.9 uses the RAIL-M licence — do not ship it; always use the Apache-2 `VACE-Wan2.1-1.3B` or `Wan2.1-VACE-14B` variants.
- CosyVoice 2.0 WeTextProcessing dep: CosyVoice requires `WeTextProcessing` (a Chinese text normalization library) for Chinese input. It pulls in additional C++ extensions. If `WeTextProcessing` fails to install on Windows (common), fall back to `--use_tn=False` mode, which skips normalization (slightly lower Chinese punctuation accuracy). Document the fallback in `core/tts_cosyvoice.py`.
- CosyVoice streaming vs. batch: The SSE streaming endpoint (`/tts/cosyvoice/stream`) requires the CosyVoice bi-streaming mode and a persistent async generator. Use FastAPI's `StreamingResponse` with an `asyncio` generator wrapping the CosyVoice streaming API (see the streaming sketch after this list). Test that the SSE connection is properly closed when the generator completes.
- MaskGCT espeak-ng on Windows: MaskGCT (via Amphion) uses `phonemizer`, which requires `espeak-ng` as a system dependency. On Windows, `espeak-ng` is not installable via pip — it requires a separate installer from espeak-ng.github.io. Add a pre-flight check: `shutil.which("espeak-ng")`. If absent, return `{available: false, reason: "espeak_ng_required", install_url: "https://espeak-ng.github.io/espeak-ng/"}` instead of crashing (see the pre-flight sketch after this list).
- OmniGen2 flash-attention optional: OmniGen2 explicitly states it works without `flash-attn` (as of the June 23, 2025 update). Do not add `flash-attn` as a hard requirement — leave it as an optional performance upgrade. This matters on Windows, where `flash-attn` compilation is difficult.
- OmniGen2 in-context vs. editing: OmniGen2 uses different prompt formats for in-context generation vs. instruction editing. In-context: images are injected inline as `<img>` tokens in the prompt. Instruction editing: the prompt describes the edit in plain text. Add a `mode` parameter to `POST /image/generate/omnigen2`: `mode: "t2i" | "in_context" | "edit"`.
- SkyReels V2 Skywork CL: Unlike Apache-2, the Skywork Community Licence requires that you read and comply with the Skywork Model Community License Agreement (PDF at github.com/SkyworkAI/Skywork/). Specifically: no misuse for illegal activities, no bypassing safety reviews for internet services. These are behavioural restrictions, not commercial restrictions. Document in a `SKYWORK_LICENCE_NOTICE.txt` shipped with OpenCut.
- SkyReels V2 infinite generation memory: Long video generation (60+ seconds) accumulates VRAM for the window context. With `overlap_history=17` and `offload=True`, generation is feasible on a 24 GB card but slow. Add a `max_duration_seconds` parameter that caps at 30 s for 24 GB cards and 60 s for 40 GB+ cards, auto-detected from `check_skyreels2_available()` (see the VRAM-cap sketch after this list).
- SkyReels V3 A2V gated download: SkyReels-V3 weights require HuggingFace authentication (`huggingface-cli login`). Use the same HF token mechanism as CSM-1B / HiDream-I1 / Open-Sora 2.0 — the user's `HF_TOKEN` env var or the `opencut/config.json` `hf_token` field. The Skywork licence does not restrict commercial use; the gate is for usage tracking only.
- SkyReels V3 A2V single portrait limitation: The A2V model (talking avatar) takes a single portrait image as input. Multi-person A2V is not yet supported. Document as `max_subjects: 1` in the info endpoint. For the R2V model, 1–4 reference images are supported (same pattern as OmniGen2 in-context).
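A minimal sketch of the `POST /compose/vace/mask/from_frame` helper described in the first note, assuming an OpenCV-based implementation. The `replicate_mask()` name, the mp4v container, and the binarization threshold are illustrative choices, not VACE requirements; a lossless codec or a PNG sequence may be preferable if the chosen encoder introduces grey values.

```python
# Single-frame mask image → N-frame binary mask video (white = edit, black = preserve).
# Sketch only: function name, codec, and threshold are illustrative.
import cv2

def replicate_mask(mask_image_path: str, out_path: str, n_frames: int, fps: float = 24.0) -> str:
    mask = cv2.imread(mask_image_path, cv2.IMREAD_GRAYSCALE)
    if mask is None:
        raise ValueError(f"could not read mask image: {mask_image_path}")
    # Force a strict binary mask so anti-aliased grey edges do not leak into the edit region.
    _, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    h, w = mask.shape
    frame = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    try:
        for _ in range(n_frames):
            writer.write(frame)  # identical mask on every frame
    finally:
        writer.release()
    return out_path
```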
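A minimal sketch of the `/tts/cosyvoice/stream` wiring with FastAPI's `StreamingResponse`, as the streaming note suggests. `cosyvoice_stream_chunks()` is a hypothetical stand-in for the CosyVoice bi-streaming call, and the base64 SSE framing and terminal `done` event are illustrative choices; the point is per-chunk streaming off the event loop and clean generator shutdown.

```python
import asyncio
import base64
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def cosyvoice_stream_chunks(text: str):
    """Placeholder for the CosyVoice bi-streaming API: a blocking generator of audio chunks (bytes)."""
    raise NotImplementedError("wire to the CosyVoice streaming inference call")

async def sse_audio_events(text: str):
    loop = asyncio.get_running_loop()
    gen = cosyvoice_stream_chunks(text)
    try:
        while True:
            # Pull the next chunk on a worker thread so the blocking generator
            # does not stall the event loop between chunks.
            chunk = await loop.run_in_executor(None, next, gen, None)
            if chunk is None:
                break
            yield f"data: {base64.b64encode(chunk).decode()}\n\n"
    finally:
        gen.close()
        # Sentinel so the client knows the stream finished (or was cancelled).
        yield "event: done\ndata: {}\n\n"

@app.post("/tts/cosyvoice/stream")
async def tts_stream(text: str):
    return StreamingResponse(sse_audio_events(text), media_type="text/event-stream")
```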
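A sketch of the espeak-ng pre-flight check from the MaskGCT note; the response shape mirrors the note, while the function name is illustrative.

```python
import shutil

def check_maskgct_available() -> dict:
    """Report MaskGCT availability instead of crashing when espeak-ng is missing."""
    if shutil.which("espeak-ng") is None:
        return {
            "available": False,
            "reason": "espeak_ng_required",
            "install_url": "https://espeak-ng.github.io/espeak-ng/",
        }
    return {"available": True}
```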
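A sketch of the VRAM-based `max_duration_seconds` cap from the SkyReels V2 note. The 30 s / 60 s thresholds follow the note; the torch-based VRAM probe and the standalone helper (rather than folding it into `check_skyreels2_available()`) are assumptions.

```python
import torch

def max_skyreels2_duration_seconds() -> int:
    """Cap infinite-mode generation length based on detected GPU memory."""
    if not torch.cuda.is_available():
        return 0
    vram_gb = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
    if vram_gb >= 40:
        return 60   # 40 GB+ cards: allow up to 60 s
    if vram_gb >= 24:
        return 30   # 24 GB cards: cap at 30 s (overlap_history=17, offload=True)
    return 0        # below 24 GB: infinite mode not supported
```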
| Release | ETA | Features |
|---|---|---|
| v1.48.0 | 2028-Q2 | Q1.1 VACE V2V/MV2V/R2V compositing, Q1.2 VACE Annotators preprocessors |
| v1.49.0 | 2028-Q2 | Q2.1 CosyVoice 2.0 + streaming, Q2.2 MaskGCT parallel TTS |
| v1.50.0 | 2028-Q3 | Q3.1 OmniGen2 multi-reference in-context, Q3.2 SkyReels V2 infinite T2V |
| v1.51.0 | 2028-Q3 | Q3.3 SkyReels V3 talking avatar + multi-reference video synthesis |
- IndexTTS2 (index-tts/index-tts, bilibili/IndexTeam, September 2025) — Zero-shot autoregressive TTS with emotion control and precise duration control. However, the README explicitly states "Please contact the authors for more detailed information. For commercial usage and cooperation, please contact [indexspeech@bilibili.com]." No clear Apache-2/MIT grant in the licence file. NOT adopted until a clean permissive licence is confirmed. Monitor for a licence update.
- FireRedTTS / FireRedTTS-1S (FireRedTeam/FireRedTTS, Mozilla Public License 2.0) — Streamable foundation TTS with flow-matching decoder. MPL 2.0 is technically a commercial-use-allowed licence, but it is file-level copyleft: any modifications to MPL-licensed files must remain MPL-licensed. Since OpenCut would use it as a dependency without modifying its source files, this is technically fine; however, MPL-2.0 integration requires a legal review before shipping as part of a commercial product. Defer to Wave R pending a legal confirmation pass. The TTS space is now well-covered by CosyVoice (Apache-2) and MaskGCT (MIT).
- Stable Virtual Camera / SEVA (Stability-AI/stable-virtual-camera, March 2025) — Generalist novel view synthesis model; 1.3B. Weights are gated (requires HuggingFace login + form submission to `stabilityai/stable-virtual-camera`). Licence is not Apache-2 — the Stability AI Community Licence is non-commercial for weights. NOT adopted. The novel view synthesis use case (generate camera fly-arounds from static images) has no current Apache-2 open model; monitor for a permissive alternative.
- Vevo / Vevo2 (open-mmlab/Amphion, MIT, 2025/2026) — Zero-shot voice conversion: timbre, style, accent, and emotion transfer from a reference audio clip. MIT ✓ and high quality, but voice conversion (changing one person's voice to sound like another) is already partially covered by CosyVoice 2.0 zero-shot cloning in Q2.1. Voice conversion is a niche use case within OpenCut's primary workflow. Add to WATCH LIST — if user demand for voice timbre transfer grows post-Wave Q, adopt Vevo2 as a Wave R add-on.
- SkyCaptioner-V1 (SkyworkAI, Skywork CL) — Video captioning model shipped alongside SkyReels V2. Generates detailed text descriptions of video clips. Functionality already partly covered by Qwen2.5-VL (Wave N3.2) and Qwen2.5-Omni (Wave P3.1). Evaluate in Wave R if more targeted video captioning (structured caption format for T2V prompt generation) is needed.
- EzAudio — Apache-2 text-to-audio; ICCV 2025. Defer to Wave R; text-to-audio (foley effects) is an unaddressed gap that merits its own Wave R feature slot with proper research into 2025–2026 SOTA options.
- Step-Video (Kuaishou/KwaiVGI) — 30B T2V; Apache-2 for code. Still cannot confirm the weight licence for the full 30B checkpoint is clearly permissive. Monitor HuggingFace for a clean Apache-2 full-model release. If confirmed, Step-Video would join Open-Sora 2.0 and SkyReels V2 as the highest-quality T2V tier.
- VACE — ali-vilab/VACE (Apache-2, March 2025); ICCV 2025; Wan2.1-VACE-1.3B (480×832, Apache-2) + Wan2.1-VACE-14B (720×1280, Apache-2) at HuggingFace `Wan-AI/Wan2.1-VACE-1.3B` and `Wan-AI/Wan2.1-VACE-14B`; VACE-Annotators for preprocessing
- CosyVoice 2.0 — FunAudioLLM/CosyVoice (Apache-2, Dec 2024); `CosyVoice2-0.5B` at HuggingFace `FunAudioLLM/CosyVoice2-0.5B`; bi-streaming 150 ms latency; Fun-CosyVoice3-0.5B-2512 (Dec 2025) is the current best version
- MaskGCT — open-mmlab/Amphion (MIT); `amphion/maskgct` on HuggingFace; arXiv:2409.00750; fully non-autoregressive TTS; 100K hours Emilia training data
- OmniGen2 — VectorSpaceLab/OmniGen2 (Apache-2, June 2025); arXiv:2506.18871; `OmniGen2/OmniGen2` on HuggingFace; TeaCache/TaylorSeer acceleration; ComfyUI official support
- SkyReels V2 — SkyworkAI/SkyReels-V2 (Skywork Community Licence — commercial allowed, April 2025); arXiv:2504.13074; Diffusion Forcing architecture for infinite-length video; diffusers `SkyReelsV2DiffusionForcingPipeline`; 14B (720P) + 1.3B (540P) variants
- SkyReels V3 — SkyworkAI/SkyReels-V3 (Skywork Community Licence — commercial allowed, January 2026); arXiv:2601.17323; SkyReels-V3-R2V-14B (1–4 reference images → video) + SkyReels-V3-V2V-14B (video extension) + SkyReels-V3-A2V-19B (talking avatar, 200 s audio)
- IndexTTS2 — index-tts/index-tts (bilibili); arXiv:2506.21619; Sept 2025; contact required for commercial use — NOT adopted
- FireRedTTS-1S — FireRedTeam/FireRedTTS (MPL 2.0); arXiv:2503.20499; streamable TTS with flow-matching — NOT adopted (MPL requires legal review)
- Stable Virtual Camera — Stability-AI/stable-virtual-camera; gated HF weights; custom Stability AI licence — NOT adopted (non-commercial weights)
- Vevo — open-mmlab/Amphion (MIT); voice conversion framework — WATCH LIST
- Vevo2 — open-mmlab/Amphion (MIT, March 2026); unified speech + singing voice generation — WATCH LIST
Baseline: v1.51.0 (post-Wave Q) Goal: Close the last major audio gap (foley / sound-effects synthesis), add audio-driven lip sync for talking-head and dubbing workflows, bolt trajectory-based camera control onto existing Wan/CogVideoX video models, and surface Mochi-1 (10 B, consumer-GPU) plus Step-Video-T2V-Turbo (30 B, HPC) as high-fidelity long-video T2V options.
| # | Feature | Route(s) | Module | Dep | Effort | Licence | Why it matters |
|---|---|---|---|---|---|---|---|
| R1.1 | EzAudio T2A sound-effects generation | POST /generate/ezaudio | services/ezaudio_service.py | OpenSound/EzAudio (s3_xl) | M | MIT | First dedicated foley/SFX engine in the stack; fills gap deferred since Wave L |
| R1.2 | EzAudio audio inpainting (mask-region edit) | POST /edit/audio/inpaint | services/ezaudio_service.py | EzAudio s3_xl | S | MIT | Replace a 5 s clip mid-track without re-generating the whole file |
| R1.3 | EzAudio ControlNet reference-audio conditioning | POST /generate/ezaudio_controlnet | services/ezaudio_service.py | EzAudio ControlNet ckpt (energy model) | M | MIT | Match energy/timbre of an existing reference; critical for sound-design consistency |
| R2.1 | MuseTalk 1.5 audio-driven lip sync | POST /animate/lip_sync | services/musetalk_service.py | TMElyralab/MuseTalk (MIT code / CreativeML-OpenRAIL-M weights) | L | MIT / CreativeML-OpenRAIL-M | Real-time 30 fps+ lip sync; enables AI-avatar dubbing and multi-language post-sync |
| R2.2 | VideoX-Fun camera-control I2V (trajectory) | POST /generate/videox_fun/camera | services/videox_fun_service.py | aigc-apps/VideoX-Fun, Wan2.1-Fun-14B-Control | L | Apache-2 | Pan / zoom / orbit trajectories on Wan2.1-Fun; plugs the camera-path gap |
| R2.3 | VideoX-Fun structural control I2V (Canny / Depth / Pose / MLSD) | POST /generate/videox_fun/control | services/videox_fun_service.py | VideoX-Fun control ckpts | M | Apache-2 | ControlNet-style motion conditioning; complements VACE (Wave Q1) with lighter-weight control signals |
| R3.1 | Mochi-1 consumer T2V | POST /generate/mochi | services/mochi_service.py | genmo/mochi-1-preview (Apache-2) | L | Apache-2 | 10 B; best open motion fidelity; --cpu_offload path for 16 GB VRAM; LoRA fine-tune support |
| R3.2 | Step-Video-T2V-Turbo HPC T2V | POST /generate/stepvideo | services/stepvideo_service.py | stepfun-ai/stepvideo-t2v-turbo (MIT) | XL | MIT | 30 B; 204 frames (≈ 6.8 s @ 30 fps); bilingual EN+ZH; 10–15 step turbo; gates behind STEPVIDEO_ENABLED |
| R3.3 | Step-Video-Ti2V (image-to-video companion) | POST /generate/stepvideo/i2v | services/stepvideo_service.py | stepfun-ai/stepvideo-ti2v (MIT) | M | MIT | Same 30 B engine, image-conditioned; reuses service scaffolding from R3.2 |
| Project | Repo | Licence | Est. Stars | VRAM | Status |
|---|---|---|---|---|---|
| EzAudio | haidog-yaqub/EzAudio | MIT ✓ | ~800 | ~6–8 GB | Interspeech 2025 oral; diffusers-compatible |
| MuseTalk 1.5 | TMElyralab/MuseTalk | MIT code / CreativeML-OpenRAIL-M weights | ~10 K | ~4 GB | v1.5 March 2025; training code open April 2025 |
| VideoX-Fun (Wan2.1-Fun) | aigc-apps/VideoX-Fun | Apache-2 ✓ | ~2 K | 12–24 GB | Camera control + Wan2.2 support Oct 2025 |
| Mochi-1 | genmoai/mochi | Apache-2 ✓ | ~5 K | ~30 GB (20 GB w/ offload) | Nov 2024; ComfyUI consumer-GPU support Nov 2024 |
| Step-Video-T2V-Turbo | stepfun-ai/Step-Video-T2V | MIT ✓ (code + weights) | ~2 K | ~80 GB (4× 80 GB GPU) | Feb 2025; DiffSynth-Studio quantization path available |
| Step-Video-Ti2V | stepfun-ai/Step-Video-Ti2V | MIT ✓ | ~500 | ~80 GB (4× 80 GB GPU) | March 2025; I2V companion to T2V |
| AudioGen (Meta) | facebookresearch/audiocraft | CC-BY-NC-4.0 ❌ | ~21 K | — | Not adopted |
| AudioLDM2 | haoheliu/AudioLDM2 | CC-BY-NC-SA-4.0 ❌ | ~3 K | — | Not adopted |
| Stable Audio Open | Stability-AI/stable-audio-tools | Stability AI non-commercial ❌ | ~4 K | — | Not adopted |
| Capability | Pre-Wave R | Post-Wave R |
|---|---|---|
| Text-to-sound effects / foley | None | EzAudio (MIT, DiT, 44.1 kHz) |
| Audio inpainting (mid-clip edit) | None | EzAudio mask-based inpainting |
| Reference-audio conditioning | None | EzAudio ControlNet |
| Talking head / lip sync (real-time) | SkyReels V3 A2V (Wave Q3.3, offline) | + MuseTalk 1.5 (30 fps+ real-time, multi-language) |
| Camera-trajectory video generation | None | VideoX-Fun Wan2.1-Fun camera control |
| Structural control I2V (pose, depth) | VACE (Wave Q1) | + VideoX-Fun Canny / Depth / Pose / MLSD modes |
| High-motion-fidelity consumer T2V | Open-Sora 2.0, SkyReels V2 | + Mochi-1 (10 B, --cpu_offload, LoRA) |
| 30 B long-video HPC T2V | None | Step-Video-T2V-Turbo (204 frames, bilingual) |
| I2V from still image at HPC tier | None | Step-Video-Ti2V (same 30 B engine, image-conditioned) |
- EzAudio ControlNet checkpoint: separate from the base `s3_xl` model (`OpenSound/EzAudio-ControlNet`); both must be downloaded in `check_ezaudio_available()`.
- MuseTalk CreativeML-OpenRAIL-M weights: commercial use allowed; ship `MUSETALK_LICENCE_NOTICE.txt` alongside the weights. The licence prohibits generating misleading content, synthesizing the voice/face of a real person without consent, and creating harmful deepfakes. Add a consent acknowledgment checkbox to the lip-sync UI panel.
- MuseTalk face region 256×256: optimized for portrait-orientation close-up faces; wide-angle crowd shots degrade quality. Recommend face-crop preprocessing (`mediapipe` or `InsightFace`) before passing frames to MuseTalk (see the face-crop sketch after this list).
- MuseTalk WhisperTiny dependency: requires `openai/whisper` (MIT) — already installed if CSM-1B (Wave N) is active. Reuse the existing `whisper-tiny` download path.
- VideoX-Fun Wan2.1-Fun camera ckpt: requires `Wan2.1-Fun-14B-Control` (not the base Wan2.1 already in the stack from Wave M). Separate download; reuse the existing `WAN_MODEL_DIR` parent directory for storage locality.
- Mochi-1 VRAM: the 10 B model needs ~30 GB unquantized; `--cpu_offload` reduces this to ~20 GB GPU + system RAM but increases generation time (~5 min per clip). Expose a "quality mode" vs. "memory mode" toggle in the UI (see the loading sketch after this list).
- Mochi-1 diffusers integration: uses `MochiPipeline` (diffusers ≥ 0.31); compatible with the diffusers version required by the other pipeline models already in the stack.
- Step-Video-T2V Linux-only text encoder: the `step_llm` text encoder uses CUDA kernels that only compile on Linux (sm_80 / sm_86 / sm_90 required). Gate behind the `STEPVIDEO_ENABLED=1` env flag; show a Linux-only warning in the UI when running on Windows or macOS (see the gate sketch after this list).
- Step-Video-T2V quantization: DiffSynth-Studio supports int8 quantization, reducing VRAM to approximately 2× 40 GB (two A100-40 GB). Add a `STEPVIDEO_QUANTIZE=int8` env option with a note that int8 reduces output quality slightly.
- Step-Video-Ti2V reuse: Step-Video-Ti2V shares the same DiT and VAE as T2V; the image condition is injected as an extra token. The R3.2 service setup covers R3.3 with minimal extra scaffolding.
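A sketch of the face-crop preprocessing step from the MuseTalk note, using mediapipe face detection as the note suggests. The padding factor, the largest-face heuristic, and the pass-through fallback are illustrative choices, not MuseTalk requirements.

```python
import cv2
import mediapipe as mp

def crop_face_for_musetalk(frame_bgr, pad: float = 0.4):
    """Return a padded square crop around the largest detected face, or the full frame if none found."""
    h, w = frame_bgr.shape[:2]
    with mp.solutions.face_detection.FaceDetection(model_selection=1,
                                                   min_detection_confidence=0.5) as det:
        results = det.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.detections:
        return frame_bgr  # no face found — pass the frame through unchanged
    # Pick the widest detection (relative coordinates) as the primary subject.
    box = max(results.detections,
              key=lambda d: d.location_data.relative_bounding_box.width).location_data.relative_bounding_box
    cx, cy = (box.xmin + box.width / 2) * w, (box.ymin + box.height / 2) * h
    side = max(box.width * w, box.height * h) * (1 + pad)
    x0, y0 = int(max(cx - side / 2, 0)), int(max(cy - side / 2, 0))
    x1, y1 = int(min(cx + side / 2, w)), int(min(cy + side / 2, h))
    return frame_bgr[y0:y1, x0:x1]
```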
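A sketch of the Mochi-1 "quality mode" vs. "memory mode" split on top of diffusers' `MochiPipeline`. The pipeline class and offload calls come from the diffusers Mochi integration referenced above; the mode switch itself and the dtype choice are illustrative OpenCut-side decisions.

```python
import torch
from diffusers import MochiPipeline

def load_mochi(memory_mode: bool = True) -> MochiPipeline:
    pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", torch_dtype=torch.bfloat16)
    if memory_mode:
        # "memory mode": stream weights through the GPU instead of keeping the full 10 B resident (~20 GB).
        pipe.enable_model_cpu_offload()
        pipe.enable_vae_tiling()
    else:
        # "quality mode": keep everything on-GPU (needs ~30 GB VRAM) for faster generation.
        pipe.to("cuda")
    return pipe
```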
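A sketch of the `STEPVIDEO_ENABLED` / Linux-only gate from the Step-Video note; the env flag and platform restriction come from the note, while the function name and response shape are illustrative.

```python
import os
import platform

def check_stepvideo_available() -> dict:
    """Step-Video's step_llm text encoder only builds on Linux; also require explicit opt-in."""
    if os.environ.get("STEPVIDEO_ENABLED") != "1":
        return {"available": False, "reason": "stepvideo_disabled"}
    if platform.system() != "Linux":
        return {"available": False, "reason": "linux_only_text_encoder"}
    return {"available": True}
```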
| Version | Deliverable |
|---|---|
| v1.52.0 | R1: EzAudio T2A + inpainting + ControlNet endpoints; check_ezaudio_available() guard; 44.1 kHz output |
| v1.53.0 | R2a: MuseTalk 1.5 lip sync endpoint; consent UI; MUSETALK_LICENCE_NOTICE.txt |
| v1.54.0 | R2b: VideoX-Fun camera-control I2V + control-signal I2V (Canny/Depth/Pose/MLSD) endpoints |
| v1.55.0 | R3: Mochi-1 T2V (--cpu_offload default on < 24 GB cards) + Step-Video-T2V-Turbo + Ti2V (STEPVIDEO_ENABLED, Linux-only) |
| Project | Reason |
|---|---|
| AudioGen (Meta / AudioCraft) | CC-BY-NC-4.0 ❌ |
| AudioLDM2 (HKUST) | CC-BY-NC-SA-4.0 ❌ |
| Stable Audio Open (Stability AI) | Stability AI non-commercial licence ❌ |
| DepthCrafter (Tencent) | Academic/research only ❌ (re-checked from Wave P survey) |
- EzAudio — haidog-yaqub/EzAudio (MIT); arXiv:2409.10819; `OpenSound/EzAudio` on HuggingFace; `OpenSound/EzAudio-ControlNet` for the reference-audio variant; Interspeech 2025 oral
- MuseTalk 1.5 — TMElyralab/MuseTalk (MIT code / CreativeML-OpenRAIL-M weights); arXiv:2410.10122; `TMElyralab/MuseTalk` on HuggingFace; v1.5 released March 2025; training code open April 2025
- VideoX-Fun — aigc-apps/VideoX-Fun (Apache-2); `alibaba-pai/Wan2.1-Fun-14B-Control` on HuggingFace; camera control + control-signal I2V; Wan2.2 + VACE support added Oct 2025
- Mochi-1 — genmoai/mochi (Apache-2); `genmo/mochi-1-preview` on HuggingFace; `--cpu_offload` consumer path; ComfyUI support Nov 2024; LoRA fine-tuning Nov 2024
- Step-Video-T2V — stepfun-ai/Step-Video-T2V (MIT); arXiv:2502.10248; `stepfun-ai/stepvideo-t2v` + `stepfun-ai/stepvideo-t2v-turbo` on HuggingFace; DiffSynth-Studio quantization path; released Feb 2025
- Step-Video-Ti2V — stepfun-ai/Step-Video-Ti2V (MIT); image-to-video companion; released March 2025