Zen-Dub-Live
Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing
Part of the Zen LM family - powering broadcast-grade AI dubbing
Powered by Zen Omni's Native End-to-End Architecture
Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true end-to-end speech-to-speech translation:
┌────────────────────────────────────────────────────────────┐
│                          ZEN OMNI                          │
├────────────────────────────────────────────────────────────┤
│  THINKER (Understanding)                                   │
│  ├── AuT Audio Encoder (650M)      → 12.5Hz token rate     │
│  ├── SigLIP2 Vision Encoder (540M) → lip reading, video    │
│  └── MoE LLM (48L, 128 experts)    → multimodal reasoning  │
│                             ▼                              │
│  TALKER (Speech Generation)                                │
│  ├── MoE Transformer (20L, 128 experts)                    │
│  ├── MTP Module       → 16-codebook prediction per frame   │
│  └── Code2Wav ConvNet → streaming 24kHz waveform           │
└────────────────────────────────────────────────────────────┘
Key: the entire pipeline is native. Audio understanding, translation, and speech synthesis all happen end-to-end; no separate ASR or TTS models are needed.
- First-packet latency: 234ms (audio) / 547ms (video)
- Built-in voices: cherry (female), noah (male)
- Languages: 119 text, 19 speech input, 2 speech output voices
See: Zen Omni Technical Report
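To sanity-check the first-packet numbers above, the streaming interface described under API Reference below can be timed directly. This is a minimal sketch: it assumes a session created as in Quick Start, and that chunk.dubbed_audio is empty until the first dubbed audio packet arrives.

import time

async def measure_first_packet(session):
    start = time.monotonic()
    await session.start()
    async for chunk in session.stream():
        if chunk.dubbed_audio:
            # Elapsed time until the first dubbed audio packet
            print(f"first dubbed audio after {(time.monotonic() - start) * 1000:.0f} ms")
            break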
Adding Custom Voices
Zen-Dub-Live supports voice cloning for anchor-specific voices:
from zen_dub_live import AnchorVoice
# Clone a voice from reference audio (10-30 seconds recommended)
custom_voice = AnchorVoice.from_audio(
"anchor_audio_sample.wav",
name="anchor_01"
)
# Register with a ZenDubLive pipeline (see Quick Start below for pipeline setup)
pipeline.register_voice(custom_voice)
# Use in session
session = await pipeline.create_session(
anchor_voice="anchor_01",
...
)
Voice profiles are stored as embeddings and can be saved/loaded:
# Save voice profile
custom_voice.save("voices/anchor_01.pt")
# Load voice profile
anchor_voice = AnchorVoice.load("voices/anchor_01.pt")
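A loaded profile can then be registered the same way as a freshly cloned one (the trailing session arguments mirror the earlier example):

pipeline.register_voice(anchor_voice)
session = await pipeline.create_session(anchor_voice="anchor_01", ...)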
Overview
Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speech, all under live broadcast latency constraints.
Key Specifications
| Attribute | Target |
|---|---|
| Latency | 2.5–3.5 seconds glass-to-glass |
| Video FPS | 30+ FPS at 256×256 face crops |
| Languages | English → Spanish (expandable) |
| Audio Quality | Anchor-specific voice preservation |
| Lip-Sync | LSE-D/LSE-C validated |
Architecture
┌────────────────────────────────────────────────────────────────────┐
│                       ZEN-DUB-LIVE PIPELINE                        │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                           ZEN-LIVE                           │  │
│  │  • WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)    │  │
│  │  • SDI/IP ingest (SMPTE 2110, NDI, RTMP, SRT)                │  │
│  │  • A/V sync with PTP reference                               │  │
│  │  • VAD-aware chunking + backpressure management              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                           ZEN OMNI                           │  │
│  │  • Multimodal ASR (audio + lip reading)                      │  │
│  │  • English → Spanish translation                             │  │
│  │  • Anchor-specific TTS                                       │  │
│  │  • Viseme/prosody generation                                 │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                           ZEN DUB                            │  │
│  │  • VAE latent-space face encoding                            │  │
│  │  • One-step U-Net lip inpainting                             │  │
│  │  • Identity-preserving composition                           │  │
│  │  • 30+ FPS real-time generation                              │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                 ▼                                  │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                      OUTPUT MULTIPLEXING                     │  │
│  │  • Dubbed video + audio composite                            │  │
│  │  • Fallback: audio-only dubbing                              │  │
│  │  • Distribution to downstream systems                        │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
Components
1. Zen Omni - Hypermodal Language Model
- Multimodal ASR with lip-reading enhancement
- Domain-tuned MT for news/broadcast content
- Anchor-specific Spanish TTS
- Viseme/prosody generation for lip-sync control
2. Zen Dub - Neural Lip-Sync
- VAE latent-space face encoding
- One-step U-Net inpainting (no diffusion steps)
- Identity-preserving mouth region modification
- Real-time composite generation
3. Hanzo Orchestration Layer
- Live SDI/IP feed ingest
- A/V synchronization with PTP
- VAD-aware semantic chunking
- Health monitoring and fallbacks
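Conceptually, each audio/video chunk flows through these three components in sequence. The sketch below is illustrative only: the class and method names (translate_speech, render, compose) are not part of the published API and stand in for whatever the orchestration layer actually calls.

def dub_chunk(chunk, omni, dub, mux):
    # 1. Zen Omni: multimodal ASR, translation, and anchor-voice TTS in one pass
    result = omni.translate_speech(chunk.audio, video=chunk.video, target_lang="es")
    # 2. Zen Dub: re-render the mouth region to match the translated speech
    frames = dub.render(chunk.video, result.speech, visemes=result.visemes)
    # 3. Output multiplexing: recombine dubbed audio and lip-synced frames
    return mux.compose(frames, result.speech)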
Quick Start
Installation
pip install zen-dub-live
Basic Usage
from zen_dub_live import ZenDubLive
# Initialize pipeline
pipeline = ZenDubLive(
translator="zenlm/zen-omni-30b-instruct",
lip_sync="zenlm/zen-dub",
target_lang="es",
latency_target=3.0,
)
# Process live stream
async def process_stream(input_url, output_url):
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    await session.start()

    # Pipeline runs until stopped
    await session.wait_for_completion()
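Because process_stream is a coroutine, it needs an event loop to drive it; for a standalone script something like the following works (the URLs are placeholders):

import asyncio

asyncio.run(process_stream(
    "rtmp://source.example.com/live",
    "rtmp://output.example.com/spanish",
))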
CLI Usage
# Start live dubbing session
zen-dub-live start \
--input rtmp://source.example.com/live \
--output rtmp://output.example.com/spanish \
--lang es \
--anchor-voice anchor_01
# Monitor session
zen-dub-live status --session-id abc123
# Stop session
zen-dub-live stop --session-id abc123
API Reference
Session Lifecycle
CreateSession
session = await pipeline.create_session(
input_url="rtmp://...",
output_url="rtmp://...",
target_lang="es",
anchor_voice="anchor_01",
latency_target=3.0,
)
StreamIngest (WebSocket/gRPC)
async for chunk in session.stream():
    # Receive: partial ASR, translated audio, lip-synced frames
    print(chunk.translation_text)
    yield chunk.dubbed_audio, chunk.lip_synced_frame
CommitOutput
await session.commit(segment_id) # Mark segment as stable for playout
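Putting StreamIngest and CommitOutput together, a receive loop could look like the sketch below. The chunk.segment_id field and the send_to_playout() hook are assumptions for illustration, not confirmed parts of the API.

async def run_playout(session):
    async for chunk in session.stream():
        # Partial translations arrive as soon as they are available
        if chunk.translation_text:
            print(chunk.translation_text)
        # Hand dubbed media to your playout system (hypothetical hook)
        send_to_playout(chunk.dubbed_audio, chunk.lip_synced_frame)
        # Mark the segment as stable once it has been handed off
        await session.commit(chunk.segment_id)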
Configuration
# config.yaml
pipeline:
  latency_target: 3.0
  chunk_duration: 2.0

translator:
  model: zenlm/zen-omni-30b-instruct
  device: cuda:0

lip_sync:
  model: zenlm/zen-dub
  fps: 30
  resolution: 256

voices:
  anchor_01:
    profile: /voices/anchor_01.pt
    style: news_neutral
  anchor_02:
    profile: /voices/anchor_02.pt
    style: breaking_news
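The same settings can be loaded in Python and passed to the constructor shown in Quick Start. Whether ZenDubLive also accepts a config file directly is not documented here, so this sketch simply maps the YAML keys onto the constructor arguments:

import yaml
from zen_dub_live import ZenDubLive

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

pipeline = ZenDubLive(
    translator=cfg["translator"]["model"],
    lip_sync=cfg["lip_sync"]["model"],
    target_lang="es",
    latency_target=cfg["pipeline"]["latency_target"],
)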
Performance
Latency Breakdown
| Stage | Target | Actual |
|---|---|---|
| Audio Extraction | 50ms | ~45ms |
| ASR + Translation | 800ms | ~750ms |
| TTS Generation | 400ms | ~380ms |
| Lip-Sync Generation | 100ms/frame | ~90ms |
| Compositing | 10ms/frame | ~8ms |
| Total | 3.0s | ~2.8s |
Quality Metrics
| Metric | Target | Achieved |
|---|---|---|
| ASR WER | <10% | 7.2% |
| MT BLEU | >40 | 42.3 |
| TTS MOS | >4.0 | 4.2 |
| LSE-D (sync) | <8.0 | 7.8 |
| LSE-C (confidence) | >3.0 | 3.2 |
Deployment
On-Premises
# docker-compose.yml
services:
  zen-dub-live:
    image: zenlm/zen-dub-live:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - TRANSLATOR_MODEL=zenlm/zen-omni-30b-instruct
      - LIP_SYNC_MODEL=zenlm/zen-dub
    ports:
      - "8765:8765"   # WebSocket API
      - "50051:50051" # gRPC API
Hosted (Hanzo Cloud)
# Deploy to Hanzo Cloud
zen-dub-live deploy --region us-west \
--input-url rtmp://source/live \
--output-url rtmp://output/spanish
Documentation
- Whitepaper - Full technical details
- API Reference - Complete API documentation
- Deployment Guide - Production deployment
- Voice Training - Custom voice profiles
Resources
- Website
- Documentation
- Discord
- GitHub
Related Projects
Citation
@misc{zen-dub-live-2024,
title={Zen-Dub-Live: Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing},
author={Zen LM Team and Hanzo AI},
year={2024},
url={https://github.com/zenlm/zen-dub-live}
}
Organizations
- Hanzo AI Inc - Techstars '17 • Award-winning GenAI lab
- Zoo Labs Foundation - 501(c)(3) Non-Profit
License
Apache 2.0 • No data collection • Privacy-first