Zen-Dub-Live

Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing

Part of the Zen LM family - powering broadcast-grade AI dubbing

Powered by Zen Omni's Native End-to-End Architecture

Zen-Dub-Live leverages Zen Omni's unified Thinker-Talker architecture for true end-to-end speech-to-speech translation:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         ZEN OMNI                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  THINKER (Understanding)                                        β”‚
β”‚  β”œβ”€β”€ AuT Audio Encoder (650M) β†’ 12.5Hz token rate              β”‚
β”‚  β”œβ”€β”€ SigLIP2 Vision Encoder (540M) β†’ lip reading, video        β”‚
β”‚  └── MoE LLM (48L, 128 experts) β†’ multimodal reasoning         β”‚
β”‚                         ↓                                       β”‚
β”‚  TALKER (Speech Generation)                                     β”‚
β”‚  β”œβ”€β”€ MoE Transformer (20L, 128 experts)                        β”‚
β”‚  β”œβ”€β”€ MTP Module β†’ 16-codebook prediction per frame             β”‚
β”‚  └── Code2Wav ConvNet β†’ streaming 24kHz waveform               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key: the entire pipeline is native. Audio understanding, translation, and speech synthesis all happen end-to-end; no separate ASR or TTS models are needed.

  • First-packet latency: 234ms (audio) / 547ms (video)
  • Built-in voices: cherry (female), noah (male)
  • Languages: 119 for text, 19 for speech input; speech output via the 2 built-in voices

See: Zen Omni Technical Report
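
As a back-of-envelope illustration of the figures above (not part of the zen_dub_live API), the 12.5Hz token rate and 24kHz output stream translate into the following sizes for a typical chunk:

# Back-of-envelope arithmetic from the figures quoted above; illustrative only.
AUDIO_TOKEN_RATE_HZ = 12.5        # AuT encoder emits 12.5 audio tokens per second
OUTPUT_SAMPLE_RATE_HZ = 24_000    # Code2Wav streams a 24kHz waveform

def audio_tokens(duration_s: float) -> int:
    """Audio tokens the Thinker consumes for duration_s seconds of input speech."""
    return round(duration_s * AUDIO_TOKEN_RATE_HZ)

def output_samples(duration_s: float) -> int:
    """Waveform samples Code2Wav emits for duration_s seconds of dubbed speech."""
    return round(duration_s * OUTPUT_SAMPLE_RATE_HZ)

# A 2-second chunk (the default chunk_duration in config.yaml) is
# 25 tokens in and 48,000 samples out.
print(audio_tokens(2.0), output_samples(2.0))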

Adding Custom Voices

Zen-Dub-Live supports voice cloning for anchor-specific voices:

from zen_dub_live import AnchorVoice

# Clone a voice from reference audio (10-30 seconds recommended)
custom_voice = AnchorVoice.from_audio(
    "anchor_audio_sample.wav",
    name="anchor_01"
)

# Register for use in pipeline
pipeline.register_voice(custom_voice)

# Use in session
session = await pipeline.create_session(
    anchor_voice="anchor_01",
    ...
)

Voice profiles are stored as embeddings and can be saved/loaded:

# Save voice profile
custom_voice.save("voices/anchor_01.pt")

# Load voice profile
anchor_voice = AnchorVoice.load("voices/anchor_01.pt")
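
Putting it together, a saved profile can be registered on an existing pipeline and then referenced by name when the session is created. The sketch below reuses the URLs from the CLI examples; whether a loaded profile keeps the name it was cloned with is an assumption.

# Sketch: register a previously saved profile and start a session with it.
from zen_dub_live import AnchorVoice

async def start_with_saved_voice(pipeline):
    voice = AnchorVoice.load("voices/anchor_01.pt")
    pipeline.register_voice(voice)   # assumption: the profile keeps its cloned name
    return await pipeline.create_session(
        input_url="rtmp://source.example.com/live",
        output_url="rtmp://output.example.com/spanish",
        anchor_voice="anchor_01",
    )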

Overview

Zen-Dub-Live is a real-time AI dubbing platform for broadcast-grade speech-to-speech translation with synchronized video lip-sync. The system ingests live video and audio, translates speech, synthesizes anchor-specific voices, and re-renders mouth regions so that lip movements match the translated speechβ€”all under live broadcast latency constraints.

Key Specifications

Attribute       Target
-------------   ----------------------------------
Latency         2.5–3.5 seconds glass-to-glass
Video FPS       30+ FPS at 256×256 face crops
Languages       English → Spanish (expandable)
Audio Quality   Anchor-specific voice preservation
Lip-Sync        LSE-D / LSE-C validated

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         ZEN-DUB-LIVE PIPELINE                            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                      ZEN-LIVE                                     β”‚   β”‚
β”‚  β”‚  β€’ WebRTC/WHIP/WHEP streaming (github.com/zenlm/zen-live)        β”‚   β”‚
β”‚  β”‚  β€’ SDI/IP ingest (SMPTE 2110, NDI, RTMP, SRT)                    β”‚   β”‚
β”‚  β”‚  β€’ A/V sync with PTP reference                                    β”‚   β”‚
β”‚  β”‚  β€’ VAD-aware chunking + backpressure management                   β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                      ZEN OMNI                                     β”‚   β”‚
β”‚  β”‚  β€’ Multimodal ASR (audio + lip reading)                          β”‚   β”‚
β”‚  β”‚  β€’ English β†’ Spanish translation                                  β”‚   β”‚
β”‚  β”‚  β€’ Anchor-specific TTS                                            β”‚   β”‚
β”‚  β”‚  β€’ Viseme/prosody generation                                      β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                       ZEN DUB                                     β”‚   β”‚
β”‚  β”‚  β€’ VAE latent-space face encoding                                β”‚   β”‚
β”‚  β”‚  β€’ One-step U-Net lip inpainting                                 β”‚   β”‚
β”‚  β”‚  β€’ Identity-preserving composition                                β”‚   β”‚
β”‚  β”‚  β€’ 30+ FPS real-time generation                                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              ↓                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    OUTPUT MULTIPLEXING                            β”‚   β”‚
β”‚  β”‚  β€’ Dubbed video + audio composite                                β”‚   β”‚
β”‚  β”‚  β€’ Fallback: audio-only dubbing                                  β”‚   β”‚
β”‚  β”‚  β€’ Distribution to downstream systems                             β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Components

1. Zen Omni - Hypermodal Language Model

  • Multimodal ASR with lip-reading enhancement
  • Domain-tuned MT for news/broadcast content
  • Anchor-specific Spanish TTS
  • Viseme/prosody generation for lip-sync control

2. Zen Dub - Neural Lip-Sync

  • VAE latent-space face encoding
  • One-step U-Net inpainting (no diffusion steps)
  • Identity-preserving mouth region modification
  • Real-time composite generation

3. Hanzo Orchestration Layer

  • Live SDI/IP feed ingest
  • A/V synchronization with PTP
  • VAD-aware semantic chunking (see the sketch after this list)
  • Health monitoring and fallbacks
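
Below is a minimal sketch of the VAD-aware chunking step: it buffers mono PCM audio and cuts a chunk at the first low-energy frame once roughly chunk_duration seconds of speech have accumulated, so the translator receives phrase-shaped segments. The orchestration layer's actual chunker is not published; this is a simplified energy-threshold stand-in, and the frame size and threshold are assumptions.

# Simplified stand-in for VAD-aware semantic chunking (illustrative only).
import numpy as np

SAMPLE_RATE = 16_000
FRAME_MS = 20
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000
ENERGY_THRESHOLD = 1e-3      # assumed; tune per source loudness
TARGET_CHUNK_S = 2.0         # matches chunk_duration in config.yaml

def vad_chunks(samples: np.ndarray):
    """Yield speech chunks from float32 mono PCM in [-1, 1] at 16 kHz,
    cutting at low-energy frame boundaries."""
    buf, voiced_s = [], 0.0
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        buf.append(frame)
        if float(np.mean(frame ** 2)) > ENERGY_THRESHOLD:
            voiced_s += FRAME_MS / 1000
        elif voiced_s >= TARGET_CHUNK_S:
            yield np.concatenate(buf)      # enough speech collected; cut at silence
            buf, voiced_s = [], 0.0
    if buf and voiced_s > 0:
        yield np.concatenate(buf)          # flush trailing speech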

Quick Start

Installation

pip install zen-dub-live

Basic Usage

from zen_dub_live import ZenDubLive

# Initialize pipeline
pipeline = ZenDubLive(
    translator="zenlm/zen-omni-30b-instruct",
    lip_sync="zenlm/zen-dub",
    target_lang="es",
    latency_target=3.0,
)

# Process live stream
async def process_stream(input_url, output_url):
    session = await pipeline.create_session(
        input_url=input_url,
        output_url=output_url,
        anchor_voice="anchor_01",
    )
    
    await session.start()
    # Pipeline runs until stopped
    await session.wait_for_completion()
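
To actually run the coroutine above, wrap it with asyncio; the URLs here reuse the CLI examples that follow.

import asyncio

# Drive the coroutine defined above (URLs match the CLI examples).
asyncio.run(process_stream(
    "rtmp://source.example.com/live",
    "rtmp://output.example.com/spanish",
))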

CLI Usage

# Start live dubbing session
zen-dub-live start \
    --input rtmp://source.example.com/live \
    --output rtmp://output.example.com/spanish \
    --lang es \
    --anchor-voice anchor_01

# Monitor session
zen-dub-live status --session-id abc123

# Stop session
zen-dub-live stop --session-id abc123

API Reference

Session Lifecycle

CreateSession

session = await pipeline.create_session(
    input_url="rtmp://...",
    output_url="rtmp://...",
    target_lang="es",
    anchor_voice="anchor_01",
    latency_target=3.0,
)

StreamIngest (WebSocket/gRPC)

async for chunk in session.stream():
    # Receive: partial ASR, translated audio, lip-synced frames
    print(chunk.translation_text)
    yield chunk.dubbed_audio, chunk.lip_synced_frame

CommitOutput

await session.commit(segment_id)  # Mark segment as stable for playout
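
A minimal consume-and-commit loop combining the two calls above might look like the following; chunk.segment_id and chunk.is_final are assumed field names used for illustration, and playout stands in for whatever downstream sink receives the dubbed output.

# Sketch: consume streamed chunks and commit finished segments for playout.
# chunk.segment_id / chunk.is_final are assumed field names; playout is a
# stand-in for the downstream output sink.
async def consume(session, playout):
    async for chunk in session.stream():
        playout.write(chunk.dubbed_audio, chunk.lip_synced_frame)
        if chunk.is_final:                          # assumption: marks end of a segment
            await session.commit(chunk.segment_id)  # mark stable for playout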

Configuration

# config.yaml
pipeline:
  latency_target: 3.0
  chunk_duration: 2.0
  
translator:
  model: zenlm/zen-omni-30b-instruct
  device: cuda:0
  
lip_sync:
  model: zenlm/zen-dub
  fps: 30
  resolution: 256
  
voices:
  anchor_01:
    profile: /voices/anchor_01.pt
    style: news_neutral
  anchor_02:
    profile: /voices/anchor_02.pt
    style: breaking_news
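
The same settings can be applied programmatically. The sketch below parses config.yaml with PyYAML and maps it onto the constructor arguments shown under Basic Usage; there is no documented from_config() helper, so the mapping here is an assumption.

# Minimal sketch: build a pipeline from config.yaml. Field names mirror the
# YAML above; mapping them onto ZenDubLive's constructor is an assumption.
import yaml
from zen_dub_live import ZenDubLive

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

pipeline = ZenDubLive(
    translator=cfg["translator"]["model"],
    lip_sync=cfg["lip_sync"]["model"],
    target_lang="es",                                  # not part of config.yaml; set explicitly
    latency_target=cfg["pipeline"]["latency_target"],
)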

Performance

Latency Breakdown

Stage                 Target         Actual
-----                 ------         ------
Audio Extraction      50 ms          ~45 ms
ASR + Translation     800 ms         ~750 ms
TTS Generation        400 ms         ~380 ms
Lip-Sync Generation   100 ms/frame   ~90 ms/frame
Compositing           10 ms/frame    ~8 ms/frame
Total                 3.0 s          ~2.8 s

Quality Metrics

Metric               Target   Achieved
------               ------   --------
ASR WER              <10%     7.2%
MT BLEU              >40      42.3
TTS MOS              >4.0     4.2
LSE-D (sync)         <8.0     7.8
LSE-C (confidence)   >3.0     3.2

Deployment

On-Premises

# docker-compose.yml
services:
  zen-dub-live:
    image: zenlm/zen-dub-live:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - TRANSLATOR_MODEL=zenlm/zen-omni-30b-instruct
      - LIP_SYNC_MODEL=zenlm/zen-dub
    ports:
      - "8765:8765"  # WebSocket API
      - "50051:50051"  # gRPC API

Hosted (Hanzo Cloud)

# Deploy to Hanzo Cloud
zen-dub-live deploy --region us-west \
    --input-url rtmp://source/live \
    --output-url rtmp://output/spanish

Documentation

Resources

Related Projects

Citation

@misc{zen-dub-live-2024,
  title={Zen-Dub-Live: Real-Time Speech-to-Speech Translation and Lip-Synchronized Video Dubbing},
  author={Zen LM Team and Hanzo AI},
  year={2024},
  url={https://github.com/zenlm/zen-dub-live}
}

Organizations

License

Apache 2.0 β€’ No data collection β€’ Privacy-first
