Zen Translator

Real-time multimodal translation with voice cloning and lip synchronization.

Overview

Zen Translator combines three state-of-the-art models into a sub-second end-to-end pipeline:

| Component     | Model              | Parameters          | Latency   |
|---------------|--------------------|---------------------|-----------|
| Translation   | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms    |
| Voice Cloning | CosyVoice 2.0      | 0.5B                | ~150ms    |
| Lip Sync      | Wav2Lip            | ~100M               | ~200ms    |
| Total         | -                  | -                   | <1 second |
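
The sub-second total is simply the sum of the per-stage latencies, assuming the stages run sequentially on each chunk:

# Per-stage latencies from the table above (milliseconds).
STAGE_LATENCY_MS = {"translation": 500, "voice_cloning": 150, "lip_sync": 200}

total_ms = sum(STAGE_LATENCY_MS.values())
print(f"end-to-end: ~{total_ms}ms")  # ~850ms, i.e. under 1 second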

Features

  • 18 input languages plus 6 Chinese dialects (Cantonese, Shanghainese, and more)
  • 10 output languages with high-quality voice synthesis
  • 3-second voice cloning - Preserve speaker characteristics with minimal reference audio
  • Real-time streaming - WebSocket API with <500ms first packet latency
  • Lip synchronization - Natural video dubbing for translated content
  • News anchor training - Domain-specific finetuning for broadcast translation

Quick Start

# Clone repository
git clone https://github.com/zenlm/zen-translator.git
cd zen-translator

# Install with uv
make install

# Download models (~62GB full, ~16GB quantized)
make download
# OR
make download-quantized

# Start server
make serve

Usage

Python API

import asyncio

from zen_translator import TranslationPipeline, TranslatorConfig

async def main():
    config = TranslatorConfig(target_language="es")
    pipeline = TranslationPipeline(config)
    await pipeline.load()

    # Register speaker voice (3+ seconds of reference audio)
    await pipeline.register_speaker("john_doe", "reference.wav")

    # Translate video with voice cloning and lip sync
    result = await pipeline.translate_video(
        video="news.mp4",
        target_lang="es",
        speaker_id="john_doe",
        output_path="news_es.mp4",
    )

asyncio.run(main())
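
Because every pipeline call is a coroutine, several videos can be dubbed over one loaded pipeline with standard asyncio. A minimal sketch using only the calls shown above; it assumes the pipeline tolerates concurrent translate_video calls, which this README does not confirm:

import asyncio

from zen_translator import TranslationPipeline, TranslatorConfig

async def dub_all(paths: list[str]) -> None:
    pipeline = TranslationPipeline(TranslatorConfig(target_language="es"))
    await pipeline.load()
    await pipeline.register_speaker("john_doe", "reference.wav")

    # Fan the videos out over the already-loaded pipeline.
    await asyncio.gather(*(
        pipeline.translate_video(
            video=path,
            target_lang="es",
            speaker_id="john_doe",
            output_path=path.replace(".mp4", "_es.mp4"),
        )
        for path in paths
    ))

asyncio.run(dub_all(["news1.mp4", "news2.mp4"]))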

CLI

# Translate a video
zen-translate video.mp4 -o translated.mp4 -t spanish

# Register a speaker
zen-translate register-speaker john_doe reference.wav

# Start the API server
zen-serve --host 0.0.0.0 --port 8000

REST API

# Translate audio
curl -X POST http://localhost:8000/translate/audio \
  -F "[email protected]" \
  -F "target_lang=es"

# Translate video with lip sync
curl -X POST http://localhost:8000/translate/video \
  -F "[email protected]" \
  -F "target_lang=zh"

WebSocket (Real-time)

const ws = new WebSocket('ws://localhost:8000/ws/translate');

ws.onopen = () => {
    // Configure the session before streaming audio; sending on a
    // still-connecting socket would throw.
    ws.send(JSON.stringify({ target_lang: 'es', speaker_id: 'my_voice' }));
    ws.send(audioChunk);  // Then send audio chunks
};

ws.onmessage = (event) => {
    // Receive translated audio chunks
};
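
The same session can be driven from Python with the third-party websockets package. The framing below follows the JS example (one JSON config message, then binary audio); the 3200-byte chunk size and the assumption that replies arrive as binary frames are illustrative, not documented:

import asyncio
import json

import websockets  # pip install websockets

async def stream_translate(wav_path: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws/translate") as ws:
        # Configure the session first, as in the JS client above.
        await ws.send(json.dumps({"target_lang": "es", "speaker_id": "my_voice"}))

        # Stream the audio in small chunks.
        with open(wav_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)

        # Collect translated audio chunks until the server closes.
        async for message in ws:
            print(f"received {len(message)} bytes")

asyncio.run(stream_translate("input.wav"))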

Language Support

Input Languages (18 + 6 dialects)

| Language   | Code |
|------------|------|
| English    | en   |
| Chinese    | zh   |
| Japanese   | ja   |
| Korean     | ko   |
| Spanish    | es   |
| French     | fr   |
| German     | de   |
| Italian    | it   |
| Portuguese | pt   |
| Russian    | ru   |
| Arabic     | ar   |
| Hindi      | hi   |
| Thai       | th   |
| Vietnamese | vi   |
| Indonesian | id   |
| Malay      | ms   |
| Turkish    | tr   |
| Polish     | pl   |

Dialects

| Dialect      | Code |
|--------------|------|
| Cantonese    | yue  |
| Shanghainese | wuu  |
| Xiang        | hsn  |
| Min Nan      | nan  |
| Hakka        | hak  |
| Min Dong     | cdo  |

Output Languages (10)

English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
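
For programmatic use, the ten output languages correspond to the same codes listed in the input table:

# Output languages keyed by the ISO codes from the table above.
OUTPUT_LANGUAGES = {
    "en": "English", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian",
}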

Model Requirements

| Model         | VRAM  | Disk  |
|---------------|-------|-------|
| Qwen3-Omni    | 16GB  | 60GB  |
| CosyVoice 2.0 | 2GB   | 1GB   |
| Wav2Lip       | 2GB   | 500MB |
| Total         | ~20GB | ~62GB |

For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
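
The README does not show how the quantized weights are selected at load time; if TranslatorConfig exposes a switch for it, usage might look like the sketch below. The quantized flag is a hypothetical illustration, so check the actual TranslatorConfig signature before relying on it:

from zen_translator import TranslationPipeline, TranslatorConfig

# Hypothetical flag for loading the 4-bit Qwen3-Omni weights fetched
# by `make download-quantized`; not a confirmed parameter.
config = TranslatorConfig(target_language="es", quantized=True)
pipeline = TranslationPipeline(config)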

Training

News Anchor Adaptation

# Build dataset from news channels (CNN, BBC, NHK, DW)
make dataset-build

# Train news anchor adaptation
make train-anchor

# Or with ms-swift directly
swift sft --config outputs/anchor/train_config.yaml

Citation

@software{zen_translator,
  author = {Hanzo AI and Zen LM},
  title = {Zen Translator: Real-time Multimodal Translation with Voice Cloning},
  year = {2025},
  url = {https://github.com/zenlm/zen-translator}
}

License

Apache 2.0
