Zen Translator

Real-time multimodal translation with voice cloning and lip synchronization.

Overview

Zen Translator combines three state-of-the-art models into a sub-second end-to-end pipeline:

| Component     | Model              | Parameters          | Latency   |
|---------------|--------------------|---------------------|-----------|
| Translation   | Qwen3-Omni-30B-A3B | 30B (3B active MoE) | ~500ms    |
| Voice Cloning | CosyVoice 2.0      | 0.5B                | ~150ms    |
| Lip Sync      | Wav2Lip            | ~100M               | ~200ms    |
| Total         | -                  | -                   | <1 second |
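
The sub-second total is simply the sum of the per-stage latencies, assuming the stages run sequentially on each chunk:

# Per-stage latencies from the table above (milliseconds).
STAGE_LATENCY_MS = {"translation": 500, "voice_cloning": 150, "lip_sync": 200}

total_ms = sum(STAGE_LATENCY_MS.values())
print(f"end-to-end: ~{total_ms}ms")  # ~850ms, i.e. under 1 second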

Features

  • 18 input languages plus 6 Chinese dialects (Cantonese, Shanghainese, and more)
  • 10 output languages with high-quality voice synthesis
  • 3-second voice cloning - Preserve speaker characteristics with minimal reference audio
  • Real-time streaming - WebSocket API with <500ms first packet latency
  • Lip synchronization - Natural video dubbing for translated content
  • News anchor training - Domain-specific finetuning for broadcast translation

Quick Start

# Clone repository
git clone https://github.com/zenlm/zen-translator.git
cd zen-translator

# Install with uv
make install

# Download models (~62GB full, ~16GB quantized)
make download
# OR
make download-quantized

# Start server
make serve

Usage

Python API

import asyncio

from zen_translator import TranslationPipeline, TranslatorConfig

async def main():
    config = TranslatorConfig(target_language="es")
    pipeline = TranslationPipeline(config)
    await pipeline.load()

    # Register speaker voice (3+ seconds of reference audio)
    await pipeline.register_speaker("john_doe", "reference.wav")

    # Translate video with voice cloning and lip sync
    result = await pipeline.translate_video(
        video="news.mp4",
        target_lang="es",
        speaker_id="john_doe",
        output_path="news_es.mp4",
    )

asyncio.run(main())
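
Because every pipeline call is a coroutine, several videos can be dubbed over one loaded pipeline with standard asyncio. A minimal sketch using only the calls shown above; it assumes the pipeline tolerates concurrent translate_video calls, which this README does not confirm:

import asyncio

from zen_translator import TranslationPipeline, TranslatorConfig

async def dub_all(paths: list[str]) -> None:
    pipeline = TranslationPipeline(TranslatorConfig(target_language="es"))
    await pipeline.load()
    await pipeline.register_speaker("john_doe", "reference.wav")

    # Fan the videos out over the already-loaded pipeline.
    await asyncio.gather(*(
        pipeline.translate_video(
            video=path,
            target_lang="es",
            speaker_id="john_doe",
            output_path=path.replace(".mp4", "_es.mp4"),
        )
        for path in paths
    ))

asyncio.run(dub_all(["news1.mp4", "news2.mp4"]))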

CLI

# Translate a video
zen-translate video.mp4 -o translated.mp4 -t spanish

# Register a speaker
zen-translate register-speaker john_doe reference.wav

# Start the API server
zen-serve --host 0.0.0.0 --port 8000

REST API

# Translate audio
curl -X POST http://localhost:8000/translate/audio \
  -F "[email protected]" \
  -F "target_lang=es"

# Translate video with lip sync
curl -X POST http://localhost:8000/translate/video \
  -F "[email protected]" \
  -F "target_lang=zh"

WebSocket (Real-time)

const ws = new WebSocket('ws://localhost:8000/ws/translate');

ws.onopen = () => {
    // Configure the session before streaming audio; sending on a
    // still-connecting socket would throw.
    ws.send(JSON.stringify({ target_lang: 'es', speaker_id: 'my_voice' }));
    ws.send(audioChunk);  // Then send audio chunks
};

ws.onmessage = (event) => {
    // Receive translated audio chunks
};
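
The same session can be driven from Python with the third-party websockets package. The framing below follows the JS example (one JSON config message, then binary audio); the 3200-byte chunk size and the assumption that replies arrive as binary frames are illustrative, not documented:

import asyncio
import json

import websockets  # pip install websockets

async def stream_translate(wav_path: str) -> None:
    async with websockets.connect("ws://localhost:8000/ws/translate") as ws:
        # Configure the session first, as in the JS client above.
        await ws.send(json.dumps({"target_lang": "es", "speaker_id": "my_voice"}))

        # Stream the audio in small chunks.
        with open(wav_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)

        # Collect translated audio chunks until the server closes.
        async for message in ws:
            print(f"received {len(message)} bytes")

asyncio.run(stream_translate("input.wav"))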

Language Support

Input Languages (18 + 6 dialects)

| Language   | Code |
|------------|------|
| English    | en   |
| Chinese    | zh   |
| Japanese   | ja   |
| Korean     | ko   |
| Spanish    | es   |
| French     | fr   |
| German     | de   |
| Italian    | it   |
| Portuguese | pt   |
| Russian    | ru   |
| Arabic     | ar   |
| Hindi      | hi   |
| Thai       | th   |
| Vietnamese | vi   |
| Indonesian | id   |
| Malay      | ms   |
| Turkish    | tr   |
| Polish     | pl   |

Dialects

| Dialect      | Code |
|--------------|------|
| Cantonese    | yue  |
| Shanghainese | wuu  |
| Xiang        | hsn  |
| Min Nan      | nan  |
| Hakka        | hak  |
| Min Dong     | cdo  |

Output Languages (10)

English, Chinese, Japanese, Korean, Spanish, French, German, Italian, Portuguese, Russian
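
For programmatic use, the ten output languages correspond to the same codes listed in the input table:

# Output languages keyed by the ISO codes from the table above.
OUTPUT_LANGUAGES = {
    "en": "English", "zh": "Chinese", "ja": "Japanese", "ko": "Korean",
    "es": "Spanish", "fr": "French", "de": "German", "it": "Italian",
    "pt": "Portuguese", "ru": "Russian",
}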

Model Requirements

| Model         | VRAM  | Disk  |
|---------------|-------|-------|
| Qwen3-Omni    | 16GB  | 60GB  |
| CosyVoice 2.0 | 2GB   | 1GB   |
| Wav2Lip       | 2GB   | 500MB |
| Total         | ~20GB | ~62GB |

For smaller deployments, use 4-bit quantized Qwen3-Omni (~15GB disk).
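
The README does not show how the quantized weights are selected at load time; if TranslatorConfig exposes a switch for it, usage might look like the sketch below. The quantized flag is a hypothetical illustration, so check the actual TranslatorConfig signature before relying on it:

from zen_translator import TranslationPipeline, TranslatorConfig

# Hypothetical flag for loading the 4-bit Qwen3-Omni weights fetched
# by `make download-quantized`; not a confirmed parameter.
config = TranslatorConfig(target_language="es", quantized=True)
pipeline = TranslationPipeline(config)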

Training

News Anchor Adaptation

# Build dataset from news channels (CNN, BBC, NHK, DW)
make dataset-build

# Train news anchor adaptation
make train-anchor

# Or with ms-swift directly
swift sft --config outputs/anchor/train_config.yaml

Citation

@software{zen_translator,
  author = {Hanzo AI and Zen LM},
  title = {Zen Translator: Real-time Multimodal Translation with Voice Cloning},
  year = {2025},
  url = {https://github.com/zenlm/zen-translator}
}

License

Apache 2.0
