# Zen Foley

Zen Foley is a professional-grade AI sound-effect generation model for video content. Built on HunyuanVideo-Foley, it generates high-fidelity audio synchronized with video scenes, making it well suited for filmmaking, game development, and content creation.
## Overview

Zen Foley generates professional sound effects synchronized with video content:

- Video-to-Audio: Generate sound effects from video scenes
- Multi-Scenario Sync: High-quality audio for complex scenes
- 48kHz Hi-Fi: Professional-grade audio output
- Multi-Modal Balance: Balances visual and textual cues when both are provided
- Text Control: Optional text descriptions for precise control
- Efficient: XL model with offload support for lower VRAM
## Model Details

- Model Type: Video-to-Audio Generation (Diffusion)
- Architecture: Multimodal Diffusion Transformer
- License: Apache 2.0
- Input: Video (MP4), optional text prompt
- Output: Audio (48kHz WAV)
- Duration: Up to 10 seconds
- Developed by: Zen AI Team
- Based on: HunyuanVideo-Foley by Tencent
## Capabilities

### Multi-Scenario Sound Generation

- Footsteps, ambience, nature sounds
- Vehicle and mechanical sounds
- Action and impact effects
- Musical elements and instruments
- Human vocalizations and speech
- Complex multi-layered soundscapes

### Audio-Visual Synchronization

- Frame-accurate timing
- Motion-sound correspondence
- Spatial audio positioning
- Intensity matching
- Seamless transitions
## Hardware Requirements

### Minimum (XL Model with Offloading)

- GPU: 12GB VRAM (RTX 3080, RTX 4070 Ti)
- RAM: 16GB system memory
- Storage: 20GB for model

### Recommended

- GPU: 24GB VRAM (RTX 4090, RTX 3090)
- RAM: 32GB system memory
- Storage: 50GB for model and cache

### Optimal

- GPU: 40GB+ VRAM (A100)
- RAM: 64GB system memory
- Runs without offloading for the fastest generation
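
If you are unsure which tier your GPU falls into, a quick runtime check can decide whether to enable offloading. This is a minimal sketch; the 16 GB threshold is our assumption based on the tiers above, not an official cutoff:

```python
import torch

def should_offload(min_vram_gb: float = 16.0) -> bool:
    """Return True if the visible GPU has less VRAM than min_vram_gb."""
    if not torch.cuda.is_available():
        return True  # no GPU visible: offloading (or CPU fallback) is the only option
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    return total_bytes / 1024**3 < min_vram_gb

print("enable_offload =", should_offload())
```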
## Installation

```bash
# Clone repository
git clone https://github.com/zenlm/zen-foley.git
cd zen-foley

# Create environment
conda create -n zen-foley python=3.10
conda activate zen-foley

# Install dependencies
pip install -r requirements.txt

# Download model
huggingface-cli download zenlm/zen-foley --local-dir ./models
```
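
Before running inference, a quick sanity check can confirm the environment is set up, assuming the package installs under the `zen_foley` module name used in the Python API section below:

```python
import torch
from zen_foley import ZenFoleyPipeline  # module name as used in the Python API section

print("CUDA available:", torch.cuda.is_available())
print("ZenFoleyPipeline imported OK")
```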
## Usage

### Basic Video-to-Audio

```bash
python infer.py \
    --video input.mp4 \
    --output output.wav \
    --model_path ./models
```

### With Text Prompt

```bash
python infer.py \
    --video input.mp4 \
    --prompt "Footsteps on wooden floor, gentle rain outside" \
    --output output.wav
```

### With CPU Offloading (Lower VRAM)

```bash
python infer.py \
    --video input.mp4 \
    --output output.wav \
    --enable_offload
```
## Python API

```python
from zen_foley import ZenFoleyPipeline

# Initialize
pipeline = ZenFoleyPipeline.from_pretrained(
    "zenlm/zen-foley",
    enable_offload=True  # For lower VRAM
)

# Generate audio
audio = pipeline(
    video_path="input.mp4",
    prompt="Thunder and rain storm",  # Optional
    duration=10.0,
    sampling_rate=48000
)

# Save
audio.save("output.wav")
```
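
The pipeline writes a standalone WAV; to hear the result in context you still need to mux it back into the video. A minimal sketch using ffmpeg via subprocess (assumes ffmpeg is installed and on PATH):

```python
import subprocess

# Copy the video stream unchanged and replace the audio track with the generated WAV
subprocess.run([
    "ffmpeg", "-y",
    "-i", "input.mp4",        # original video
    "-i", "output.wav",       # generated Foley audio
    "-map", "0:v", "-map", "1:a",
    "-c:v", "copy",           # keep video as-is, no re-encode
    "-shortest",              # stop at the shorter of the two streams
    "with_foley.mp4",
], check=True)
```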
## Use Cases

### Film & Video Production

- Post-production sound design
- ADR replacement
- Ambience and Foley effects
- Quick prototyping

### Game Development

- Procedural audio generation
- Dynamic sound effects
- Cutscene audio
- Rapid iteration

### Content Creation

- YouTube videos
- TikTok/Shorts
- Podcasts with video
- Social media content

### Professional Audio

- Sound design
- Audio post-production
- Trailer editing
- Commercial production
## Training with Zen Gym

Fine-tune for custom sound styles:

```bash
cd /path/to/zen-gym
llamafactory-cli train \
    --config configs/zen_foley_lora.yaml \
    --dataset your_audio_video_dataset
```
## Inference with Zen Engine

Serve Zen Foley via API:

```bash
cd /path/to/zen-engine
cargo run --release -- serve \
    --model zenlm/zen-foley \
    --port 3690
```
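
Once the server is up, a client can request generation over HTTP. The exact route and payload depend on the Zen Engine build, so treat the endpoint path and field names below as placeholders, not the documented API:

```python
import requests

# NOTE: the /generate route and JSON fields here are hypothetical;
# check the Zen Engine documentation for the actual API surface.
resp = requests.post(
    "http://localhost:3690/generate",
    json={"video_path": "input.mp4", "prompt": "Thunder and rain storm"},
    timeout=300,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```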
## Advanced Features

### Precise Timing Control

```python
# Generate audio for a specific time range
audio = pipeline(
    video_path="input.mp4",
    start_time=5.0,   # Start at 5 seconds
    duration=8.0,     # Generate 8 seconds
    prompt="Car engine revving and accelerating"
)
```

### Multi-Track Generation

```python
# Generate separate audio tracks
tracks = pipeline.generate_multi_track(
    video_path="input.mp4",
    track_prompts={
        "ambience": "City street ambience",
        "effects": "Car horn and traffic",
        "music": "Background jazz music"
    }
)
```
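
Separate tracks are useful for balancing levels before a final bounce. A minimal mixing sketch with soundfile and numpy, assuming each returned track object supports the same `.save()` method shown in the Python API section (the gain values are illustrative):

```python
import numpy as np
import soundfile as sf

gains = {"ambience": 0.6, "effects": 1.0, "music": 0.4}  # illustrative mix levels
arrays = []
for name, audio in tracks.items():
    audio.save(f"{name}.wav")               # .save() as in the Python API section
    samples, rate = sf.read(f"{name}.wav")
    arrays.append(gains.get(name, 1.0) * samples)

n = min(len(a) for a in arrays)             # trim all tracks to the shortest
mix = sum(a[:n] for a in arrays)
mix = mix / max(1e-9, np.max(np.abs(mix)))  # normalize to avoid clipping
sf.write("mix.wav", mix, rate)
```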

### Batch Processing

```python
# Process multiple videos
videos = ["video1.mp4", "video2.mp4", "video3.mp4"]
audios = pipeline.batch_generate(videos, batch_size=4)
```
## Performance

### Generation Speed

- RTX 4090: ~15s for 10-second audio
- RTX 4090 (offload): ~25s for 10-second audio
- RTX 3080 (offload): ~40s for 10-second audio
- A100: ~10s for 10-second audio
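
Throughput varies with resolution, clip length, and offload settings, so it is worth timing a run on your own hardware. A minimal sketch, with one warm-up call so model loading is excluded from the measurement:

```python
import time

pipeline(video_path="input.mp4", duration=10.0)  # warm-up: excludes load/setup cost

start = time.perf_counter()
audio = pipeline(video_path="input.mp4", duration=10.0)
print(f"10 s of audio generated in {time.perf_counter() - start:.1f} s")
```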

### Quality Metrics

| Metric | Score |
|---|---|
| FAD (lower is better) | 2.34 |
| KLD (lower is better) | 1.87 |
| IS (higher is better) | 7.21 |
## Prompt Engineering

### Effective Prompts

- Describe specific sounds: "footsteps", "door closing", "glass breaking"
- Include environment: "in large hall", "outdoors", "underwater"
- Specify intensity: "loud", "gentle", "distant", "close-up"
- Mention materials: "wooden floor", "metal surface", "carpet"
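
One convenient way to apply these guidelines is to assemble prompts from the four ingredients above. A small illustrative helper; the function and its phrasing are ours, not part of the Zen Foley API:

```python
def build_prompt(sound: str, environment: str = "",
                 intensity: str = "", material: str = "") -> str:
    """Compose a Foley prompt from sound, intensity, material, and environment cues."""
    parts = [p for p in (intensity, sound, material and f"on {material}", environment) if p]
    return ", ".join(parts)

print(build_prompt("footsteps", environment="in large hall",
                   intensity="gentle", material="wooden floor"))
# -> "gentle, footsteps, on wooden floor, in large hall"
```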

### Examples

```
# Environmental
"Heavy rain on roof, thunder in distance, wind through trees"

# Action
"Sword clashing, grunts, footsteps on stone floor"

# Mechanical
"Car engine starting, revving, tires screeching, horn"

# Nature
"Ocean waves crashing, seagulls calling, wind blowing"
```
## Limitations

- Maximum 10-second duration per generation (longer videos can be processed in consecutive chunks; see the sketch below)
- Requires high-quality input video
- May struggle with very complex soundscapes
- Limited speech generation
- Music generation works best for background/ambience
- Requires significant GPU memory
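
For clips longer than the 10-second limit, one workaround is to generate audio in consecutive windows using the `start_time`/`duration` controls from Advanced Features and concatenate the results. A minimal sketch, assuming the pipeline and `.save()` behave as in the Python API section; chunk boundaries may be audible, and a crossfade (not shown) would smooth them:

```python
import numpy as np
import soundfile as sf

def foley_for_long_video(pipeline, video_path: str, total_seconds: float,
                         chunk: float = 10.0, out_path: str = "long_output.wav"):
    """Generate Foley in <=10 s windows and concatenate them into one WAV."""
    pieces, rate = [], 48000
    t = 0.0
    while t < total_seconds:
        dur = min(chunk, total_seconds - t)
        audio = pipeline(video_path=video_path, start_time=t, duration=dur)
        audio.save("_chunk.wav")             # .save() as in the Python API section
        samples, rate = sf.read("_chunk.wav")
        pieces.append(samples)
        t += dur
    sf.write(out_path, np.concatenate(pieces), rate)

# e.g. foley_for_long_video(pipeline, "input.mp4", total_seconds=25.0)
```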
## Ethical Considerations

- Generated audio should be labeled as AI-generated
- Do not use the model to create deceptive or deepfake audio
- Respect copyright and licensing
- Be mindful of potential misuse for misinformation
- Professional audio engineering is still recommended for final deliverables
- Consider the environmental impact of GPU usage
## Citation

```bibtex
@misc{zenfoley2025,
  title={Zen Foley: Professional AI Sound Effect Generation},
  author={Zen AI Team},
  year={2025},
  howpublished={\url{https://github.com/zenlm/zen-foley}}
}

@article{shan2025hunyuanvideo,
  title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation},
  author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
  journal={arXiv preprint arXiv:2508.16930},
  year={2025}
}
```
## Credits

Zen Foley is based on HunyuanVideo-Foley by Tencent Hunyuan. We thank the original authors for their excellent work in video-to-audio generation.

## Links

- GitHub: https://github.com/zenlm/zen-foley
- HuggingFace: https://huggingface.co/zenlm/zen-foley
- Organization: https://github.com/zenlm
- Zen Gym (Training): https://github.com/zenlm/zen-gym
- Zen Engine (Inference): https://github.com/zenlm/zen-engine
- Zen Director (Video): https://github.com/zenlm/zen-director

## License

Apache 2.0 License - see LICENSE for details.
Zen Foley - Professional AI sound design for video content
Part of the Zen AI ecosystem.
## Upstream Source

- Repository: https://github.com/Tencent/HunyuanVideo
- Base Model: HunyuanVideo-Foley
- License: See original repository for license details

## Changes in Zen LM

- Adapted for Zen AI ecosystem
- Fine-tuned for specific use cases
- Added training and inference scripts
- Integrated with Zen Gym and Zen Engine
- Enhanced documentation and examples