---
license: other
library_name: diffusers
pipeline_tag: image-to-video
tags:
- wan
- image-to-video
- video-generation
---

# WAN 2.1 FP16 - 720p Image-to-Video Model

High-fidelity 720p image-to-video generation model in full FP16 precision for maximum quality output.

## Model Description

WAN 2.1 is a state-of-the-art transformer-based diffusion model for image-to-video generation. This repository contains the **720p variant** in full FP16 (16-bit floating point) precision, providing the highest quality video generation with enhanced detail and clarity. The model transforms static images into dynamic video sequences with temporal consistency and cinematic quality.

**Key Capabilities**:
- Image-to-video generation at 720p resolution
- 14 billion parameter transformer architecture
- Full FP16 precision for maximum generation quality
- High temporal consistency across frames
- Compatible with camera control LoRAs (available separately)

## Repository Contents

```
wan21-fp16-720p/
└── diffusion_models/
    └── wan/
        └── wan21-i2v-720p-14b-fp16.safetensors  (31 GB)
```

**Total Repository Size**: ~31 GB

### Model Files

| File | Size | Description |
|------|------|-------------|
| `diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors` | 31 GB | WAN 2.1 Image-to-Video 720p transformer model (14B parameters, FP16) |

**Note**: This repository contains **only the 720p diffusion model**. For complete functionality, you will also need:
- **WAN 2.1 VAE** (available separately, ~243 MB)
- **Camera Control LoRAs** (optional, for cinematic camera movements, ~343 MB each)

## Hardware Requirements

### Minimum Requirements

- **VRAM**: 40GB+ (FP16 full precision at 720p)
- **Disk Space**: 31 GB for model storage
- **System RAM**: 32GB+ recommended
- **GPU**: High-end NVIDIA GPU with 40GB+ VRAM
  - Recommended: RTX A6000 (48GB), A100 (40/80GB)
  - Alternative: RTX 4090 (24GB) with memory optimizations

### Recommended Hardware

- **GPU**: NVIDIA A6000 48GB, A100 40/80GB, or RTX 6000 Ada
- **System RAM**: 64GB for optimal performance
- **Storage**: NVMe SSD for faster model loading
- **VRAM Optimization**: Enable gradient checkpointing and attention slicing for 24GB GPUs
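Before loading a 31 GB FP16 checkpoint, it can help to confirm that the visible GPU actually meets the VRAM requirement above. The snippet below is a minimal sketch using standard PyTorch CUDA queries; the 40 GB threshold simply mirrors the minimum listed in this README and is not enforced by the model itself.

```python
import torch

# Report the total VRAM of the default CUDA device and warn if it falls below
# the 40 GB recommended in this README for full FP16 720p inference.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total VRAM: {total_gb:.1f} GB")
    if total_gb < 40:
        print("Less than 40 GB VRAM detected: enable attention/VAE slicing and "
              "reduce the frame count, or consider the FP8 / 480p variants.")
else:
    print("No CUDA device detected; this model is impractical on CPU.")
```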
## Usage Examples

### Basic Image-to-Video Generation

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import export_to_video
import torch
from PIL import Image

# Load the WAN 2.1 720p FP16 model
pipe = DiffusionPipeline.from_single_file(
    "E:/huggingface/wan21-fp16-720p/diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Load WAN 2.1 VAE (must be downloaded separately)
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"
)

# Move to GPU
pipe.to("cuda")

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Generate video from image
video_frames = pipe(
    image=input_image,
    prompt="cinematic video with smooth motion",
    num_frames=24,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]

# Export video
export_to_video(video_frames, "output_video.mp4", fps=8)
```

### Memory-Optimized Usage (for 24GB GPUs)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
import torch
from PIL import Image

# Load model with memory optimizations
pipe = DiffusionPipeline.from_single_file(
    "E:/huggingface/wan21-fp16-720p/diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Load WAN 2.1 VAE (must be downloaded separately)
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"
)

# Enable memory-efficient attention
pipe.enable_attention_slicing(1)
pipe.enable_vae_slicing()

# Enable gradient checkpointing on the denoiser (if the pipeline exposes one)
denoiser = getattr(pipe, "unet", None) or getattr(pipe, "transformer", None)
if denoiser is not None and hasattr(denoiser, "enable_gradient_checkpointing"):
    denoiser.enable_gradient_checkpointing()

pipe.to("cuda")

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Generate with reduced memory footprint
video_frames = pipe(
    image=input_image,
    prompt="your prompt here",
    num_frames=16,           # Reduce frames for lower VRAM
    num_inference_steps=40,
    guidance_scale=7.5
).frames[0]
```

### Using with Camera Control LoRAs (Optional)

```python
# Load camera control LoRA (must be downloaded separately)
pipe.load_lora_weights(
    "E:/huggingface/wan21-loras/loras/wan/wan21-camera-rotation-rank16-v1.safetensors"
)

# Generate video with camera movement
video_frames = pipe(
    image=input_image,
    prompt="rotating camera around the subject, cinematic",
    num_frames=24,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]
```

## Model Specifications

- **Architecture**: Transformer-based diffusion model for image-to-video generation
- **Parameters**: 14 billion (14B)
- **Precision**: FP16 (16-bit floating point)
  - 1 sign bit, 5-bit exponent, 10-bit mantissa
  - Full numerical precision for maximum quality
- **Resolution**: 720p (1280x720)
- **Format**: SafeTensors (secure and efficient serialization)
- **Model Type**: Image-to-Video (I2V)
- **Framework Compatibility**: diffusers, PyTorch 2.0+

## Performance Tips

1. **Resolution and Quality**: This 720p model provides maximum detail and clarity but requires significant VRAM
2. **Memory Optimization**:
   - Enable attention slicing: `pipe.enable_attention_slicing(1)`
   - Enable VAE slicing: `pipe.enable_vae_slicing()`
   - Reduce frame count: Use 16-24 frames instead of 32+
3. **Inference Speed**:
   - Use FP16 dtype for optimal GPU utilization
   - Reduce inference steps (30-40) for faster generation with minimal quality loss; see the scheduler sketch after this list
4. **Prompt Engineering**:
   - Be specific about desired motion: "slow panning", "gentle zoom", "smooth transition"
   - Include cinematic keywords: "cinematic", "smooth", "professional"
   - Specify camera movements if using LoRAs: "rotating camera", "aerial view"
5. **Batch Generation**: Process one video at a time due to high VRAM requirements
6. **Storage**: Use NVMe SSD for faster model loading times
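The inference-speed tip above suggests reducing steps, and the troubleshooting section recommends a faster scheduler. The sketch below shows one generic way to do this with the standard diffusers scheduler API; `DPMSolverMultistepScheduler` is an illustrative choice, and it is an assumption, not verified for this checkpoint, that the pipeline's default scheduler config is compatible with it.

```python
from diffusers import DPMSolverMultistepScheduler

# Swap in a faster multistep solver, reusing the existing scheduler's config.
# Assumption: the pipeline loaded above ("pipe") accepts a drop-in scheduler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Fewer steps are usually sufficient with DPM-Solver++ style schedulers.
video_frames = pipe(
    image=input_image,
    prompt="cinematic video with smooth motion",
    num_frames=24,
    num_inference_steps=30,   # reduced from 50 with minimal quality loss
    guidance_scale=7.5
).frames[0]
```

If quality degrades noticeably at 30 steps, move back toward the 40-50 step range recommended elsewhere in this card.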
## Installation Requirements

```bash
# Install required dependencies
pip install diffusers transformers accelerate safetensors torch torchvision

# For video export functionality
pip install opencv-python imageio imageio-ffmpeg
```

**Python Environment**:
- Python 3.8+
- PyTorch 2.0+ with CUDA support
- diffusers >= 0.21.0
- transformers
- accelerate
- safetensors

## Version Comparison

### WAN 2.1 Precision Variants

| Variant | Size | VRAM | Quality | Speed | Use Case |
|---------|------|------|---------|-------|----------|
| **FP16 (this model)** | 31 GB | 40GB+ | Maximum | Standard | Research, archival quality, maximum fidelity |
| FP8 | 16 GB | 24GB+ | High | Faster | Production deployment, efficient inference |

### Resolution Variants

| Resolution | Model Size | VRAM | Quality | Details |
|------------|------------|------|---------|---------|
| 480p | 31 GB | 32GB+ | High | Balanced quality/performance |
| **720p (this model)** | 31 GB | 40GB+ | Maximum | Enhanced detail and clarity |

**When to use FP16 720p**:
- Maximum quality requirements
- Research and development
- Professional/commercial production
- Archival and reference generation
- GPU with 40GB+ VRAM available

**Consider alternatives if**:
- VRAM is limited to 24GB or less (use FP8 or 480p)
- Inference speed is critical (use FP8)
- Running on consumer GPUs (use FP8 480p)

## License

This model is released under a custom WAN license. Please review the license terms before use:
- Commercial use restrictions may apply
- Attribution requirements may be specified
- Refer to official WAN documentation for complete license terms

**License Type**: `other` (Custom WAN License)

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@software{wan21_fp16_720p,
  title={WAN 2.1 FP16 720p: High-Fidelity Image-to-Video Generation},
  year={2024},
  note={14B parameter transformer-based diffusion model for 720p video generation in full FP16 precision}
}
```

## Related Resources

### Official Resources

- **WAN Project**: Official model documentation and updates
- **Hugging Face Model Hub**: Community-shared models and discussions
- **diffusers Documentation**: https://huggingface.co/docs/diffusers

### Related Models

- **WAN 2.1 480p FP16**: Lower resolution variant with same precision (32GB VRAM)
- **WAN 2.1 FP8**: Quantized models for efficient deployment (24GB VRAM)
- **WAN 2.2**: Next generation with enhanced features and quality improvements
- **WAN 2.1 VAE**: Required for complete functionality (download separately)
- **WAN Camera Control LoRAs**: Optional adapters for cinematic camera movements

### Complementary Components

- **VAE**: `wan21-vae.safetensors` (~243 MB, required)
- **Camera LoRAs** (optional):
  - Rotation LoRA: Orbital camera movements
  - Arc Shot LoRA: Curved dolly movements
  - Drone LoRA: Aerial perspectives

## Technical Notes

### FP16 Precision Characteristics

- **Numerical Range**: ±65,504 (max value)
- **Precision**: ~3-4 decimal digits
- **Advantages**:
  - Maximum generation quality
  - No quantization artifacts
  - Broad hardware support
  - Research standard
- **Trade-offs**:
  - 2x size vs FP8
  - Higher VRAM requirements
  - Slower than FP8 on GPUs with native FP8 tensor-core support

These format properties can be checked directly with PyTorch; see the sketch after the architecture notes below.

### Model Architecture Details

- **Type**: Transformer-based diffusion model
- **Conditioning**: Text and image conditioning
- **Temporal Modeling**: Attention mechanisms across frames
- **Latent Space**: Works in VAE latent space for efficiency
- **Denoising Schedule**: Learned diffusion schedule
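As a quick sanity check of the FP16 figures quoted above (the ±65,504 range and the roughly three significant decimal digits), the format metadata can be read straight from PyTorch. This is a minimal, self-contained sketch and is independent of the WAN model itself.

```python
import torch

# Inspect the IEEE 754 half-precision (FP16) format used for these weights.
info = torch.finfo(torch.float16)
print(f"max value: {info.max}")        # 65504.0, matching the ±65,504 range above
print(f"min positive normal: {info.tiny}")
print(f"machine epsilon: {info.eps}")  # ~0.000977, i.e. roughly 3 decimal digits

# Demonstrate the precision limit: nearby values collapse to the same FP16 number.
x = torch.tensor([1.0000, 1.0004, 1.0010], dtype=torch.float32)
print(x.to(torch.float16))             # 1.0000 and 1.0004 round to the same value
```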
### Compatibility Notes

- Requires PyTorch with FP16 support (all modern versions)
- Compatible with CUDA compute capability 6.0+ (Pascal and newer)
- Works with mixed precision training/inference
- Supports gradient checkpointing for memory efficiency

## Troubleshooting

### Out of Memory Errors

1. Enable attention slicing: `pipe.enable_attention_slicing(1)`
2. Enable VAE slicing: `pipe.enable_vae_slicing()`
3. Reduce frame count to 16-24 frames
4. Reduce inference steps to 30-40
5. Consider using the 480p variant or FP8 quantized model

### Slow Generation Speed

1. Ensure the model is on GPU: `pipe.to("cuda")`
2. Use FP16 dtype: `torch_dtype=torch.float16`
3. Reduce inference steps (minimal quality impact at 30-40 steps)
4. Use a faster scheduler (DPM-Solver++ or DDIM)
5. Consider the FP8 variant for production deployment

### Quality Issues

1. Increase inference steps (50-80 for maximum quality)
2. Adjust guidance scale (7.0-8.5 recommended range)
3. Use more descriptive prompts with motion details
4. Ensure the proper VAE is loaded
5. Check input image quality and resolution

## Changelog

### Version v1.0 (Initial Release)

- Initial README creation for WAN 2.1 720p FP16 model
- Comprehensive documentation of model specifications
- Usage examples with memory optimization
- Hardware requirements and performance tips
- Troubleshooting guide and compatibility notes

---

**Model Status**: Production-ready for research and high-quality video generation

**Last Updated**: 2025-10-13

**Maintained By**: Community documentation (unofficial)

**Ethical Use**: Please use this model responsibly and in accordance with ethical AI guidelines. Be mindful of:
- Content authenticity and disclosure when using AI-generated videos
- Respect for intellectual property and likeness rights
- Potential misuse for deepfakes or misleading content
- Environmental impact of large model inference

For questions, issues, or contributions to this documentation, please refer to the Hugging Face community forums and official WAN project resources.