---
license: other
library_name: diffusers
pipeline_tag: image-to-video
tags:
- wan
- image-to-video
- video-generation
---

# WAN 2.1 FP16 - 720p Image-to-Video Model

High-fidelity 720p image-to-video generation model in full FP16 precision for maximum quality output.

## Model Description

WAN 2.1 is a state-of-the-art transformer-based diffusion model for image-to-video generation. This repository contains the **720p variant** in full FP16 (16-bit floating point) precision, providing the highest quality video generation with enhanced detail and clarity. The model transforms static images into dynamic video sequences with temporal consistency and cinematic quality.

**Key Capabilities**:
- Image-to-video generation at 720p resolution
- 14 billion parameter transformer architecture
- Full FP16 precision for maximum generation quality
- High temporal consistency across frames
- Compatible with camera control LoRAs (available separately)

## Repository Contents

```
wan21-fp16-720p/
└── diffusion_models/
    └── wan/
        └── wan21-i2v-720p-14b-fp16.safetensors  (31 GB)
```

**Total Repository Size**: ~31 GB

### Model Files

| File | Size | Description |
|------|------|-------------|
| `diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors` | 31 GB | WAN 2.1 Image-to-Video 720p transformer model (14B parameters, FP16) |

**Note**: This repository contains **only the 720p diffusion model**. For complete functionality, you will also need:
- **WAN 2.1 VAE** (available separately, ~243 MB)
- **Camera Control LoRAs** (optional, for cinematic camera movements, ~343 MB each)

## Hardware Requirements

### Minimum Requirements

- **VRAM**: 40GB+ (FP16 full precision at 720p)
- **Disk Space**: 31 GB for model storage
- **System RAM**: 32GB+ recommended
- **GPU**: High-end NVIDIA GPU with 40GB+ VRAM
  - Recommended: RTX A6000 (48GB), A100 (40/80GB)
  - Alternative: RTX 4090 (24GB) with memory optimizations

### Recommended Hardware

- **GPU**: NVIDIA A6000 48GB, A100 40/80GB, or RTX 6000 Ada
- **System RAM**: 64GB for optimal performance
- **Storage**: NVMe SSD for faster model loading
- **VRAM Optimization**: Enable gradient checkpointing and attention slicing for 24GB GPUs
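Before loading a 31 GB FP16 checkpoint, it can help to confirm that the visible GPU actually meets the VRAM requirement above. The snippet below is a minimal sketch using standard PyTorch CUDA queries; the 40 GB threshold simply mirrors the minimum listed in this README and is not enforced by the model itself.

```python
import torch

# Report the total VRAM of the default CUDA device and warn if it falls below
# the 40 GB recommended in this README for full FP16 720p inference.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, total VRAM: {total_gb:.1f} GB")
    if total_gb < 40:
        print("Less than 40 GB VRAM detected: enable attention/VAE slicing and "
              "reduce the frame count, or consider the FP8 / 480p variants.")
else:
    print("No CUDA device detected; this model is impractical on CPU.")
```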
## Usage Examples

### Basic Image-to-Video Generation

```python
from diffusers import DiffusionPipeline, AutoencoderKL
from diffusers.utils import export_to_video
import torch
from PIL import Image

# Load the WAN 2.1 720p FP16 model
pipe = DiffusionPipeline.from_single_file(
    "E:/huggingface/wan21-fp16-720p/diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Load WAN 2.1 VAE (must be downloaded separately)
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"
)

# Move to GPU
pipe.to("cuda")

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Generate video from image
video_frames = pipe(
    image=input_image,
    prompt="cinematic video with smooth motion",
    num_frames=24,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]

# Export video
export_to_video(video_frames, "output_video.mp4", fps=8)
```

### Memory-Optimized Usage (for 24GB GPUs)

```python
from diffusers import DiffusionPipeline, AutoencoderKL
import torch
from PIL import Image

# Load model with memory optimizations
pipe = DiffusionPipeline.from_single_file(
    "E:/huggingface/wan21-fp16-720p/diffusion_models/wan/wan21-i2v-720p-14b-fp16.safetensors",
    torch_dtype=torch.float16,
    use_safetensors=True
)

# Load WAN 2.1 VAE (must be downloaded separately)
pipe.vae = AutoencoderKL.from_single_file(
    "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"
)

# Enable memory-efficient attention
pipe.enable_attention_slicing(1)
pipe.enable_vae_slicing()

# Enable gradient checkpointing on the denoiser (if the pipeline exposes one)
denoiser = getattr(pipe, "unet", None) or getattr(pipe, "transformer", None)
if denoiser is not None and hasattr(denoiser, "enable_gradient_checkpointing"):
    denoiser.enable_gradient_checkpointing()

pipe.to("cuda")

# Load input image
input_image = Image.open("path/to/your/image.jpg")

# Generate with reduced memory footprint
video_frames = pipe(
    image=input_image,
    prompt="your prompt here",
    num_frames=16,           # Reduce frames for lower VRAM
    num_inference_steps=40,
    guidance_scale=7.5
).frames[0]
```

### Using with Camera Control LoRAs (Optional)

```python
# Load camera control LoRA (must be downloaded separately)
pipe.load_lora_weights(
    "E:/huggingface/wan21-loras/loras/wan/wan21-camera-rotation-rank16-v1.safetensors"
)

# Generate video with camera movement
video_frames = pipe(
    image=input_image,
    prompt="rotating camera around the subject, cinematic",
    num_frames=24,
    num_inference_steps=50,
    guidance_scale=7.5
).frames[0]
```

## Model Specifications

- **Architecture**: Transformer-based diffusion model for image-to-video generation
- **Parameters**: 14 billion (14B)
- **Precision**: FP16 (16-bit floating point)
  - 1 sign bit, 5-bit exponent, 10-bit mantissa
  - Full numerical precision for maximum quality
- **Resolution**: 720p (1280x720)
- **Format**: SafeTensors (secure and efficient serialization)
- **Model Type**: Image-to-Video (I2V)
- **Framework Compatibility**: diffusers, PyTorch 2.0+

## Performance Tips

1. **Resolution and Quality**: This 720p model provides maximum detail and clarity but requires significant VRAM
2. **Memory Optimization**:
   - Enable attention slicing: `pipe.enable_attention_slicing(1)`
   - Enable VAE slicing: `pipe.enable_vae_slicing()`
   - Reduce frame count: Use 16-24 frames instead of 32+
3. **Inference Speed**:
   - Use FP16 dtype for optimal GPU utilization
   - Reduce inference steps (30-40) for faster generation with minimal quality loss; see the scheduler sketch after this list
4. **Prompt Engineering**:
   - Be specific about desired motion: "slow panning", "gentle zoom", "smooth transition"
   - Include cinematic keywords: "cinematic", "smooth", "professional"
   - Specify camera movements if using LoRAs: "rotating camera", "aerial view"
5. **Batch Generation**: Process one video at a time due to high VRAM requirements
6. **Storage**: Use NVMe SSD for faster model loading times
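The inference-speed tip above suggests reducing steps, and the troubleshooting section recommends a faster scheduler. The sketch below shows one generic way to do this with the standard diffusers scheduler API; `DPMSolverMultistepScheduler` is an illustrative choice, and it is an assumption, not verified for this checkpoint, that the pipeline's default scheduler config is compatible with it.

```python
from diffusers import DPMSolverMultistepScheduler

# Swap in a faster multistep solver, reusing the existing scheduler's config.
# Assumption: the pipeline loaded above ("pipe") accepts a drop-in scheduler.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Fewer steps are usually sufficient with DPM-Solver++ style schedulers.
video_frames = pipe(
    image=input_image,
    prompt="cinematic video with smooth motion",
    num_frames=24,
    num_inference_steps=30,   # reduced from 50 with minimal quality loss
    guidance_scale=7.5
).frames[0]
```

If quality degrades noticeably at 30 steps, move back toward the 40-50 step range recommended elsewhere in this card.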
## Installation Requirements

```bash
# Install required dependencies
pip install diffusers transformers accelerate safetensors torch torchvision

# For video export functionality
pip install opencv-python imageio imageio-ffmpeg
```

**Python Environment**:
- Python 3.8+
- PyTorch 2.0+ with CUDA support
- diffusers >= 0.21.0
- transformers
- accelerate
- safetensors

## Version Comparison

### WAN 2.1 Precision Variants

| Variant | Size | VRAM | Quality | Speed | Use Case |
|---------|------|------|---------|-------|----------|
| **FP16 (this model)** | 31 GB | 40GB+ | Maximum | Standard | Research, archival quality, maximum fidelity |
| FP8 | 16 GB | 24GB+ | High | Faster | Production deployment, efficient inference |

### Resolution Variants

| Resolution | Model Size | VRAM | Quality | Details |
|------------|------------|------|---------|---------|
| 480p | 31 GB | 32GB+ | High | Balanced quality/performance |
| **720p (this model)** | 31 GB | 40GB+ | Maximum | Enhanced detail and clarity |

**When to use FP16 720p**:
- Maximum quality requirements
- Research and development
- Professional/commercial production
- Archival and reference generation
- GPU with 40GB+ VRAM available

**Consider alternatives if**:
- VRAM is limited to 24GB or less (use FP8 or 480p)
- Inference speed is critical (use FP8)
- Running on consumer GPUs (use FP8 480p)

## License

This model is released under a custom WAN license. Please review the license terms before use:
- Commercial use restrictions may apply
- Attribution requirements may be specified
- Refer to official WAN documentation for complete license terms

**License Type**: `other` (Custom WAN License)

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@software{wan21_fp16_720p,
  title={WAN 2.1 FP16 720p: High-Fidelity Image-to-Video Generation},
  year={2024},
  note={14B parameter transformer-based diffusion model for 720p video generation in full FP16 precision}
}
```

## Related Resources

### Official Resources

- **WAN Project**: Official model documentation and updates
- **Hugging Face Model Hub**: Community-shared models and discussions
- **diffusers Documentation**: https://huggingface.co/docs/diffusers

### Related Models

- **WAN 2.1 480p FP16**: Lower resolution variant with same precision (32GB VRAM)
- **WAN 2.1 FP8**: Quantized models for efficient deployment (24GB VRAM)
- **WAN 2.2**: Next generation with enhanced features and quality improvements
- **WAN 2.1 VAE**: Required for complete functionality (download separately)
- **WAN Camera Control LoRAs**: Optional adapters for cinematic camera movements

### Complementary Components

- **VAE**: `wan21-vae.safetensors` (~243 MB, required)
- **Camera LoRAs** (optional):
  - Rotation LoRA: Orbital camera movements
  - Arc Shot LoRA: Curved dolly movements
  - Drone LoRA: Aerial perspectives

## Technical Notes

### FP16 Precision Characteristics

- **Numerical Range**: ±65,504 (max value)
- **Precision**: ~3-4 decimal digits
- **Advantages**:
  - Maximum generation quality
  - No quantization artifacts
  - Broad hardware support
  - Research standard
- **Trade-offs**:
  - 2x size vs FP8
  - Higher VRAM requirements
  - Slower than FP8 on GPUs with native FP8 tensor-core support

These format properties can be checked directly with PyTorch; see the sketch after the architecture notes below.

### Model Architecture Details

- **Type**: Transformer-based diffusion model
- **Conditioning**: Text and image conditioning
- **Temporal Modeling**: Attention mechanisms across frames
- **Latent Space**: Works in VAE latent space for efficiency
- **Denoising Schedule**: Learned diffusion schedule
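As a quick sanity check of the FP16 figures quoted above (the ±65,504 range and the roughly three significant decimal digits), the format metadata can be read straight from PyTorch. This is a minimal, self-contained sketch and is independent of the WAN model itself.

```python
import torch

# Inspect the IEEE 754 half-precision (FP16) format used for these weights.
info = torch.finfo(torch.float16)
print(f"max value: {info.max}")        # 65504.0, matching the ±65,504 range above
print(f"min positive normal: {info.tiny}")
print(f"machine epsilon: {info.eps}")  # ~0.000977, i.e. roughly 3 decimal digits

# Demonstrate the precision limit: nearby values collapse to the same FP16 number.
x = torch.tensor([1.0000, 1.0004, 1.0010], dtype=torch.float32)
print(x.to(torch.float16))             # 1.0000 and 1.0004 round to the same value
```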
### Compatibility Notes

- Requires PyTorch with FP16 support (all modern versions)
- Compatible with CUDA compute capability 6.0+ (Pascal and newer)
- Works with mixed precision training/inference
- Supports gradient checkpointing for memory efficiency

## Troubleshooting

### Out of Memory Errors

1. Enable attention slicing: `pipe.enable_attention_slicing(1)`
2. Enable VAE slicing: `pipe.enable_vae_slicing()`
3. Reduce frame count to 16-24 frames
4. Reduce inference steps to 30-40
5. Consider using the 480p variant or FP8 quantized model

### Slow Generation Speed

1. Ensure the model is on GPU: `pipe.to("cuda")`
2. Use FP16 dtype: `torch_dtype=torch.float16`
3. Reduce inference steps (minimal quality impact at 30-40 steps)
4. Use a faster scheduler (DPM-Solver++ or DDIM)
5. Consider the FP8 variant for production deployment

### Quality Issues

1. Increase inference steps (50-80 for maximum quality)
2. Adjust guidance scale (7.0-8.5 recommended range)
3. Use more descriptive prompts with motion details
4. Ensure the proper VAE is loaded
5. Check input image quality and resolution

## Changelog

### Version v1.0 (Initial Release)

- Initial README creation for WAN 2.1 720p FP16 model
- Comprehensive documentation of model specifications
- Usage examples with memory optimization
- Hardware requirements and performance tips
- Troubleshooting guide and compatibility notes

---

**Model Status**: Production-ready for research and high-quality video generation

**Last Updated**: 2025-10-13

**Maintained By**: Community documentation (unofficial)

**Ethical Use**: Please use this model responsibly and in accordance with ethical AI guidelines. Be mindful of:
- Content authenticity and disclosure when using AI-generated videos
- Respect for intellectual property and likeness rights
- Potential misuse for deepfakes or misleading content
- Environmental impact of large model inference

For questions, issues, or contributions to this documentation, please refer to the Hugging Face community forums and official WAN project resources.