WAN 2.2 FP32 Text Encoders (GGUF)
High-precision FP32 text encoder models in GGUF format for the WAN (World Animation Network) 2.2 video generation system. This repository contains the UMT5-XXL encoder optimized for text-to-video generation tasks.
Model Description
The WAN 2.2 FP32 encoders provide maximum precision text understanding for video generation workflows. The UMT5-XXL (Unified Multilingual T5) encoder processes text prompts and converts them into high-dimensional embeddings that guide the video generation process.
Key Features:
- Full FP32 precision for maximum text understanding accuracy
- GGUF format for efficient loading and memory management
- Optimized for WAN 2.2 video generation pipeline
- Supports complex, detailed text prompts for video generation
- Designed for integration with the diffusers library
Repository Contents
wan22-fp32-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f32.gguf (22 GB)
Total Repository Size: ~22 GB
File Details
| File | Size | Format | Precision | Purpose |
|---|---|---|---|---|
| text_encoders/umt5-xxl-encoder-f32.gguf | 22 GB | GGUF | FP32 | Text prompt encoding |
Hardware Requirements
Minimum Requirements
- VRAM: 24 GB (for encoder alone)
- RAM: 32 GB system memory
- Disk Space: 25 GB free space
- GPU: NVIDIA RTX 4090, A5000, or equivalent
Recommended Requirements
- VRAM: 32+ GB (for complete WAN pipeline)
- RAM: 64 GB system memory
- Disk Space: 50+ GB for complete WAN setup
- GPU: NVIDIA RTX 6000 Ada, A6000, or H100
Performance Notes
- FP32 precision requires significantly more VRAM than FP16/FP8 variants
- Consider using lower precision encoders (FP16/FP8) if VRAM is limited
- Full precision provides best text understanding but with higher memory cost
Usage Examples
Loading with Diffusers
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the WAN pipeline in FP32 with the custom pipeline class
pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
    custom_pipeline="wan_pipeline",
)

# Load the FP32 text encoder from a local path.
# Note: load_text_encoder() is not part of the base DiffusionPipeline API;
# it is assumed to be provided by the custom WAN pipeline.
pipe.load_text_encoder(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    precision="fp32",
)
pipe = pipe.to("cuda")

# Generate video with high-precision text understanding
prompt = "A cinematic shot of a tiger running through a dense jungle at sunset, dynamic camera following action"
video_frames = pipe(
    prompt=prompt,
    num_frames=120,          # 5 seconds at 24 fps
    height=720,
    width=1280,
    guidance_scale=7.5,
    num_inference_steps=50,
).frames

# Save the frames as an MP4 file
export_to_video(video_frames, "output.mp4", fps=24)
Advanced Usage with Custom Pipeline
import torch
from transformers import AutoTokenizer

# Placeholder import: gguf_loader / load_gguf_model stand in for whichever
# GGUF-compatible loader you use; substitute that loader's actual API.
from gguf_loader import load_gguf_model

# Tokenizer matching the UMT5-XXL encoder
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")

# Load the GGUF encoder directly for use in a custom pipeline
encoder = load_gguf_model(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    device="cuda",
    torch_dtype=torch.float32,
)

# Encode the text prompt, truncating to the 512-token context length
prompt = "Epic aerial view of a futuristic city at night with neon lights"
inputs = tokenizer(
    prompt, return_tensors="pt", padding=True, truncation=True, max_length=512
).to("cuda")

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    text_embeddings = encoder(**inputs).last_hidden_state

# Use embeddings in video generation pipeline
# ... (integrate with WAN diffusion model)
Memory-Efficient Loading
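import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
)

# Load the FP32 text encoder before enabling offload, so the offload hooks
# are installed on the encoder that is actually used.
# Note: load_text_encoder() is not part of the base DiffusionPipeline API;
# it is assumed to be provided by the WAN pipeline.
pipe.load_text_encoder(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
)

# Load with CPU offloading for lower VRAM usage
# (do not also call pipe.to("cuda") when offloading is enabled)
pipe.enable_model_cpu_offload()   # Offload to CPU when not in use
pipe.enable_attention_slicing()   # Reduce memory during attention

# Generate with memory optimization
video = pipe(
    "A serene mountain landscape with flowing waterfall",
    num_frames=60,
    height=512,
    width=896,
).frames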
Model Specifications
Architecture Details
| Component | Specification |
|---|---|
| Base Model | UMT5-XXL (Unified Multilingual T5) |
| Precision | FP32 (32-bit floating point) |
| Format | GGUF (GPT-Generated Unified Format) |
| Parameters | ~5.5 billion (encoder only; consistent with 22 GB at 4 bytes per parameter) |
| Context Length | 512 tokens |
| Hidden Size | 4096 dimensions |
| Encoder Layers | 24 transformer layers |
| Attention Heads | 64 attention heads |
Precision Comparison
| Precision | Size | VRAM | Accuracy | Speed |
|---|---|---|---|---|
| FP32 (this model) | 22 GB | 24 GB | Highest | Slower |
| FP16 | 11 GB | 12 GB | High | Medium |
| FP8 | 5.5 GB | 6 GB | Good | Faster |
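The sizes in this table follow from the number of bytes stored per parameter. A minimal sketch of that arithmetic, assuming roughly 5.5 billion encoder parameters (a figure inferred here from the 22 GB FP32 file, not an official count) and decimal gigabytes; the function and constant names are illustrative:

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

# Assumption: ~5.5e9 parameters, inferred from 22 GB / 4 bytes per parameter.
ASSUMED_PARAM_COUNT = 5.5e9


def weights_size_gb(param_count, precision):
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return param_count * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "fp16", "fp8"):
    size = weights_size_gb(ASSUMED_PARAM_COUNT, precision)
    print(f"{precision}: {size:.1f} GB")
```

The VRAM column is slightly higher than the file size in each row because activations and framework overhead sit on top of the weights.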
GGUF Format Benefits
- Efficient Loading: Lazy loading and memory mapping support
- Cross-Platform: Compatible with various inference engines
- Optimized Storage: GGUF supports quantized tensor types; this FP32 file stores weights unquantized, so no precision is lost
- Flexibility: Easy integration with custom pipelines
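Because a GGUF file begins with a small fixed-size header, a loader can inspect it without reading any tensor data. The sketch below reads only that header using the standard library. It assumes the GGUF v2+ layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count, little-endian); the function name is illustrative, and the demo runs on a tiny synthetic file rather than the real 22 GB encoder.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"
# Assumed GGUF v2+ header: magic, version, tensor count, metadata KV count.
HEADER_STRUCT = struct.Struct("<4sIQQ")


def read_gguf_header(path):
    """Read only the fixed-size GGUF header, leaving tensor data untouched."""
    with open(path, "rb") as f:
        raw = f.read(HEADER_STRUCT.size)
    if len(raw) < HEADER_STRUCT.size:
        raise ValueError("file is too small to contain a GGUF header")
    magic, version, tensor_count, kv_count = HEADER_STRUCT.unpack(raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Demo on a synthetic header so the sketch runs without the real model file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.gguf"
    demo_file.write_bytes(HEADER_STRUCT.pack(GGUF_MAGIC, 3, 2, 5))
    demo_header = read_gguf_header(demo_file)
    print(demo_header)
```

Pointing `read_gguf_header` at the real encoder file gives a quick sanity check that the download is a GGUF file before attempting a full load.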
Performance Tips and Optimization
Memory Optimization
- Use CPU Offloading: Enable enable_model_cpu_offload() for lower VRAM
- Attention Slicing: Use enable_attention_slicing() to reduce memory peaks
- VAE Tiling: For long videos, enable VAE tiling to process in chunks
- Batch Size: Keep batch size to 1 for FP32 encoder on 24GB VRAM
Quality vs Performance
- Maximum Quality: FP32 encoder + FP32 diffusion model (requires 48+ GB VRAM)
- Balanced: FP32 encoder + FP16 diffusion model (requires 32 GB VRAM)
- Efficient: FP16 encoder + FP16 diffusion model (requires 16 GB VRAM)
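The three tiers above can be turned into a simple lookup. This is a sketch only: the thresholds are the VRAM figures listed above, while the function name and the returned dictionary shape are illustrative assumptions.

```python
# (minimum VRAM in GB, tier name, encoder precision, diffusion model precision)
# Thresholds are taken from the list above, ordered from highest to lowest.
TIERS = [
    (48, "Maximum Quality", "fp32", "fp32"),
    (32, "Balanced", "fp32", "fp16"),
    (16, "Efficient", "fp16", "fp16"),
]


def pick_tier(vram_gb):
    """Return the highest tier that fits in vram_gb, or None if none fits."""
    for min_vram, name, encoder_precision, diffusion_precision in TIERS:
        if vram_gb >= min_vram:
            return {
                "name": name,
                "encoder": encoder_precision,
                "diffusion": diffusion_precision,
            }
    return None


for vram in (80, 32, 24, 8):
    print(vram, "GB ->", pick_tier(vram))
```

With 24 GB, for example, the lookup lands on the Efficient tier, since the Balanced tier needs 32 GB.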
Prompt Engineering Tips
- Detailed Descriptions: FP32 precision excels with complex, detailed prompts
- Cinematic Language: Use film terminology for better camera control
- Scene Composition: Describe foreground, midground, background elements
- Motion Description: Specify camera movement and subject actions clearly
- Lighting Details: Describe lighting conditions for enhanced visual quality
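One way to apply these tips consistently is to assemble prompts from named parts. The helper below is a hypothetical convenience, not part of WAN or diffusers; its field names simply mirror the tips above.

```python
def build_prompt(subject, foreground="", midground="", background="",
                 camera="", lighting=""):
    """Join the non-empty parts into one comma-separated prompt string."""
    parts = [subject, foreground, midground, background, camera, lighting]
    return ", ".join(p.strip() for p in parts if p and p.strip())


example_prompt = build_prompt(
    subject="a tiger running through a dense jungle",
    background="sunset sky behind the canopy",
    camera="tracking shot following the action",
    lighting="warm golden-hour light",
)
print(example_prompt)
```

Keeping the parts separate makes it easy to vary one element, such as the camera movement, while holding the rest of the scene fixed.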
Recommended Settings
# High-quality video generation settings
generation_config = {
    "num_frames": 120,          # 5 seconds at 24fps
    "height": 720,              # 720p resolution
    "width": 1280,              # 16:9 aspect ratio
    "guidance_scale": 7.5,      # Balanced prompt adherence
    "num_inference_steps": 50,  # High quality (slower)
    "fps": 24,                  # Cinematic frame rate
}

# Fast preview settings
preview_config = {
    "num_frames": 60,           # 2.5 seconds at 24fps
    "height": 512,              # Lower resolution
    "width": 896,               # 7:4 aspect ratio (close to 16:9)
    "guidance_scale": 7.0,      # Slightly lower
    "num_inference_steps": 25,  # Faster generation
    "fps": 24,
}
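The duration and aspect-ratio figures in these settings are simple arithmetic, sketched below with the standard library. The function name is illustrative. Note that 896x512 reduces to 7:4, which is close to but not exactly 16:9.

```python
from fractions import Fraction


def clip_summary(num_frames, fps, width, height):
    """Return the clip duration in seconds and the reduced aspect ratio."""
    ratio = Fraction(width, height)  # Fraction reduces to lowest terms
    return {
        "duration_s": num_frames / fps,
        "aspect": f"{ratio.numerator}:{ratio.denominator}",
    }


high_quality = clip_summary(num_frames=120, fps=24, width=1280, height=720)
preview = clip_summary(num_frames=60, fps=24, width=896, height=512)
print(high_quality)
print(preview)
```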
License
This model is released under the WAN License. Please review the license terms before use:
- Non-Commercial Use: Permitted for research and personal projects
- Commercial Use: Requires separate licensing agreement
- Attribution: Required in derivative works
- Redistribution: Allowed with proper attribution and license inclusion
For commercial licensing inquiries, please contact the WAN development team.
Citation
If you use these encoders in your research or projects, please cite:
@misc{wan22-fp32-encoders,
title={WAN 2.2 FP32 Text Encoders for Video Generation},
author={WAN Team},
year={2024},
howpublished={\url{https://huggingface.co/wan/wan-2.2-fp32-encoders-gguf}},
note={UMT5-XXL text encoder in GGUF format for high-precision video generation}
}
Related Resources
Official Links
- WAN Homepage: https://world-animation.net
- Model Card: https://huggingface.co/wan/wan-2.2
- Documentation: https://docs.world-animation.net
- Paper: "WAN: World Animation Network for Text-to-Video Generation"
Related Models
- WAN 2.2 Base: Complete video generation model
- WAN 2.2 FP16 Encoders: Lower precision for reduced VRAM usage
- WAN 2.2 VAE: Video autoencoder for latent space processing
- WAN Camera LoRAs: Camera control enhancement modules
Community
- Discord: WAN Community Server
- GitHub: https://github.com/wan-team/wan
- Forums: https://discuss.world-animation.net
Troubleshooting
Common Issues
Out of Memory (OOM) Errors:
- Reduce resolution (720p → 512p)
- Lower frame count (120 → 60 frames)
- Enable CPU offloading and attention slicing
- Consider using FP16 encoder variant instead
Slow Generation Speed:
- FP32 is inherently slower than FP16/FP8
- Reduce num_inference_steps (50 → 30)
- Use smaller resolution for previews
- Ensure CUDA is properly installed and utilized
Loading Errors:
- Verify GGUF loader compatibility
- Check file integrity (22 GB expected size)
- Ensure sufficient disk space and RAM
- Update diffusers and transformers libraries
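The file-size and disk-space checks above can be scripted before attempting a load. This is a minimal standard-library sketch: the 5% tolerance is an assumption (the 22 GB figure is rounded, and tools differ on GB versus GiB), and the function names are illustrative. The demo uses a small temporary file in place of the real encoder.

```python
import os
import shutil
import tempfile

EXPECTED_SIZE_GB = 22  # size listed in this README (rounded)


def size_matches(path, expected_gb, tolerance=0.05):
    """True if the file size is within a fractional tolerance of expected_gb."""
    actual_gb = os.path.getsize(path) / 1e9
    return abs(actual_gb - expected_gb) <= expected_gb * tolerance


def has_free_space(directory, required_gb):
    """True if the filesystem holding directory has at least required_gb free."""
    return shutil.disk_usage(directory).free / 1e9 >= required_gb


# Demo on a 1000-byte temporary file so the sketch runs without the real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gguf")
    with open(demo_path, "wb") as f:
        f.write(b"\0" * 1000)
    demo_ok = size_matches(demo_path, expected_gb=1000 / 1e9)
    demo_too_small = size_matches(demo_path, expected_gb=EXPECTED_SIZE_GB)
    demo_space = has_free_space(tmp, required_gb=0)

print(demo_ok, demo_too_small, demo_space)
```

For the real file, call `size_matches(path, EXPECTED_SIZE_GB)` and `has_free_space(download_dir, 25)`, where 25 GB is the minimum disk space listed under Hardware Requirements.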
Quality Issues:
- Increase guidance_scale (7.5 → 9.0) for stronger prompt adherence
- Use more detailed, descriptive prompts
- Increase num_inference_steps for better quality
- Check that FP32 precision is actually being used
Support
For issues, questions, or contributions:
- Issues: GitHub Issues
- Discussions: Hugging Face Discussions
- Email: [email protected]
Model Version: 2.2
Last Updated: 2024-08-12
README Version: v1.0
Maintained by: WAN Development Team