WAN 2.2 FP32 Text Encoders (GGUF)

High-precision FP32 text encoder models in GGUF format for the WAN 2.2 video generation system. This repository contains the UMT5-XXL encoder for text-to-video generation tasks.

Model Description

The WAN 2.2 FP32 encoders provide maximum precision text understanding for video generation workflows. The UMT5-XXL (Unified Multilingual T5) encoder processes text prompts and converts them into high-dimensional embeddings that guide the video generation process.

Key Features:

  • Full FP32 precision for maximum text understanding accuracy
  • GGUF format for efficient loading and memory management
  • Optimized for WAN 2.2 video generation pipeline
  • Supports complex, detailed text prompts for video generation
  • Integrates with diffusers-based pipelines

Repository Contents

wan22-fp32-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f32.gguf (22 GB)

Total Repository Size: ~22 GB

File Details

File                                      Size   Format  Precision  Purpose
text_encoders/umt5-xxl-encoder-f32.gguf   22 GB  GGUF    FP32       Text prompt encoding

Hardware Requirements

Minimum Requirements

  • VRAM: 24 GB (for encoder alone)
  • RAM: 32 GB system memory
  • Disk Space: 25 GB free space
  • GPU: NVIDIA RTX 4090, A5000, or equivalent

Recommended Requirements

  • VRAM: 32+ GB (for complete WAN pipeline)
  • RAM: 64 GB system memory
  • Disk Space: 50+ GB for complete WAN setup
  • GPU: NVIDIA RTX 6000 Ada, A6000, or H100

Performance Notes

  • FP32 precision requires significantly more VRAM than FP16/FP8 variants
  • Consider a lower-precision encoder (FP16/FP8) if VRAM is limited
  • Full precision gives the best text understanding at the highest memory cost (see the sketch below for the arithmetic)
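
The arithmetic behind these numbers is simple: weight memory is the parameter count times the bytes per parameter. A minimal sketch, assuming the ~5.7B-parameter encoder size listed under Model Specifications:

# Weight-memory estimate: parameters x bytes per parameter.
# Activations and framework overhead come on top of this.
PARAMS = 5.7e9  # approximate UMT5-XXL encoder parameter count (assumption)

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")

# FP32: ~21.2 GiB, consistent with the 22 GB file in this repository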

Usage Examples

Loading with Diffusers

from diffusers import DiffusionPipeline
import torch

# Load the WAN pipeline with a custom FP32 encoder.
# NOTE: "wan/wan-2.2" and "wan_pipeline" are the identifiers used in this
# card; substitute the actual repository and custom pipeline names.
pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
    custom_pipeline="wan_pipeline"
)

# Load the FP32 text encoder from a local path.
# NOTE: load_text_encoder is a method of the custom WAN pipeline,
# not part of the standard DiffusionPipeline API.
pipe.load_text_encoder(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    precision="fp32"
)

pipe = pipe.to("cuda")

# Generate video with high-precision text understanding
prompt = "A cinematic shot of a tiger running through a dense jungle at sunset, dynamic camera following action"
video_frames = pipe(
    prompt=prompt,
    num_frames=120,
    height=720,
    width=1280,
    guidance_scale=7.5,
    num_inference_steps=50
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video_frames, "output.mp4", fps=24)

Advanced Usage with Custom Pipeline

from transformers import AutoTokenizer
import torch

# Direct encoder loading for custom pipelines
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")

# Load the GGUF encoder. NOTE: gguf_loader is a stand-in for whatever
# GGUF-capable loader your stack provides (e.g. ComfyUI-GGUF, or the
# transformers gguf_file route sketched after this example); it is not
# a published package under this name.
from gguf_loader import load_gguf_model

encoder = load_gguf_model(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    device="cuda",
    torch_dtype=torch.float32
)

# Encode text prompt
prompt = "Epic aerial view of a futuristic city at night with neon lights"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    text_embeddings = encoder(**inputs).last_hidden_state

# Use embeddings in video generation pipeline
# ... (integrate with WAN diffusion model)
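
If your transformers version includes GGUF support for T5-family models, the gguf_file argument of from_pretrained may serve as the loader; this is a sketch under that assumption, not verified against this specific file:

from transformers import T5EncoderModel

# Assumes a transformers release whose GGUF integration covers T5-family
# encoders; point from_pretrained at the directory containing the file.
encoder = T5EncoderModel.from_pretrained(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders",
    gguf_file="umt5-xxl-encoder-f32.gguf",
).to("cuda")

Either way, the resulting embeddings carry the encoder's hidden size of 4096 (see Model Specifications), i.e. shape (batch, sequence_length, 4096).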

Memory-Efficient Loading

import torch
from diffusers import DiffusionPipeline

# Load with CPU offloading for lower VRAM usage
pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
    custom_pipeline="wan_pipeline"  # same custom pipeline as in the first example
)

pipe.enable_model_cpu_offload()  # offload idle submodules to CPU when not in use
pipe.enable_attention_slicing()  # reduce peak memory during attention

pipe.load_text_encoder(  # WAN-pipeline-specific helper; see the note above
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
)

# Generate with memory optimization
video = pipe(
    "A serene mountain landscape with flowing waterfall",
    num_frames=60,
    height=512,
    width=896
).frames

Model Specifications

Architecture Details

Component         Specification
Base Model        UMT5-XXL (Unified Multilingual T5)
Precision         FP32 (32-bit floating point)
Format            GGUF (GPT-Generated Unified Format)
Parameters        ~5.7 billion (encoder only; 22 GB at 4 bytes/parameter)
Context Length    512 tokens
Hidden Size       4096 dimensions
Encoder Layers    24 transformer layers
Attention Heads   64 attention heads

Precision Comparison

Precision           Size     VRAM   Accuracy  Speed
FP32 (this model)   22 GB    24 GB  Highest   Slower
FP16                11 GB    12 GB  High      Medium
FP8                 5.5 GB   6 GB   Good      Faster

GGUF Format Benefits

  • Efficient Loading: Lazy loading and memory mapping support
  • Cross-Platform: Compatible with various inference engines
  • Optimized Storage: Compressed tensor storage with minimal quality loss
  • Flexibility: Easy integration with custom pipelines (see the inspection sketch below)
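
As a concrete illustration of the format, the gguf Python package from the llama.cpp project can memory-map a file and read its header without loading any tensors. A minimal inspection sketch, assuming pip install gguf:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
)

# Header metadata keys: architecture, tensor count, tokenizer info, etc.
for key in reader.fields:
    print(key)

# First few tensors with their shapes and storage types
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)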

Performance Tips and Optimization

Memory Optimization

  1. Use CPU Offloading: Enable enable_model_cpu_offload() for lower VRAM
  2. Attention Slicing: Use enable_attention_slicing() to reduce memory peaks
  3. VAE Tiling: For long videos, enable VAE tiling to process in chunks
  4. Batch Size: Keep the batch size at 1 for the FP32 encoder on 24 GB VRAM (these optimizations are combined in the sketch below)
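
A sketch combining the four settings above on an already-constructed pipeline; enable_vae_tiling() exists on diffusers pipelines that expose a VAE, and whether the WAN pipeline provides it is an assumption here:

# Combine the memory optimizations on an existing pipeline object.
pipe.enable_model_cpu_offload()  # 1. offload idle submodules to CPU
pipe.enable_attention_slicing()  # 2. compute attention in slices
pipe.enable_vae_tiling()         # 3. decode long videos tile by tile (if supported)

# 4. keep the batch size at 1: generate a single prompt per call
video = pipe("A serene mountain landscape", num_frames=60).frames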

Quality vs Performance

  • Maximum Quality: FP32 encoder + FP32 diffusion model (requires 48+ GB VRAM)
  • Balanced: FP32 encoder + FP16 diffusion model (requires 32 GB VRAM; sketched below)
  • Efficient: FP16 encoder + FP16 diffusion model (requires 16 GB VRAM)
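
The Balanced configuration can be sketched with plain torch dtype casts, assuming a diffusers-style pipeline that exposes its prompt encoder and diffusion backbone as .text_encoder and .transformer (attribute names vary by pipeline):

import torch

# Diffusion backbone in half precision, prompt encoding in full precision.
pipe.transformer.to(torch.float16)
pipe.text_encoder.to(torch.float32)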

Prompt Engineering Tips

  • Detailed Descriptions: FP32 precision excels with complex, detailed prompts
  • Cinematic Language: Use film terminology for better camera control
  • Scene Composition: Describe foreground, midground, background elements
  • Motion Description: Specify camera movement and subject actions clearly
  • Lighting Details: Describe lighting conditions for enhanced visual quality (an example combining these tips follows below)
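
Combining these tips, an illustrative prompt might read:

# Illustrative prompt layering composition, camera language, motion,
# and lighting, following the tips above.
prompt = (
    "Cinematic wide shot: a lone lighthouse on a rocky cliff in the "
    "foreground, waves crashing in the midground, storm clouds on the "
    "horizon. Slow dolly-in toward the tower as warm lamp light cuts "
    "through the cold blue dusk."
)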

Recommended Settings

# High-quality video generation settings
generation_config = {
    "num_frames": 120,           # 5 seconds at 24fps
    "height": 720,               # 720p resolution
    "width": 1280,               # 16:9 aspect ratio
    "guidance_scale": 7.5,       # Balanced prompt adherence
    "num_inference_steps": 50,   # High quality (slower)
    "fps": 24,                   # Cinematic frame rate
}

# Fast preview settings
preview_config = {
    "num_frames": 60,            # 2.5 seconds at 24fps
    "height": 512,               # Lower resolution
    "width": 896,                # 16:9 aspect ratio
    "guidance_scale": 7.0,       # Slightly lower
    "num_inference_steps": 25,   # Faster generation
    "fps": 24,
}
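
To apply one of these dictionaries, note that fps is an export-time setting rather than a sampling argument in the examples above, so split it off before the call (assuming the pipeline accepts the remaining keys as keyword arguments):

from diffusers.utils import export_to_video

config = dict(generation_config)  # or preview_config
fps = config.pop("fps")           # fps belongs to export, not sampling

video_frames = pipe(prompt, **config).frames
export_to_video(video_frames, "output.mp4", fps=fps)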

License

This model is released under the WAN License. Please review the license terms before use:

  • Non-Commercial Use: Permitted for research and personal projects
  • Commercial Use: Requires separate licensing agreement
  • Attribution: Required in derivative works
  • Redistribution: Allowed with proper attribution and license inclusion

For commercial licensing inquiries, please contact the WAN development team.

Citation

If you use these encoders in your research or projects, please cite:

@misc{wan22-fp32-encoders,
  title={WAN 2.2 FP32 Text Encoders for Video Generation},
  author={WAN Team},
  year={2024},
  howpublished={\url{https://huggingface.co/wan/wan-2.2-fp32-encoders-gguf}},
  note={UMT5-XXL text encoder in GGUF format for high-precision video generation}
}

Related Resources

Related Models

  • WAN 2.2 Base: Complete video generation model
  • WAN 2.2 FP16 Encoders: Lower precision for reduced VRAM usage
  • WAN 2.2 VAE: Video autoencoder for latent space processing
  • WAN Camera LoRAs: Camera control enhancement modules

Troubleshooting

Common Issues

Out of Memory (OOM) Errors:

  • Reduce resolution (e.g., 1280×720 → 896×512)
  • Lower frame count (120 β†’ 60 frames)
  • Enable CPU offloading and attention slicing
  • Consider using FP16 encoder variant instead

Slow Generation Speed:

  • FP32 is inherently slower than FP16/FP8
  • Reduce num_inference_steps (50 β†’ 30)
  • Use smaller resolution for previews
  • Ensure CUDA is properly installed and utilized

Loading Errors:

  • Verify GGUF loader compatibility
  • Check file integrity (22 GB expected size; see the size check below)
  • Ensure sufficient disk space and RAM
  • Update diffusers and transformers libraries
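
A quick size check on the download; a complete FP32 encoder file should come to roughly 21-22 GiB:

import os

path = "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
print(f"{os.path.getsize(path) / 2**30:.1f} GiB")  # expect roughly 21-22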

Quality Issues:

  • Increase guidance_scale (7.5 β†’ 9.0) for stronger prompt adherence
  • Use more detailed, descriptive prompts
  • Increase num_inference_steps for better quality
  • Check that FP32 precision is actually being used



Model Version: 2.2 | Last Updated: 2024-08-12 | README Version: v1.0 | Maintained by: WAN Development Team
