WAN 2.2 FP32 Text Encoders (GGUF)

High-precision FP32 text encoder models in GGUF format for the WAN 2.2 video generation system. This repository contains the UMT5-XXL encoder for text-to-video generation tasks.

Model Description

The WAN 2.2 FP32 encoders provide maximum precision text understanding for video generation workflows. The UMT5-XXL (Unified Multilingual T5) encoder processes text prompts and converts them into high-dimensional embeddings that guide the video generation process.

Key Features:

  • Full FP32 precision for maximum text understanding accuracy
  • GGUF format for efficient loading and memory management
  • Optimized for WAN 2.2 video generation pipeline
  • Supports complex, detailed text prompts for video generation
  • Integrates with diffusers-based pipelines

Repository Contents

wan22-fp32-encoders-gguf/
└── text_encoders/
    └── umt5-xxl-encoder-f32.gguf (22 GB)

Total Repository Size: ~22 GB

File Details

File                                      Size   Format  Precision  Purpose
text_encoders/umt5-xxl-encoder-f32.gguf   22 GB  GGUF    FP32       Text prompt encoding

Hardware Requirements

Minimum Requirements

  • VRAM: 24 GB (for encoder alone)
  • RAM: 32 GB system memory
  • Disk Space: 25 GB free space
  • GPU: NVIDIA RTX 4090, A5000, or equivalent

Recommended Requirements

  • VRAM: 32+ GB (for complete WAN pipeline)
  • RAM: 64 GB system memory
  • Disk Space: 50+ GB for complete WAN setup
  • GPU: NVIDIA RTX 6000 Ada, A6000, or H100

Performance Notes

  • FP32 precision requires significantly more VRAM than FP16/FP8 variants
  • Consider a lower-precision encoder (FP16/FP8) if VRAM is limited
  • Full precision gives the best text understanding at the highest memory cost (see the sketch below for the arithmetic)
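
The arithmetic behind these numbers is simple: weight memory is the parameter count times the bytes per parameter. A minimal sketch, assuming the ~5.7B-parameter encoder size listed under Model Specifications:

# Weight-memory estimate: parameters x bytes per parameter.
# Activations and framework overhead come on top of this.
PARAMS = 5.7e9  # approximate UMT5-XXL encoder parameter count (assumption)

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.1f} GiB of weights")

# FP32: ~21.2 GiB, consistent with the 22 GB file in this repository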

Usage Examples

Loading with Diffusers

from diffusers import DiffusionPipeline
import torch

# Load the WAN pipeline with a custom FP32 encoder.
# NOTE: "wan/wan-2.2" and "wan_pipeline" are the identifiers used in this
# card; substitute the actual repository and custom pipeline names.
pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
    custom_pipeline="wan_pipeline"
)

# Load the FP32 text encoder from a local path.
# NOTE: load_text_encoder is a method of the custom WAN pipeline,
# not part of the standard DiffusionPipeline API.
pipe.load_text_encoder(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    precision="fp32"
)

pipe = pipe.to("cuda")

# Generate video with high-precision text understanding
prompt = "A cinematic shot of a tiger running through a dense jungle at sunset, dynamic camera following action"
video_frames = pipe(
    prompt=prompt,
    num_frames=120,
    height=720,
    width=1280,
    guidance_scale=7.5,
    num_inference_steps=50
).frames

# Save video
from diffusers.utils import export_to_video
export_to_video(video_frames, "output.mp4", fps=24)

Advanced Usage with Custom Pipeline

from transformers import AutoTokenizer
import torch

# Direct encoder loading for custom pipelines
tokenizer = AutoTokenizer.from_pretrained("google/umt5-xxl")

# Load the GGUF encoder. NOTE: gguf_loader is a stand-in for whatever
# GGUF-capable loader your stack provides (e.g. ComfyUI-GGUF, or the
# transformers gguf_file route sketched after this example); it is not
# a published package under this name.
from gguf_loader import load_gguf_model

encoder = load_gguf_model(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf",
    device="cuda",
    torch_dtype=torch.float32
)

# Encode text prompt
prompt = "Epic aerial view of a futuristic city at night with neon lights"
inputs = tokenizer(prompt, return_tensors="pt", padding=True).to("cuda")

with torch.no_grad():
    text_embeddings = encoder(**inputs).last_hidden_state

# Use embeddings in video generation pipeline
# ... (integrate with WAN diffusion model)
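
If your transformers version includes GGUF support for T5-family models, the gguf_file argument of from_pretrained may serve as the loader; this is a sketch under that assumption, not verified against this specific file:

from transformers import T5EncoderModel

# Assumes a transformers release whose GGUF integration covers T5-family
# encoders; point from_pretrained at the directory containing the file.
encoder = T5EncoderModel.from_pretrained(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders",
    gguf_file="umt5-xxl-encoder-f32.gguf",
).to("cuda")

Either way, the resulting embeddings carry the encoder's hidden size of 4096 (see Model Specifications), i.e. shape (batch, sequence_length, 4096).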

Memory-Efficient Loading

import torch
from diffusers import DiffusionPipeline

# Load with CPU offloading for lower VRAM usage
pipe = DiffusionPipeline.from_pretrained(
    "wan/wan-2.2",
    torch_dtype=torch.float32,
    custom_pipeline="wan_pipeline"  # same custom pipeline as in the first example
)

pipe.enable_model_cpu_offload()  # offload idle submodules to CPU when not in use
pipe.enable_attention_slicing()  # reduce peak memory during attention

pipe.load_text_encoder(  # WAN-pipeline-specific helper; see the note above
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
)

# Generate with memory optimization
video = pipe(
    "A serene mountain landscape with flowing waterfall",
    num_frames=60,
    height=512,
    width=896
).frames

Model Specifications

Architecture Details

Component         Specification
Base Model        UMT5-XXL (Unified Multilingual T5)
Precision         FP32 (32-bit floating point)
Format            GGUF (GPT-Generated Unified Format)
Parameters        ~5.7 billion (encoder only; 22 GB at 4 bytes/parameter)
Context Length    512 tokens
Hidden Size       4096 dimensions
Encoder Layers    24 transformer layers
Attention Heads   64 attention heads

Precision Comparison

Precision           Size     VRAM   Accuracy  Speed
FP32 (this model)   22 GB    24 GB  Highest   Slower
FP16                11 GB    12 GB  High      Medium
FP8                 5.5 GB   6 GB   Good      Faster

GGUF Format Benefits

  • Efficient Loading: Lazy loading and memory mapping support
  • Cross-Platform: Compatible with various inference engines
  • Optimized Storage: Compressed tensor storage with minimal quality loss
  • Flexibility: Easy integration with custom pipelines (see the inspection sketch below)
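
As a concrete illustration of the format, the gguf Python package from the llama.cpp project can memory-map a file and read its header without loading any tensors. A minimal inspection sketch, assuming pip install gguf:

from gguf import GGUFReader  # pip install gguf

reader = GGUFReader(
    "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
)

# Header metadata keys: architecture, tensor count, tokenizer info, etc.
for key in reader.fields:
    print(key)

# First few tensors with their shapes and storage types
for tensor in reader.tensors[:5]:
    print(tensor.name, tensor.shape, tensor.tensor_type)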

Performance Tips and Optimization

Memory Optimization

  1. Use CPU Offloading: Enable enable_model_cpu_offload() for lower VRAM
  2. Attention Slicing: Use enable_attention_slicing() to reduce memory peaks
  3. VAE Tiling: For long videos, enable VAE tiling to process in chunks
  4. Batch Size: Keep the batch size at 1 for the FP32 encoder on 24 GB VRAM (these optimizations are combined in the sketch below)
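
A sketch combining the four settings above on an already-constructed pipeline; enable_vae_tiling() exists on diffusers pipelines that expose a VAE, and whether the WAN pipeline provides it is an assumption here:

# Combine the memory optimizations on an existing pipeline object.
pipe.enable_model_cpu_offload()  # 1. offload idle submodules to CPU
pipe.enable_attention_slicing()  # 2. compute attention in slices
pipe.enable_vae_tiling()         # 3. decode long videos tile by tile (if supported)

# 4. keep the batch size at 1: generate a single prompt per call
video = pipe("A serene mountain landscape", num_frames=60).frames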

Quality vs Performance

  • Maximum Quality: FP32 encoder + FP32 diffusion model (requires 48+ GB VRAM)
  • Balanced: FP32 encoder + FP16 diffusion model (requires 32 GB VRAM; sketched below)
  • Efficient: FP16 encoder + FP16 diffusion model (requires 16 GB VRAM)
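
The Balanced configuration can be sketched with plain torch dtype casts, assuming a diffusers-style pipeline that exposes its prompt encoder and diffusion backbone as .text_encoder and .transformer (attribute names vary by pipeline):

import torch

# Diffusion backbone in half precision, prompt encoding in full precision.
pipe.transformer.to(torch.float16)
pipe.text_encoder.to(torch.float32)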

Prompt Engineering Tips

  • Detailed Descriptions: FP32 precision excels with complex, detailed prompts
  • Cinematic Language: Use film terminology for better camera control
  • Scene Composition: Describe foreground, midground, background elements
  • Motion Description: Specify camera movement and subject actions clearly
  • Lighting Details: Describe lighting conditions for enhanced visual quality (an example combining these tips follows below)
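
Combining these tips, an illustrative prompt might read:

# Illustrative prompt layering composition, camera language, motion,
# and lighting, following the tips above.
prompt = (
    "Cinematic wide shot: a lone lighthouse on a rocky cliff in the "
    "foreground, waves crashing in the midground, storm clouds on the "
    "horizon. Slow dolly-in toward the tower as warm lamp light cuts "
    "through the cold blue dusk."
)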

Recommended Settings

# High-quality video generation settings
generation_config = {
    "num_frames": 120,           # 5 seconds at 24fps
    "height": 720,               # 720p resolution
    "width": 1280,               # 16:9 aspect ratio
    "guidance_scale": 7.5,       # Balanced prompt adherence
    "num_inference_steps": 50,   # High quality (slower)
    "fps": 24,                   # Cinematic frame rate
}

# Fast preview settings
preview_config = {
    "num_frames": 60,            # 2.5 seconds at 24fps
    "height": 512,               # Lower resolution
    "width": 896,                # 16:9 aspect ratio
    "guidance_scale": 7.0,       # Slightly lower
    "num_inference_steps": 25,   # Faster generation
    "fps": 24,
}
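
To apply one of these dictionaries, note that fps is an export-time setting rather than a sampling argument in the examples above, so split it off before the call (assuming the pipeline accepts the remaining keys as keyword arguments):

from diffusers.utils import export_to_video

config = dict(generation_config)  # or preview_config
fps = config.pop("fps")           # fps belongs to export, not sampling

video_frames = pipe(prompt, **config).frames
export_to_video(video_frames, "output.mp4", fps=fps)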

License

This model is released under the WAN License. Please review the license terms before use:

  • Non-Commercial Use: Permitted for research and personal projects
  • Commercial Use: Requires separate licensing agreement
  • Attribution: Required in derivative works
  • Redistribution: Allowed with proper attribution and license inclusion

For commercial licensing inquiries, please contact the WAN development team.

Citation

If you use these encoders in your research or projects, please cite:

@misc{wan22-fp32-encoders,
  title={WAN 2.2 FP32 Text Encoders for Video Generation},
  author={WAN Team},
  year={2024},
  howpublished={\url{https://huggingface.co/wan/wan-2.2-fp32-encoders-gguf}},
  note={UMT5-XXL text encoder in GGUF format for high-precision video generation}
}

Related Resources

Related Models

  • WAN 2.2 Base: Complete video generation model
  • WAN 2.2 FP16 Encoders: Lower precision for reduced VRAM usage
  • WAN 2.2 VAE: Video autoencoder for latent space processing
  • WAN Camera LoRAs: Camera control enhancement modules

Troubleshooting

Common Issues

Out of Memory (OOM) Errors:

  • Reduce resolution (e.g., 1280×720 → 896×512)
  • Lower frame count (120 β†’ 60 frames)
  • Enable CPU offloading and attention slicing
  • Consider using FP16 encoder variant instead

Slow Generation Speed:

  • FP32 is inherently slower than FP16/FP8
  • Reduce num_inference_steps (50 β†’ 30)
  • Use smaller resolution for previews
  • Ensure CUDA is properly installed and utilized

Loading Errors:

  • Verify GGUF loader compatibility
  • Check file integrity (22 GB expected size; see the size check below)
  • Ensure sufficient disk space and RAM
  • Update diffusers and transformers libraries
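
A quick size check on the download; a complete FP32 encoder file should come to roughly 21-22 GiB:

import os

path = "E:/huggingface/wan22-fp32-encoders-gguf/text_encoders/umt5-xxl-encoder-f32.gguf"
print(f"{os.path.getsize(path) / 2**30:.1f} GiB")  # expect roughly 21-22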

Quality Issues:

  • Increase guidance_scale (7.5 β†’ 9.0) for stronger prompt adherence
  • Use more detailed, descriptive prompts
  • Increase num_inference_steps for better quality
  • Check that FP32 precision is actually being used



Model Version: 2.2 | Last Updated: 2024-08-12 | README Version: v1.0 | Maintained by: WAN Development Team
