Qwen3-VL-32B-Thinking (Abliterated)
A modified version of Qwen3-VL-32B-Thinking with reduced safety filtering, yielding an uncensored vision-language model that retains the base model's reasoning capabilities. This abliterated variant removes refusal mechanisms from the text component while preserving the original vision processing.
Model Description
Qwen3-VL-32B-Thinking-Abliterated is a 33-billion parameter multimodal large language model that combines advanced visual understanding with powerful reasoning capabilities. The model processes images and text inputs simultaneously, excelling at visual agent tasks, GUI recognition, spatial perception, OCR across 32 languages, and STEM reasoning.
Key Features
- Vision-Language Understanding: Simultaneous processing of images and text for comprehensive multimodal analysis
- Advanced Reasoning: Enhanced "Thinking" mode for complex problem-solving and step-by-step reasoning
- Spatial Perception: 3D grounding and precise object positioning capabilities
- Code Generation: Generate executable code from visual inputs (images/videos)
- Massive Context: Native 256K token context (expandable to 1M tokens)
- Multilingual OCR: Support for 32 languages with high accuracy
- Uncensored: Abliterated version with significantly reduced safety filtering
Architecture Innovations:
- Interleaved-MRoPE: Advanced positional embeddings for multimodal understanding
- DeepStack: Multi-level feature fusion for improved visual comprehension
- Text-Timestamp Alignment: Enhanced video understanding capabilities
⚠️ Important Safety Notice: This abliterated model has reduced safety filtering and may generate sensitive, controversial, or inappropriate content. Users must rigorously review outputs and implement appropriate monitoring for production use.
Repository Contents
Model Files
| File | Size | Description |
|---|---|---|
| qwen3-vl-32b-thinking-abliterated.safetensors | 63 GB | Complete model weights in SafeTensors format (BF16 precision) |
| qwen3-vl-32b-thinking-abliterated-f16.gguf | 62 GB | GGUF format, FP16 precision for llama.cpp compatibility |
| qwen3-vl-32b-thinking-abliterated-q8-0.gguf | 33 GB | GGUF format, Q8_0 quantization (8-bit, minimal quality loss) |
| qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf | 19 GB | GGUF format, Q4_K_M quantization (4-bit, balanced quality/size) |
| README.md | 17 KB | Model documentation and usage guide |
Total Repository Size: ~174 GB (includes multiple quantizations)
Hardware Requirements
Inference
SafeTensors Format (Transformers)
| Precision | VRAM Required | System RAM | Recommended GPU |
|---|---|---|---|
| BF16 (Full) | 66 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| FP16 | 64 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| INT8 Quantized | 35-40 GB | 32 GB+ | NVIDIA A6000, RTX 6000 Ada |
| INT4 Quantized | 20-25 GB | 16 GB+ | NVIDIA RTX 4090, A5000 |
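The table above lists an INT4 option but the Performance Optimization section only shows an INT8 example, so here is a minimal 4-bit loading sketch with bitsandbytes NF4, following the loading pattern used elsewhere in this document. The NF4 settings are illustrative defaults and assume a recent bitsandbytes build.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
# 4-bit NF4 quantization; expect roughly 20-25 GB of VRAM per the table above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)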
GGUF Format (llama.cpp)
| File | Quantization | VRAM Required | System RAM | Recommended GPU |
|---|---|---|---|---|
| f16.gguf | FP16 | 64 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| q8-0.gguf | Q8_0 (8-bit) | 35 GB+ | 24 GB+ | NVIDIA A6000, RTX 6000 Ada |
| q4-k-m.gguf | Q4_K_M (4-bit) | 20 GB+ | 16 GB+ | NVIDIA RTX 4090, RTX 3090, A5000 |
Note: GGUF models can utilize CPU RAM offloading for systems with insufficient VRAM.
Training/Fine-tuning
- VRAM: 80 GB+ per GPU
- Multi-GPU: Required for full fine-tuning (4x A100 recommended)
- Disk Space: 100 GB+ (including checkpoints and gradients)
- System RAM: 64 GB+
Disk Space
- SafeTensors Only: 63 GB
- All GGUF Formats: 114 GB (FP16 + Q8_0 + Q4_K_M)
- Complete Repository: 174 GB (all formats)
- Cache & Temporary: 10-20 GB
- Total Recommended: 200 GB free space (for complete repository)
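If only one format is needed, disk usage can be reduced by downloading selectively. The sketch below uses huggingface_hub's snapshot_download with an allow_patterns filter; the repo id is taken from the citation section and the filename pattern from the table above, and both are assumptions that may differ from the actual Hub listing.
from huggingface_hub import snapshot_download
# Download only the Q4_K_M GGUF file (~19 GB) instead of the full repository.
# Repo id and filename pattern are assumptions based on this document.
snapshot_download(
    repo_id="huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated",
    allow_patterns=["*q4-k-m.gguf"],
    local_dir="E:/huggingface/qwen3-vl-32b-thinking"
)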
Usage Examples
Basic Setup
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
# Load model from local directory
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
attn_implementation="flash_attention_2" # Recommended for performance
)
processor = AutoProcessor.from_pretrained(
model_path,
trust_remote_code=True
)
Image Understanding
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
# Load model and processor
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
model_path,
trust_remote_code=True
)
# Prepare image and prompt
image = Image.open("path/to/image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Describe this image in detail."}
]
}
]
# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40960,
        do_sample=True,  # Enable sampling so temperature/top_p/top_k take effect
        temperature=1.0,
        top_p=0.95,
        top_k=20
    )
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
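Note that decoding outputs[0] as above returns the prompt together with the generated answer. If only the newly generated tokens are wanted (the same applies to the later examples), the continuation can be trimmed before decoding; a minimal sketch continuing the variables from the example above:
# Keep only the tokens generated after the prompt, then decode
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated_tokens, skip_special_tokens=True)
print(response)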
Visual Question Answering with Reasoning
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Load image
image = Image.open("diagram.png")
# Prepare question with reasoning request
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "What is the area of this geometric shape? Think step by step."}
]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
# Generate with extended reasoning tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=81920,  # Extended for reasoning tasks
        do_sample=True,        # Enable sampling so temperature/top_p/top_k take effect
        temperature=1.0,
        top_p=0.95,
        top_k=20
    )
answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
Multi-Image Analysis
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
attn_implementation="flash_attention_2" # Essential for multi-image
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Load multiple images
images = [
Image.open("screenshot1.png"),
Image.open("screenshot2.png"),
Image.open("screenshot3.png")
]
# Create multi-image message
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": images[0]},
{"type": "image", "image": images[1]},
{"type": "image", "image": images[2]},
{"type": "text", "text": "Compare these three screenshots and identify the differences."}
]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
OCR and Text Extraction
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Load document image
image = Image.open("document.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Extract all text from this document. Preserve formatting and structure."}
]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)
extracted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(extracted_text)
Code Generation from Visual Input
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Load UI mockup or diagram
image = Image.open("ui_mockup.png")
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": "Generate React component code for this UI design."}
]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)
code = processor.decode(outputs[0], skip_special_tokens=True)
print(code)
Model Specifications
| Specification | Details |
|---|---|
| Model Family | Qwen3-VL |
| Variant | Thinking (Abliterated) |
| Parameter Count | 33 Billion |
| Architecture | Vision-Language Transformer |
| Base Model | Qwen/Qwen3-VL-32B-Thinking |
| Precision | BF16 |
| Format | SafeTensors |
| Context Length | 256K tokens (native), 1M tokens (extended) |
| Vision Encoder | Multi-level feature fusion (DeepStack) |
| Positional Encoding | Interleaved-MRoPE |
| OCR Languages | 32 languages supported |
| Video Support | Yes (with text-timestamp alignment) |
Generation Parameters
Vision-Language Tasks:
- Temperature: 1.0
- Top-P: 0.95
- Top-K: 20
- Max Tokens: 40,960
Text-Only Tasks:
- Temperature: 1.0
- Top-P: 0.95
- Top-K: 20
- Max Tokens: 32,768
Reasoning Tasks:
- Temperature: 1.0
- Top-P: 0.95
- Top-K: 20
- Max Tokens: 81,920 (extended for step-by-step reasoning)
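These presets can be collected in a GenerationConfig so they do not have to be repeated in every generate() call. The values below mirror the vision-language preset above; do_sample=True is added because temperature, top-p, and top-k only take effect when sampling is enabled.
from transformers import GenerationConfig
# Vision-language preset from the list above
vl_generation_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    max_new_tokens=40960
)
# Usage: outputs = model.generate(**inputs, generation_config=vl_generation_config)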
Performance Optimization
Memory Optimization (INT8 Quantization)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0,
llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModelForCausalLM.from_pretrained(
model_path,
quantization_config=quantization_config,
device_map="auto",
trust_remote_code=True
)
# VRAM usage: ~35-40 GB (down from 66 GB)
Inference Speed (Flash Attention 2)
from transformers import AutoModelForCausalLM
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
attn_implementation="flash_attention_2" # 2-3x faster, lower memory
)
Multi-GPU Setup
from transformers import AutoModelForCausalLM
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
# Automatic layer distribution across GPUs
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto", # Automatically distributes layers
trust_remote_code=True
)
# Example: 2x A100 (40GB) = 80GB total VRAM
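To control how much memory device_map="auto" may claim on each card, and how much can spill to CPU RAM, a max_memory map can be passed to from_pretrained. The per-device caps below are illustrative values for 2x 40 GB GPUs, not measured requirements.
from transformers import AutoModelForCausalLM
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
# Illustrative caps for 2x A100 (40GB) plus CPU offload headroom
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "38GiB", 1: "38GiB", "cpu": "64GiB"},
    trust_remote_code=True
)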
Batch Processing Optimization
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-thinking"
model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Process multiple images efficiently
images = [Image.open(f"image_{i}.jpg") for i in range(4)]
prompts = ["Describe this image." for _ in range(4)]
# Batch processing
all_messages = []
for img, prompt in zip(images, prompts):
    all_messages.append([
        {"role": "user", "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": prompt}
        ]}
    ])
# Process in batch
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in all_messages]
# Decoder-only models need left padding for correct batched generation
processor.tokenizer.padding_side = "left"
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)
responses = [processor.decode(out, skip_special_tokens=True) for out in outputs]
GGUF Format Usage (llama.cpp)
The repository includes GGUF-format models for use with llama.cpp and compatible backends. GGUF models offer flexibility with CPU/GPU offloading and are optimized for inference.
Installation
# Install llama-cpp-python with GPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
# Or build from source for optimal performance
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
Basic GGUF Usage
from llama_cpp import Llama
# Choose quantization level based on your hardware
model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"
# Initialize model; the chat template is read from the GGUF metadata when available
llm = Llama(
    model_path=model_path,
    n_ctx=8192,       # Context window (adjust based on RAM)
    n_gpu_layers=40,  # Offload layers to GPU (adjust based on VRAM)
    n_threads=8,      # CPU threads for computation
    verbose=False
)
# Generate text response
response = llm.create_chat_completion(
messages=[
{"role": "user", "content": "Explain the concept of neural networks."}
],
temperature=1.0,
max_tokens=2048
)
print(response['choices'][0]['message']['content'])
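Because the Thinking variant tends to produce long step-by-step outputs, streaming lets you monitor generation as it happens instead of waiting for the full completion. A minimal streaming sketch that continues with the llm object created above:
# Stream tokens as they are generated
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain the concept of neural networks."}],
    temperature=1.0,
    max_tokens=2048,
    stream=True
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()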
GGUF Image Understanding
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
import base64
model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"
# Image input in llama-cpp-python requires a vision chat handler initialized with the
# model's multimodal projector (mmproj) GGUF. That projector file is not listed in this
# repository; the path below is a placeholder, and a handler matching this model family
# should be used if llama-cpp-python provides one.
chat_handler = Llava15ChatHandler(clip_model_path="path/to/mmproj.gguf")
# Load model with vision capabilities
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=40,
    chat_handler=chat_handler
)
# Load and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
# Generate response with image
response = llm.create_chat_completion(
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
{"type": "text", "text": "Describe this image in detail."}
]
}
],
temperature=1.0,
max_tokens=4096
)
print(response['choices'][0]['message']['content'])
GGUF Memory Optimization
from llama_cpp import Llama
model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"
# Low VRAM configuration (GPU + CPU offloading)
llm = Llama(
model_path=model_path,
n_ctx=4096, # Reduced context for lower memory
n_gpu_layers=20, # Partial GPU offload (adjust for your VRAM)
n_threads=16, # More CPU threads for offloaded layers
use_mmap=True, # Memory-map model file (reduces RAM usage)
use_mlock=False, # Don't lock memory (allows swapping if needed)
verbose=False
)
# VRAM usage: ~12-15 GB with n_gpu_layers=20
# Remaining computation runs on CPU
GGUF Quantization Comparison
| Quantization | File Size | Quality | Speed | Recommended Use Case |
|---|---|---|---|---|
| FP16 | 62 GB | 100% | Baseline | Maximum quality, reference standard |
| Q8_0 | 33 GB | 98-99% | 1.2x faster | Near-lossless, production deployments |
| Q4_K_M | 19 GB | 95-97% | 1.8x faster | Balanced quality/performance, most users |
Recommendation: Start with Q4_K_M for best balance. Upgrade to Q8_0 or FP16 if quality issues arise.
Use Cases
Recommended Applications
- Visual Analysis: Image understanding, scene description, object detection
- OCR: Document digitization, text extraction (32 languages)
- STEM Reasoning: Math problem solving, scientific diagram analysis
- Code Generation: UI-to-code, diagram-to-implementation
- Visual Agents: GUI automation, tool invocation from visual input
- Spatial Understanding: 3D grounding, object positioning, scene layout
- Video Understanding: Frame analysis, temporal reasoning
Not Recommended For
- ⚠️ Safety-Critical Applications: Medical diagnosis, legal advice, financial decisions
- ⚠️ Production Without Filtering: Reduced safety filtering requires additional output validation (see the filtering sketch below)
- ⚠️ Real-Time Applications: Large model size may not meet latency requirements
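As a starting point for the output validation mentioned above, the sketch below shows a minimal post-generation filter. The blocklist, the passes_policy helper, and the policy it encodes are illustrative placeholders; production deployments should substitute a real moderation pipeline.
import re
# Illustrative blocklist; replace with your own policy or a dedicated moderation service
BLOCKED_PATTERNS = [
    r"(?i)\bcredit card number\b",
    r"(?i)\bsocial security number\b",
]
def passes_policy(text: str) -> bool:
    """Return True if the generated text matches none of the blocked patterns."""
    return not any(re.search(pattern, text) for pattern in BLOCKED_PATTERNS)
# Example: gate a decoded model response before showing or storing it
response = "The invoice lists a total of $42.00."
print(response if passes_policy(response) else "[output withheld by post-generation filter]")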
License
License: Apache 2.0
This model is released under the Apache 2.0 license, allowing commercial and non-commercial use with proper attribution.
Base Model: Qwen/Qwen3-VL-32B-Thinking
Modification: Abliteration applied by huihui-ai
Important Disclaimers
⚠️ Content Warning: This abliterated version has significantly reduced safety filtering. Generated content may include:
- Sensitive or controversial topics
- Potentially inappropriate material
- Unfiltered responses to prompts
⚠️ User Responsibility:
- Users must rigorously review all generated outputs
- Implement real-time monitoring for production deployments
- Apply additional filtering layers as appropriate
- Comply with applicable laws and regulations
⚠️ No Warranty: The model is provided as-is; its developers accept no responsibility for consequences arising from its use.
Citation
@misc{qwen3vl32bthinking2025,
title={Qwen3-VL-32B-Thinking: Vision-Language Model with Reasoning},
author={Qwen Team, Alibaba Cloud},
year={2025},
month={October},
howpublished={\url{https://huggingface.co/Qwen/Qwen3-VL-32B-Thinking}},
}
@misc{qwen3vl32bthinkingabliterated2025,
title={Huihui-Qwen3-VL-32B-Thinking-Abliterated: Uncensored Vision-Language Model},
author={huihui-ai},
year={2025},
howpublished={\url{https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated}},
note={Abliterated version of Qwen3-VL-32B-Thinking}
}
Resources
Official Documentation
- Qwen3-VL GitHub: https://github.com/QwenLM/Qwen3-VL
- Official Blog: https://qwenlm.github.io/blog/qwen3/
- Qwen Team: https://huggingface.co/Qwen
- Base Model: https://huggingface.co/Qwen/Qwen3-VL-32B-Thinking
- Abliterated Version: https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated
Related Models
- Qwen3-VL-32B-Instruct: Standard instruction-tuned version (with safety filtering)
- Qwen3-VL-32B-Instruct-FP8: Quantized version for reduced memory usage
- Qwen3-VL Collection: https://huggingface.co/collections/Qwen/qwen3-vl
Community & Support
- Transformers Documentation: https://huggingface.co/docs/transformers
- llama.cpp Documentation: https://github.com/ggerganov/llama.cpp
- llama-cpp-python: https://github.com/abetlen/llama-cpp-python
- Flash Attention 2: https://github.com/Dao-AILab/flash-attention
- Model Issues: Report at GitHub repository or Hugging Face model page
Model Status: Active
Last Updated: 2025-11-05
Local Path: E:\huggingface\qwen3-vl-32b-thinking
Available Formats:
- SafeTensors: qwen3-vl-32b-thinking-abliterated.safetensors (63 GB)
- GGUF FP16: qwen3-vl-32b-thinking-abliterated-f16.gguf (62 GB)
- GGUF Q8_0: qwen3-vl-32b-thinking-abliterated-q8-0.gguf (33 GB)
- GGUF Q4_K_M: qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf (19 GB)