Qwen3-VL-32B-Thinking (Abliterated)

A modified version of Qwen3-VL-32B-Thinking with reduced safety filtering: an uncensored vision-language model that retains the base model's reasoning abilities. This abliterated variant removes refusal mechanisms from the text component while leaving the original vision processing intact.

Model Description

Qwen3-VL-32B-Thinking-Abliterated is a 33-billion parameter multimodal large language model that combines advanced visual understanding with powerful reasoning capabilities. The model processes images and text inputs simultaneously, excelling at visual agent tasks, GUI recognition, spatial perception, OCR across 32 languages, and STEM reasoning.

Key Features

  • Vision-Language Understanding: Simultaneous processing of images and text for comprehensive multimodal analysis
  • Advanced Reasoning: Enhanced "Thinking" mode for complex problem-solving and step-by-step reasoning
  • Spatial Perception: 3D grounding and precise object positioning capabilities
  • Code Generation: Generate executable code from visual inputs (images/videos)
  • Massive Context: Native 256K token context (expandable to 1M tokens)
  • Multilingual OCR: Support for 32 languages with high accuracy
  • Uncensored: Abliterated version with significantly reduced safety filtering

Architecture Innovations:

  • Interleaved-MRoPE: Advanced positional embeddings for multimodal understanding
  • DeepStack: Multi-level feature fusion for improved visual comprehension
  • Text-Timestamp Alignment: Enhanced video understanding capabilities

โš ๏ธ Important Safety Notice: This abliterated model has reduced safety filtering and may generate sensitive, controversial, or inappropriate content. Users must rigorously review outputs and implement appropriate monitoring for production use.

Repository Contents

Model Files

| File | Size | Description |
|------|------|-------------|
| qwen3-vl-32b-thinking-abliterated.safetensors | 63 GB | Complete model weights in SafeTensors format (BF16 precision) |
| qwen3-vl-32b-thinking-abliterated-f16.gguf | 62 GB | GGUF format, FP16 precision for llama.cpp compatibility |
| qwen3-vl-32b-thinking-abliterated-q8-0.gguf | 33 GB | GGUF format, Q8_0 quantization (8-bit, minimal quality loss) |
| qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf | 19 GB | GGUF format, Q4_K_M quantization (4-bit, balanced quality/size) |
| README.md | 17 KB | Model documentation and usage guide |

Total Repository Size: ~174 GB (includes multiple quantizations)

Hardware Requirements

Inference

SafeTensors Format (Transformers)

| Precision | VRAM Required | System RAM | Recommended GPU |
|-----------|---------------|------------|-----------------|
| BF16 (Full) | 66 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| FP16 | 64 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| INT8 Quantized | 35-40 GB | 32 GB+ | NVIDIA A6000, RTX 6000 Ada |
| INT4 Quantized | 20-25 GB | 16 GB+ | NVIDIA RTX 4090, A5000 |
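
To choose a precision from the table above, it helps to check how much VRAM is actually available. A minimal sketch using PyTorch (assumes CUDA GPUs are visible):

import torch

# Report total VRAM per visible GPU to help choose between BF16, INT8, and INT4 loading
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")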

GGUF Format (llama.cpp)

| File | Quantization | VRAM Required | System RAM | Recommended GPU |
|------|--------------|---------------|------------|-----------------|
| f16.gguf | FP16 | 64 GB+ | 32 GB+ | NVIDIA A100 (80GB), H100 |
| q8-0.gguf | Q8_0 (8-bit) | 35 GB+ | 24 GB+ | NVIDIA A6000, RTX 6000 Ada |
| q4-k-m.gguf | Q4_K_M (4-bit) | 20 GB+ | 16 GB+ | NVIDIA RTX 4090, RTX 3090, A5000 |

Note: GGUF models can utilize CPU RAM offloading for systems with insufficient VRAM.

Training/Fine-tuning

  • VRAM: 80 GB+ per GPU
  • Multi-GPU: Required for full fine-tuning (4x A100 recommended)
  • Disk Space: 100 GB+ (including checkpoints and gradients)
  • System RAM: 64 GB+

Disk Space

  • SafeTensors Only: 63 GB
  • All GGUF Formats: 114 GB (FP16 + Q8_0 + Q4_K_M)
  • Complete Repository: 174 GB (all formats)
  • Cache & Temporary: 10-20 GB
  • Total Recommended: 200 GB free space (for complete repository)
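
If only one quantization is needed, the full ~174 GB download can be avoided by filtering files. A sketch using huggingface_hub; the repo_id below is illustrative, so substitute the repository you are actually pulling from:

from huggingface_hub import snapshot_download

# Download only the Q4_K_M GGUF plus the README (repo_id is an assumption; adjust as needed)
snapshot_download(
    repo_id="wangkanai/qwen3-vl-32b-thinking",
    allow_patterns=["*q4-k-m.gguf", "README.md"],
    local_dir="E:/huggingface/qwen3-vl-32b-thinking",
)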

Usage Examples

Basic Setup

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

# Load model from local directory
model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # Recommended for performance
)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True
)

Image Understanding

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Load model and processor
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True
)

# Prepare image and prompt
image = Image.open("path/to/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=40960,
        do_sample=True,  # enable sampling so temperature/top_p/top_k take effect
        temperature=1.0,
        top_p=0.95,
        top_k=20
    )

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
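
Note that processor.decode(outputs[0], ...) returns the prompt followed by the reply. A minimal sketch, reusing inputs and outputs from the example above, to keep only the newly generated portion:

# Drop the echoed prompt and decode only the newly generated tokens
prompt_length = inputs["input_ids"].shape[1]
response_only = processor.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response_only)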

Visual Question Answering with Reasoning

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load image
image = Image.open("diagram.png")

# Prepare question with reasoning request
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the area of this geometric shape? Think step by step."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate with extended reasoning tokens
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=81920,  # Extended for reasoning tasks
        do_sample=True,        # enable sampling so temperature/top_p/top_k take effect
        temperature=1.0,
        top_p=0.95,
        top_k=20
    )

answer = processor.decode(outputs[0], skip_special_tokens=True)
print(answer)
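
Thinking variants in the Qwen3 family typically emit their reasoning between <think> and </think> markers before the final answer; this is an assumption carried over from the text-only Qwen3 thinking models, so verify it against your checkpoint's output. A minimal sketch to separate the two:

# Split the reasoning trace from the final answer (assumes <think>...</think> markers)
if "</think>" in answer:
    reasoning, final_answer = answer.rsplit("</think>", 1)
    reasoning = reasoning.split("<think>", 1)[-1].strip()
    final_answer = final_answer.strip()
else:
    reasoning, final_answer = "", answer.strip()

print("Reasoning (truncated):", reasoning[:500])
print("Answer:", final_answer)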

Multi-Image Analysis

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # Recommended for multi-image inputs
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load multiple images
images = [
    Image.open("screenshot1.png"),
    Image.open("screenshot2.png"),
    Image.open("screenshot3.png")
]

# Create multi-image message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": images[0]},
            {"type": "image", "image": images[1]},
            {"type": "image", "image": images[2]},
            {"type": "text", "text": "Compare these three screenshots and identify the differences."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)

response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)

OCR and Text Extraction

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load document image
image = Image.open("document.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Extract all text from this document. Preserve formatting and structure."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)

extracted_text = processor.decode(outputs[0], skip_special_tokens=True)
print(extracted_text)

Code Generation from Visual Input

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load UI mockup or diagram
image = Image.open("ui_mockup.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Generate React component code for this UI design."}
        ]
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)

code = processor.decode(outputs[0], skip_special_tokens=True)
print(code)
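
The response usually wraps the generated component in a fenced code block. A small, hypothetical helper (reusing the code variable from above) to extract the first fenced block; if none is found, it falls back to the raw response:

import re

# Pull the first fenced code block out of the model response
match = re.search(r"`{3}(?:\w+)?[ \t]*\n(.*?)`{3}", code, re.DOTALL)
component_source = match.group(1) if match else code
print(component_source)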

Model Specifications

| Specification | Details |
|---------------|---------|
| Model Family | Qwen3-VL |
| Variant | Thinking (Abliterated) |
| Parameter Count | 33 Billion |
| Architecture | Vision-Language Transformer |
| Base Model | Qwen/Qwen3-VL-32B-Thinking |
| Precision | BF16 |
| Format | SafeTensors |
| Context Length | 256K tokens (native), 1M tokens (extended) |
| Vision Encoder | Multi-level feature fusion (DeepStack) |
| Positional Encoding | Interleaved-MRoPE |
| OCR Languages | 32 languages supported |
| Video Support | Yes (with text-timestamp alignment) |
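
To confirm these specifications against the local checkpoint (context length, vision settings, and so on), the configuration can be inspected directly. Attribute names vary between vision-language architectures, so this is only an inspection sketch:

from transformers import AutoConfig

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Print the full config; look for the text/vision sub-configs and maximum position settings
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
print(config)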

Generation Parameters

Vision-Language Tasks:

  • Temperature: 1.0
  • Top-P: 0.95
  • Top-K: 20
  • Max Tokens: 40,960

Text-Only Tasks:

  • Temperature: 1.0
  • Top-P: 0.95
  • Top-K: 20
  • Max Tokens: 32,768

Reasoning Tasks:

  • Temperature: 1.0
  • Top-P: 0.95
  • Top-K: 20
  • Max Tokens: 81,920 (extended for step-by-step reasoning)
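
These presets can be collected into reusable GenerationConfig objects and passed to model.generate(). A minimal sketch; note that do_sample=True is required for temperature/top-p/top-k sampling to take effect:

from transformers import GenerationConfig

# Preset for vision-language tasks (values from the list above)
vision_generation = GenerationConfig(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    max_new_tokens=40960,
)

# Preset for extended reasoning tasks
reasoning_generation = GenerationConfig(
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    max_new_tokens=81920,
)

# Usage: outputs = model.generate(**inputs, generation_config=vision_generation)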

Performance Optimization

Memory Optimization (INT8 Quantization)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Configure 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,
    llm_int8_enable_fp32_cpu_offload=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)
# VRAM usage: ~35-40 GB (down from 66 GB)
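
For the INT4 row of the hardware table, a comparable sketch using bitsandbytes 4-bit (NF4) loading; exact memory savings and output quality depend on your setup:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Configure 4-bit NF4 quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
# Expected VRAM usage: roughly 20-25 GB (see the INT4 row in the hardware table)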

Inference Speed (Flash Attention 2)

from transformers import AutoModelForCausalLM
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # 2-3x faster, lower memory
)

Multi-GPU Setup

from transformers import AutoModelForCausalLM
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Automatic layer distribution across GPUs
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatically distributes layers
    trust_remote_code=True
)
# Example: 2x A100 (40GB) = 80GB total VRAM
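
To leave headroom for activations, device_map="auto" can be combined with an explicit per-device memory budget. The budgets below are illustrative values for 2x 40 GB GPUs; adjust them to your hardware:

from transformers import AutoModelForCausalLM
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

# Hypothetical per-device budgets; tune to your GPUs and workload
max_memory = {0: "38GiB", 1: "38GiB", "cpu": "64GiB"}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory=max_memory,
    trust_remote_code=True,
)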

Batch Processing Optimization

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-thinking"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Process multiple images efficiently
images = [Image.open(f"image_{i}.jpg") for i in range(4)]
prompts = ["Describe this image." for _ in range(4)]

# Batch processing
all_messages = []
for img, prompt in zip(images, prompts):
    all_messages.append([
        {"role": "user", "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": prompt}
        ]}
    ])

# Process in batch
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in all_messages]
inputs = processor(text=texts, images=images, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40960)

responses = [processor.decode(out, skip_special_tokens=True) for out in outputs]
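
Note: For batched generation with a decoder-only model, left padding is generally preferable; if you see padding-related warnings, try setting processor.tokenizer.padding_side = "left" before calling the processor.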

GGUF Format Usage (llama.cpp)

The repository includes GGUF-format models for use with llama.cpp and compatible backends. GGUF models offer flexible CPU/GPU offloading and are optimized for inference. Note that image input through llama.cpp additionally requires the model's multimodal projector (mmproj) file alongside the language-model GGUF; the vision example below assumes your llama-cpp-python build provides a compatible multimodal chat handler.

Installation

# Install llama-cpp-python with GPU support
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

# Or build from source for optimal performance (newer llama.cpp builds use GGML_CUDA; older ones used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Basic GGUF Usage

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Choose quantization level based on your hardware
model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"

# Initialize model with vision support
llm = Llama(
    model_path=model_path,
    n_ctx=8192,           # Context window (adjust based on RAM)
    n_gpu_layers=40,      # Offload layers to GPU (adjust based on VRAM)
    n_threads=8,          # CPU threads for computation
    chat_format="llava-1-5",  # Vision-language chat format
    verbose=False
)

# Generate text response
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Explain the concept of neural networks."}
    ],
    temperature=1.0,
    max_tokens=2048
)

print(response['choices'][0]['message']['content'])
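
For interactive use, llama-cpp-python can also stream tokens as they are generated. A minimal text-only sketch (adjust model_path and offload settings to your hardware):

from llama_cpp import Llama

model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"

llm = Llama(model_path=model_path, n_ctx=8192, n_gpu_layers=40, verbose=False)

# stream=True yields OpenAI-style chunks with incremental "delta" content
for chunk in llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of quantization."}],
    temperature=1.0,
    max_tokens=512,
    stream=True,
):
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()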

GGUF Image Understanding

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
import base64

model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"

# Load model with vision capabilities
llm = Llama(
    model_path=model_path,
    n_ctx=8192,
    n_gpu_layers=40,
    chat_format="llava-1-5"
)

# Load and encode image
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Generate response with image
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ],
    temperature=1.0,
    max_tokens=4096
)

print(response['choices'][0]['message']['content'])

GGUF Memory Optimization

from llama_cpp import Llama

model_path = "E:/huggingface/qwen3-vl-32b-thinking/qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf"

# Low VRAM configuration (GPU + CPU offloading)
llm = Llama(
    model_path=model_path,
    n_ctx=4096,           # Reduced context for lower memory
    n_gpu_layers=20,      # Partial GPU offload (adjust for your VRAM)
    n_threads=16,         # More CPU threads for offloaded layers
    use_mmap=True,        # Memory-map model file (reduces RAM usage)
    use_mlock=False,      # Don't lock memory (allows swapping if needed)
    verbose=False
)

# VRAM usage: ~12-15 GB with n_gpu_layers=20
# Remaining computation runs on CPU

GGUF Quantization Comparison

| Quantization | File Size | Quality | Speed | Recommended Use Case |
|--------------|-----------|---------|-------|----------------------|
| FP16 | 62 GB | 100% | Baseline | Maximum quality, reference standard |
| Q8_0 | 33 GB | 98-99% | 1.2x faster | Near-lossless, production deployments |
| Q4_K_M | 19 GB | 95-97% | 1.8x faster | Balanced quality/performance, most users |

Recommendation: Start with Q4_K_M for best balance. Upgrade to Q8_0 or FP16 if quality issues arise.

Use Cases

Recommended Applications

  • ๐Ÿ” Visual Analysis: Image understanding, scene description, object detection
  • ๐Ÿ“ OCR: Document digitization, text extraction (32 languages)
  • ๐Ÿงฎ STEM Reasoning: Math problem solving, scientific diagram analysis
  • ๐Ÿ’ป Code Generation: UI-to-code, diagram-to-implementation
  • ๐ŸŽฏ Visual Agents: GUI automation, tool invocation from visual input
  • ๐ŸŒ Spatial Understanding: 3D grounding, object positioning, scene layout
  • ๐ŸŽฌ Video Understanding: Frame analysis, temporal reasoning

Not Recommended For

  • โš ๏ธ Safety-Critical Applications: Medical diagnosis, legal advice, financial decisions
  • โš ๏ธ Production Without Filtering: Reduced safety filtering requires additional output validation
  • โš ๏ธ Real-Time Applications: Large model size may not meet latency requirements

License

License: Apache 2.0

This model is released under the Apache 2.0 license, allowing commercial and non-commercial use with proper attribution.

Base Model: Qwen/Qwen3-VL-32B-Thinking
Modification: Abliteration applied by huihui-ai

Important Disclaimers

โš ๏ธ Content Warning: This abliterated version has significantly reduced safety filtering. Generated content may include:

  • Sensitive or controversial topics
  • Potentially inappropriate material
  • Unfiltered responses to prompts

โš ๏ธ User Responsibility:

  • Users must rigorously review all generated outputs
  • Implement real-time monitoring for production deployments
  • Apply additional filtering layers as appropriate
  • Comply with applicable laws and regulations

โš ๏ธ No Warranty: The model developers disclaim responsibility for consequences arising from model usage.

Citation

@misc{qwen3vl32bthinking2025,
  title={Qwen3-VL-32B-Thinking: Vision-Language Model with Reasoning},
  author={Qwen Team, Alibaba Cloud},
  year={2025},
  month={October},
  howpublished={\url{https://huggingface.co/Qwen/Qwen3-VL-32B-Thinking}},
}

@misc{qwen3vl32bthinkingabliterated2025,
  title={Huihui-Qwen3-VL-32B-Thinking-Abliterated: Uncensored Vision-Language Model},
  author={huihui-ai},
  year={2025},
  howpublished={\url{https://huggingface.co/huihui-ai/Huihui-Qwen3-VL-32B-Thinking-abliterated}},
  note={Abliterated version of Qwen3-VL-32B-Thinking}
}

Resources

Official Documentation

Related Models

Community & Support


Model Status: Active
Last Updated: 2025-11-05
Local Path: E:\huggingface\qwen3-vl-32b-thinking

Available Formats:

  • SafeTensors: qwen3-vl-32b-thinking-abliterated.safetensors (63 GB)
  • GGUF FP16: qwen3-vl-32b-thinking-abliterated-f16.gguf (62 GB)
  • GGUF Q8_0: qwen3-vl-32b-thinking-abliterated-q8-0.gguf (33 GB)
  • GGUF Q4_K_M: qwen3-vl-32b-thinking-abliterated-q4-k-m.gguf (19 GB)