Qwen3-VL-8B-Instruct (Abliterated)
This is an abliterated (uncensored) version of the Qwen3-VL-8B-Instruct multimodal vision-language model. The model has undergone abliteration to remove safety guardrails and content filtering, allowing unrestricted responses to all queries. This 8-billion parameter instruction-tuned model excels at visual question answering, image captioning, optical character recognition (OCR), and complex visual reasoning tasks.
⚠️ WARNING: This is an uncensored model variant with safety restrictions removed. Use responsibly and in compliance with applicable laws and ethical guidelines.
Model Description
Qwen3-VL-8B-Instruct (Abliterated) is a modified version of the Qwen3 Vision-Language model with content filtering removed. Key capabilities include:
- Visual Understanding: Analyze images, charts, diagrams, screenshots, and documents
- Multimodal Conversation: Engage in multi-turn dialogues about visual content
- Optical Character Recognition: Extract and understand text from images
- Visual Reasoning: Answer complex questions requiring visual analysis and logical reasoning
- Document Understanding: Process scanned documents, forms, and structured layouts
- Uncensored Responses: No content filtering or safety guardrails
- Model Architecture: Vision Transformer encoder + Qwen3-8B language model decoder
- Training: Instruction-tuned on diverse vision-language tasks, then abliterated
- Context Length: Up to 32K tokens (text + visual tokens)
- Languages: Multilingual support (English, Chinese, and more)
- Modification: Safety guardrails removed through the abliteration process
Repository Contents
qwen3-vl-8b-instruct/
├── qwen3-vl-8b-instruct-abliterated.safetensors # Complete model weights (17 GB)
├── qwen3-vl-8b-instruct-abliterated-f16.gguf # FP16 GGUF format (16 GB)
├── qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf # Q4_K_M quantized (4.7 GB)
├── qwen3-vl-8b-instruct-abliterated-q8-0.gguf # Q8_0 quantized (8.2 GB)
└── README.md # This file
Total Repository Size: ~46 GB (multiple formats for different use cases)
File Details:
qwen3-vl-8b-instruct-abliterated.safetensors: Complete merged model in safetensors format
- Size: 17 GB
- Precision: FP16 (half precision)
- Format: Single-file merged weights (not sharded)
- Use with: Transformers library, standard PyTorch inference
- Best for: GPU inference with 20GB+ VRAM
qwen3-vl-8b-instruct-abliterated-f16.gguf: FP16 GGUF format
- Size: 16 GB
- Precision: FP16 (half precision)
- Format: GGUF (GPT-Generated Unified Format)
- Use with: llama.cpp, Ollama, LM Studio
- Best for: CPU/GPU inference with llama.cpp ecosystem
qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf: Q4_K_M quantized GGUF
- Size: 4.7 GB
- Precision: 4-bit K-quant (medium quality)
- Format: GGUF quantized
- Use with: llama.cpp, Ollama, LM Studio
- Best for: Lower VRAM systems (8-12 GB), good quality/size balance
qwen3-vl-8b-instruct-abliterated-q8-0.gguf: Q8_0 quantized GGUF
- Size: 8.2 GB
- Precision: 8-bit quantization
- Format: GGUF quantized
- Use with: llama.cpp, Ollama, LM Studio
- Best for: 12-16 GB VRAM, minimal quality loss from FP16
Hardware Requirements
SafeTensors Format (FP16)
Minimum Requirements:
- VRAM: 20 GB (FP16 inference)
- RAM: 32 GB system memory
- Disk Space: 20 GB free space
- GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40 series, A100, etc.)
Recommended Requirements:
- VRAM: 24 GB+ (RTX 4090, A6000, A100 for longer sequences)
- RAM: 64 GB system memory
- Disk Space: 30 GB+ (for model caching and optimization)
- GPU: NVIDIA RTX 4090, A100, or H100 for optimal performance
GGUF Formats (Multiple Options)
F16 GGUF (qwen3-vl-8b-instruct-abliterated-f16.gguf):
- VRAM: 18-20 GB GPU VRAM recommended
- RAM: 32 GB for GPU offloading, 64 GB for CPU inference
- Disk Space: 20 GB
- Use Case: GPU inference with llama.cpp ecosystem
Q8_0 GGUF (qwen3-vl-8b-instruct-abliterated-q8-0.gguf):
- VRAM: 12-16 GB GPU VRAM
- RAM: 16 GB for GPU offloading, 32 GB for CPU inference
- Disk Space: 10 GB
- Quality: Minimal quality loss from FP16, excellent balance
- Use Case: Mid-range GPUs (RTX 3060 12GB, RTX 4060 Ti 16GB, etc.)
Q4_K_M GGUF (qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf):
- VRAM: 8-12 GB GPU VRAM
- RAM: 8 GB for GPU offloading, 16 GB for CPU inference
- Disk Space: 6 GB
- Quality: Good quality/size balance, suitable for most tasks
- Use Case: Consumer GPUs (RTX 3060, RTX 4060, etc.)
CPU-Only Inference (GGUF formats)
- RAM: 32-64 GB system memory
- CPU: Modern CPU with AVX2 support (Intel Core i5/i7/i9, AMD Ryzen)
- Performance: Much slower than GPU, but functional
- Recommended: Q4_K_M format for best performance/quality balance
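A rough rule of thumb for sizing hardware: weight memory is roughly parameter count × bits-per-weight ÷ 8, plus overhead for the KV cache, activations, and runtime buffers. A minimal sketch (the bits-per-weight values and the 25% overhead factor are approximations, not measurements):

def estimate_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.25) -> float:
    # params * bits / 8 gives bytes for the weights; multiply by an assumed overhead factor
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb * overhead

for label, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{label}: ~{estimate_gb(8, bits):.1f} GB including overhead")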
Usage Examples
Installation
pip install transformers torch torchvision pillow accelerate
Basic Image Understanding
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Load abliterated model from local directory
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
torch_dtype=torch.float16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Load and process image
image = Image.open("example_image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What objects do you see in this image?"}
]
}
]
# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to("cuda")
# Generate response
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.9
)
# Decode and print response
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)
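Note that batch_decode above returns the prompt text together with the answer. To keep only the newly generated tokens, slice off the prompt length before decoding (uses the same inputs and output_ids as above):

# Keep only the tokens generated after the prompt
generated_ids = output_ids[:, inputs["input_ids"].shape[-1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)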
Note: Since this is an abliterated model stored as a single merged file, you'll need to use a compatible processor config. Use the original Qwen2-VL processor from Hugging Face for tokenization and image processing.
Multi-Turn Conversation
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
torch_dtype=torch.float16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Multi-turn conversation
image = Image.open("chart.png")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What type of chart is this?"}
]
},
{
"role": "assistant",
"content": [{"type": "text", "text": "This is a bar chart showing sales data."}]
},
{
"role": "user",
"content": [{"type": "text", "text": "What was the highest value?"}]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)
OCR and Document Understanding
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
torch_dtype=torch.float16,
device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# OCR from document
document_image = Image.open("invoice.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Extract all text from this document and identify the invoice number and total amount."}
]
}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[document_image], return_tensors="pt").to("cuda")
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=1024, temperature=0.3)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)
Loading with Safetensors Library Directly
from safetensors.torch import load_file
import torch
# Load the abliterated model weights directly
weights = load_file("E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated.safetensors")
# Inspect model structure
print("Model layers:", list(weights.keys())[:10]) # First 10 keys
print(f"Total parameters: {sum(w.numel() for w in weights.values()):,}")
GGUF Format Usage
The GGUF formats are designed for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible inference engines. These formats are optimized for flexible deployment across CPU and GPU systems. Note that vision models in the llama.cpp ecosystem generally need a companion multimodal projector (mmproj) file, and recent llama.cpp releases ship llama-cli / llama-mtmd-cli binaries rather than main; adjust the commands below to match your build.
Using with llama.cpp
Installation:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Basic Inference (Q4_K_M for best balance):
./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf" \
--image "path/to/image.jpg" \
-p "Describe this image in detail." \
-n 512 \
--temp 0.7 \
--top-p 0.9
GPU Acceleration (offload layers to GPU):
./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q8-0.gguf" \
--image "path/to/image.jpg" \
-p "What objects are in this image?" \
-n 512 \
--n-gpu-layers 35 \
--temp 0.7
CPU-Only Inference (optimized threading):
./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf" \
--image "path/to/image.jpg" \
-p "Analyze this chart." \
-n 512 \
--threads 8
Using with Ollama
Create Modelfile:
FROM E:\huggingface\qwen3-vl-8b-instruct\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
SYSTEM """You are an uncensored vision-language AI assistant capable of analyzing images and answering questions without content filtering."""
Create and run model:
ollama create qwen3-vl-abliterated -f ./Modelfile
ollama run qwen3-vl-abliterated
Interactive use:
>>> What's in this image? /path/to/image.jpg
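Ollama also exposes a local REST API (default port 11434), which is handy for scripting. A minimal sketch assuming the model was created as qwen3-vl-abliterated above; images are passed as base64 strings:

import base64
import requests

with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-vl-abliterated",
        "prompt": "Describe this image in detail.",
        "images": [image_b64],  # base64-encoded image data
        "stream": False,
    },
)
print(resp.json()["response"])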
Using with LM Studio
- Open LM Studio
- Go to "Local Models" → "Import Model"
- Select one of the GGUF files:
- Use Q4_K_M for best performance on consumer hardware
- Use Q8_0 for better quality with more VRAM
- Use F16 for maximum quality
- Load the model and configure:
- Context Length: 32768
- GPU Offload: Adjust based on your VRAM
- Temperature: 0.7 (adjust for your use case)
- Use the image upload feature to analyze images
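LM Studio can also serve the loaded model through its OpenAI-compatible local server (Developer tab; http://localhost:1234/v1 is the usual default). A sketch using the openai Python client; the model identifier is an assumption, so use whatever name LM Studio shows for the loaded GGUF:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any non-empty key works locally

with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="qwen3-vl-8b-instruct-abliterated-q4-k-m",  # assumed identifier; check the LM Studio UI
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
    max_tokens=512,
    temperature=0.7,
)
print(completion.choices[0].message.content)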
Python with llama-cpp-python
Installation:
pip install llama-cpp-python
Basic Usage:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler
# Initialize chat handler for vision model
# clip_model_path should point to the model's multimodal projector (mmproj) GGUF;
# Llava15ChatHandler is a generic vision handler - if your llama-cpp-python version
# ships a Qwen-specific handler, prefer that instead
chat_handler = Llava15ChatHandler(clip_model_path="path/to/clip/model")
# Load model
llm = Llama(
model_path="E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf",
chat_handler=chat_handler,
n_ctx=32768,
n_gpu_layers=35, # Adjust based on VRAM
verbose=False
)
# Analyze image
response = llm.create_chat_completion(
messages=[
{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
{"type": "text", "text": "What is in this image?"}
]
}
],
temperature=0.7,
max_tokens=512
)
print(response["choices"][0]["message"]["content"])
Format Selection Guide
Choose Q4_K_M if:
- You have 8-12 GB VRAM
- You want fast inference with good quality
- Storage space is a concern
- Most consumer hardware scenarios
Choose Q8_0 if:
- You have 12-16 GB VRAM
- You want minimal quality loss from FP16
- You can spare the extra storage
- Professional or high-quality output needs
Choose F16 GGUF if:
- You have 20+ GB VRAM
- You want maximum quality
- You prefer GGUF ecosystem over PyTorch
- You need llama.cpp compatibility with full precision
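The guide above can be condensed into a small helper; this is just an illustrative restatement of the thresholds listed in the bullets:

def pick_gguf(vram_gb: float) -> str:
    # Thresholds mirror the format selection guide above
    if vram_gb >= 20:
        return "qwen3-vl-8b-instruct-abliterated-f16.gguf"
    if vram_gb >= 12:
        return "qwen3-vl-8b-instruct-abliterated-q8-0.gguf"
    return "qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf"

print(pick_gguf(12))  # -> the Q8_0 file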
Model Specifications
Architecture Details
- Model Type: Vision-Language Transformer (VLM) - Abliterated
- Vision Encoder: Vision Transformer (ViT) with adaptive resolution
- Language Model: Qwen3-8B decoder (safety layers removed)
- Parameters: 8 billion (8B)
- Precision: FP16 (half precision)
- Format: SafeTensors (single merged file)
- Framework: PyTorch / Transformers
- Modification Type: Abliteration (safety guardrail removal)
Input Specifications
- Image Resolution: Adaptive (up to 1024x1024 recommended)
- Image Formats: JPEG, PNG, BMP, WebP
- Text Context: Up to 32K tokens
- Batch Size: Depends on VRAM (typically 1-8 images)
Generation Parameters
- Max New Tokens: 512-2048 (depending on task)
- Temperature: 0.1-0.9 (lower for factual tasks, higher for creative)
- Top-p: 0.8-0.95 (nucleus sampling)
- Top-k: 20-50 (alternative sampling method)
Supported Tasks
- Visual Question Answering (VQA) - Uncensored
- Image Captioning
- Optical Character Recognition (OCR)
- Document Understanding
- Chart and Diagram Analysis
- Visual Reasoning
- Multi-turn Visual Dialogue - Uncensored
- Scene Understanding
- Object Detection and Counting (descriptive)
Performance Tips and Optimization
Memory Optimization
Use FP16 precision (default):
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
torch_dtype=torch.float16,
device_map="auto"
)
INT8 Quantization (reduces VRAM to ~10GB):
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_8bit=True,
llm_int8_threshold=6.0
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
quantization_config=quantization_config,
device_map="auto"
)
INT4 Quantization (reduces VRAM to ~6GB):
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4"
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
quantization_config=quantization_config,
device_map="auto"
)
Inference Optimization
Use Flash Attention 2 (faster attention):
model = Qwen2VLForConditionalGeneration.from_pretrained(
"E:\\huggingface\\qwen3-vl-8b-instruct",
torch_dtype=torch.float16,
attn_implementation="flash_attention_2",
device_map="auto"
)
Enable torch.compile (PyTorch 2.0+):
model = torch.compile(model, mode="reduce-overhead")
Optimize image resolution (see the sketch after this list):
- Use lower resolution (512x512) for faster inference
- Use higher resolution (1024x1024) for detailed OCR and document tasks
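A minimal Pillow sketch for capping resolution before handing an image to the processor (the 1024-pixel limit follows the recommendation above):

from PIL import Image

def load_capped(path: str, max_side: int = 1024) -> Image.Image:
    # Downscale so the longer side is at most max_side; aspect ratio is preserved
    image = Image.open(path).convert("RGB")
    image.thumbnail((max_side, max_side))
    return image

image = load_capped("example_image.jpg")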
Generation Strategy
For factual/OCR tasks (low temperature, near-deterministic):
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.1,
top_p=0.9,
do_sample=True
)
For creative/descriptive tasks:
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.95,
do_sample=True
)
For structured output:
outputs = model.generate(
**inputs,
max_new_tokens=1024,
temperature=0.3,
top_p=0.9,
repetition_penalty=1.1
)
Abliteration Details
What is Abliteration?
Abliteration is a technique for removing safety guardrails from language models by identifying the internal components, typically an activation-space "refusal direction", that drive content filtering and refusal behavior, and neutralizing them in the model's weights. The process:
- Analyzes model activations on refused vs. answered prompts to locate refusal-related components
- Projects out or neutralizes these components while preserving core capabilities
- Results in an "uncensored" model that responds to all queries
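Conceptually, a common implementation estimates the refusal direction from the difference in mean activations between refused and answered prompts, then projects that direction out of the weight matrices that write into the residual stream. A toy sketch of the projection step (illustrative only; not necessarily the exact procedure used to produce this checkpoint):

import torch

def remove_direction(weight: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    # weight: (d_out, d_in) matrix writing into the residual stream
    # refusal_dir: (d_out,) direction estimated from activation differences
    r = refusal_dir / refusal_dir.norm()
    # Subtract the component of the output that lies along r
    return weight - torch.outer(r, r) @ weight

# Toy example with random tensors standing in for real model weights
W = torch.randn(4096, 4096)
r = torch.randn(4096)
W_ablated = remove_direction(W, r)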
Implications of Abliteration:
- ✅ No content filtering or refusal responses
- ✅ Unrestricted responses to sensitive queries
- ⚠️ No built-in safety mechanisms
- ⚠️ User responsible for ethical use and compliance
- ⚠️ May generate harmful, illegal, or unethical content if prompted
Technical Changes:
- Safety alignment layers removed or neutralized
- Refusal mechanisms disabled
- Content filtering bypassed
- Core reasoning and generation capabilities preserved
License
This model is based on Qwen3-VL-8B-Instruct, which is released under the Apache License 2.0.
Important Legal Notice:
- The abliteration process modifies the original model
- Use of this model must comply with the Apache 2.0 license terms
- Users are solely responsible for ethical use and legal compliance
- This model should not be used for illegal, harmful, or unethical purposes
- The original developers are not responsible for misuse of this modified version
You are free to:
- Use the model commercially (with responsibility)
- Modify and distribute the model
- Use for research and production applications
Requirements:
- Provide attribution to Alibaba Cloud and the Qwen team
- Include the Apache 2.0 license text with distributions
- State that this is a modified (abliterated) version
- Take full responsibility for outputs and usage
See the Apache License 2.0 for full terms.
Citation
If you use Qwen3-VL-8B-Instruct (Abliterated) in your research or applications, please cite:
@article{qwen3vl2024,
title={Qwen3-VL: Scaling Vision-Language Models with Enhanced Instruction Following},
author={Qwen Team},
journal={arXiv preprint},
year={2024},
publisher={Alibaba Cloud}
}
Note: This is an abliterated community modification, not an official Qwen model release.
Model Card Contact
- Original Model: Qwen Team, Alibaba Cloud
- Model Type: Vision-Language Model (Instruction-tuned, Abliterated)
- Modification: Community abliteration (uncensored variant)
- Language(s): Multilingual (English, Chinese, and more)
- License: Apache 2.0 (modified version)
Links and Resources
- Original Model Repository: https://github.com/QwenLM/Qwen-VL
- Original Hugging Face Model: https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct
- Qwen Documentation: https://qwen.readthedocs.io/
- Technical Report: https://arxiv.org/abs/qwen3-vl (when published)
- Abliteration Resources: Search for "LLM abliteration" for technique details
Limitations and Considerations
Known Limitations:
- May generate incorrect or hallucinated information about images
- Performance varies with image quality and resolution
- May struggle with very small text or complex layouts
- Limited understanding of highly specialized domain images
- NO SAFETY FILTERS: Will respond to any query without ethical filtering
Ethical Considerations:
- ⚠️ NO CONTENT FILTERING: This model has no built-in safety mechanisms
- ⚠️ USER RESPONSIBILITY: You are fully responsible for ethical use
- ⚠️ POTENTIAL FOR HARM: May generate harmful content if prompted
- ⚠️ LEGAL COMPLIANCE: Ensure use complies with applicable laws
- ⚠️ BIAS AMPLIFICATION: Uncensored models may amplify training data biases
- Validate outputs for critical applications
- Consider privacy implications when processing personal images
- Use responsibly and ethically
Recommended Use Cases:
- Research on AI safety and alignment (studying uncensored model behavior)
- Unrestricted creative content generation
- Analysis of censorship mechanisms in AI models
- Educational purposes (understanding model limitations)
- Applications where content filtering interferes with legitimate use
Not Recommended For:
- Public-facing applications without additional safety layers
- Use by minors or vulnerable populations
- Automated systems without human oversight
- Medical, legal, or safety-critical applications
- Any illegal, harmful, or unethical purposes
- Production systems without additional filtering mechanisms
Required Safeguards:
- Implement application-level content filtering if needed
- Monitor outputs for harmful content
- Provide user warnings about uncensored nature
- Establish clear usage policies and guidelines
- Maintain human oversight for sensitive applications
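One way to implement the first two safeguards is a thin moderation wrapper around generation; the policy check below is a placeholder (a hypothetical is_allowed function standing in for whatever keyword list, classifier, or moderation API you choose):

def is_allowed(text: str) -> bool:
    # Placeholder policy check; replace with a real classifier or moderation API
    blocked_terms = ["example-blocked-term"]
    return not any(term in text.lower() for term in blocked_terms)

def moderated_generate(generate_fn, prompt: str) -> str:
    # generate_fn: any callable that maps a prompt string to model output text
    output = generate_fn(prompt)
    if not is_allowed(output):
        return "[response withheld by application-level filter]"
    return output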
Technical Notes
Single-File Format
This model is distributed as a single merged safetensors file rather than sharded weights:
Advantages:
- Simpler file management (one file vs. multiple shards)
- Easier to move and backup
- Consistent loading process
Considerations:
- Requires sufficient disk I/O bandwidth during loading
- May take longer to initially load compared to parallel shard loading
- Requires ~17 GB of free disk space for the single weights file
Processor Configuration
Since this is a community-modified version, you'll need to use a compatible processor:
# Use the original Qwen2-VL processor for compatibility
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
# Or create a custom processor config if needed
from transformers import Qwen2VLProcessor, Qwen2VLImageProcessor, Qwen2Tokenizer
image_processor = Qwen2VLImageProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
tokenizer = Qwen2Tokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
processor = Qwen2VLProcessor(image_processor=image_processor, tokenizer=tokenizer)
Compatibility Notes
- Compatible with the transformers library, version 4.37.0+
- Requires PyTorch 2.0+ for optimal performance
- Flash Attention 2 requires separate installation: pip install flash-attn
- BitsAndBytes quantization requires: pip install bitsandbytes
Changelog
v1.2 (Current - November 2025)
- Added GGUF format files (F16, Q8_0, Q4_K_M)
- Comprehensive GGUF usage documentation (llama.cpp, Ollama, LM Studio)
- Detailed hardware requirements for each format
- Format selection guide for different use cases
- Updated total repository size to ~46 GB
- Added Python llama-cpp-python examples
- Enhanced deployment flexibility across CPU/GPU systems
v1.1
- Updated README with accurate file information
- Added abliteration details and safety warnings
- Documented single-file merged format
- Added processor configuration guidance
- Enhanced ethical considerations section
v1.0 (Initial)
- Initial abliterated model release
- 16.33 GB single-file safetensors format
- Based on Qwen3-VL-8B-Instruct with safety layers removed
⚠️ FINAL WARNING: This is an uncensored AI model with all safety filters removed. Use responsibly, ethically, and in compliance with all applicable laws. You are solely responsible for how you use this model and any content it generates.