Llama-3.2-1B-Instruct-GGUF [Optimized for Edge AI]
Llama-3.2-1B-Instruct-GGUF [Benchmarked & Verified]
Model Description
This repository contains manually benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.
These models are optimized for Edge AI deployment (Mobile, Raspberry Pi, Laptops) using llama.cpp. Unlike auto-generated quants, these weights have been tested against WikiText-2 to ensure the best balance between speed and accuracy.
Exclusive Features
- Hyper-Fast: The Q4_K_M version reaches 42+ tokens/sec generation speed on CPU.
- Ultra-Low Memory: Runs comfortably on devices with < 1 GB RAM (measured: ~639 MiB).
- Verified Quality: Perplexity (PPL) measured on WikiText-2 for every quantization level.
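For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower values mean the model is less "surprised" by the text. The following is a minimal sketch of that formula only, not the llama.cpp evaluation pipeline; the `token_logprobs` values are made-up illustration data, not output from these models.

```python
import math

def perplexity(token_logprobs):
    """Return exp(-mean(log p)) for a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Hypothetical data: every token was predicted with probability 0.25,
# so the model is as uncertain as a uniform choice among 4 options.
token_logprobs = [math.log(0.25)] * 8
print(round(perplexity(token_logprobs), 2))  # 4.0
```

A model that assigned probability 1.0 to every token would score a perplexity of 1, the theoretical minimum.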
Benchmark Results (The Science)
Tests were conducted using llama.cpp on a standard CPU setup.
| Model Version | Size | Perplexity (PPL) | Quality Loss | Gen Speed (CPU) | Memory Usage |
|---|---|---|---|---|---|
| F16 (Original) | 2.30 GB | 13.99 | Baseline | 15.73 t/s | ~2.4 GB |
| Q8_0 | 1.22 GB | 14.01 | ~0.1% (Negligible) | 28.43 t/s | ~1.3 GB |
| Q4_K_M | 762 MB | 14.49 | ~3.5% (Acceptable) | 42.60 t/s | ~640 MB |
Conclusion: The Q4_K_M model offers the best trade-off, running 2.7x faster than the original with minimal quality loss.
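The "Quality Loss" and speed-up figures can be reproduced from the table with simple arithmetic. The sketch below assumes quality loss is defined as the relative increase in perplexity over F16, which matches the table's rounded values; the function name `compare_to_baseline` is just for illustration.

```python
# Perplexity and generation speed copied from the benchmark table above.
results = {
    "F16":    {"ppl": 13.99, "tok_per_s": 15.73},
    "Q8_0":   {"ppl": 14.01, "tok_per_s": 28.43},
    "Q4_K_M": {"ppl": 14.49, "tok_per_s": 42.60},
}

def compare_to_baseline(results, baseline="F16"):
    """Return relative PPL increase (%) and speed-up factor versus the baseline."""
    base = results[baseline]
    summary = {}
    for name, r in results.items():
        summary[name] = {
            "ppl_increase_pct": (r["ppl"] - base["ppl"]) / base["ppl"] * 100,
            "speedup": r["tok_per_s"] / base["tok_per_s"],
        }
    return summary

summary = compare_to_baseline(results)
for name, s in summary.items():
    print(f"{name}: +{s['ppl_increase_pct']:.2f}% PPL, {s['speedup']:.2f}x speed")
```

Under that definition Q4_K_M comes out at roughly +3.6% perplexity and about 2.7x the F16 generation speed.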
Which File to Download?
| Filename | Description | Use Case |
|---|---|---|
| llama-3.2-1b-q4_k_m.gguf | Recommended. Balanced speed & accuracy. | Chatbots, Android/iOS Apps, RAG |
| llama-3.2-1b-q8_0.gguf | High precision, larger size. | Research, Creative Writing |
| llama-3.2-1b-f16.gguf | Uncompressed weights. | Fine-Tuning, Conversion |
Quick Usage
Python (Google Colab / Local)
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the quantized model file from the Hub (cached after the first call).
model_path = hf_hub_download(
    repo_id="Habibur2/Llama-3.2-1B-Instruct-GGUF",
    filename="llama-3.2-1b-q4_k_m.gguf"
)

# Load the model with a 2048-token context window.
llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    verbose=False
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Explain AI in one sentence."}]
)
print(response['choices'][0]['message']['content'])
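The call above is a single-turn exchange. For a multi-turn chat, the usual pattern is to keep a running message list and append each reply to it. The sketch below is one possible way to do that; `chat_turn` and `fake_complete` are hypothetical helpers introduced here, and the stub stands in for the real model so the example runs without downloading weights.

```python
def chat_turn(complete, history, user_text):
    """Append a user message, call `complete`, then record and return the reply.

    `complete` is any callable that accepts `messages=[...]` and returns an
    OpenAI-style response dict, for example `llm.create_chat_completion`.
    """
    history.append({"role": "user", "content": user_text})
    response = complete(messages=history)
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

# Stand-in for a real model (assumption: same response shape as shown above).
def fake_complete(messages):
    text = f"(reply to: {messages[-1]['content']})"
    return {"choices": [{"message": {"role": "assistant", "content": text}}]}

history = []
print(chat_turn(fake_complete, history, "Hello!"))
print(len(history))  # user message + assistant reply
```

With a loaded model, passing `llm.create_chat_completion` in place of `fake_complete` should work the same way. Keep in mind that the whole history counts against `n_ctx`.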
Uploaded by Habibur2 | Benchmarked with WikiText-2 & llama-bench
Detailed Benchmark Results (WikiText-2)
Tests were conducted on llama.cpp (CPU Backend). The results show that quantization has negligible impact on model quality while significantly reducing memory usage.
| Model Version | VRAM/RAM Usage | Perplexity (Lower is Better) | Accuracy Loss |
|---|---|---|---|
| F16 (Original) | 2,357 MB | 13.99 | Baseline (0%) |
| Q8_0 | 1,252 MB | 14.01 | +0.02 PPL (Negligible) |
Analysis: The Q8_0 version's perplexity is within about 0.14% of the original (14.01 vs. 13.99) while using 47% less memory.
Model Description
This repository contains verified and benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.
These models are optimized for Edge AI deployment (Mobile, Raspberry Pi, Laptops) using llama.cpp. Unlike auto-generated quants, these weights have been manually benchmarked to ensure the best balance between speed and accuracy.
Why use this Repository?
- Real-World Benchmarks: Performance data provided for informed decision-making.
- Ultra-Fast Inference: The Q4_K_M version achieves 40+ tokens/sec on standard CPUs.
- Memory Efficient: Runs comfortably on devices with < 2 GB RAM.
Benchmark & Performance Data
Tests were conducted using llama.cpp on a standard CPU setup (8 threads).
| Quantization | Size (MB) | Compression | Perplexity (PPL) | Speed (CPU) | Recommended For |
|---|---|---|---|---|---|
| F16 (Original) | 2,300 MB | 0% | Baseline | 15.73 t/s | Research / GPU |
| Q8_0 | 1,220 MB | 47% | Low Loss | 28.43 t/s | High Accuracy Needs |
| Q4_K_M | 762 MB | 68% | Balanced | 42.60 t/s | Edge / Real-time Chat |
Note: Speed may vary depending on your hardware. GPU offloading will significantly increase these numbers.
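To get a feel for what these throughput numbers mean in practice, they can be converted into a rough time-to-generate estimate. This is a back-of-the-envelope sketch: the 256-token reply length is an arbitrary assumption, and the estimate ignores prompt processing and model load time.

```python
# Generation speeds (tokens/second) copied from the table above.
speeds = {"F16": 15.73, "Q8_0": 28.43, "Q4_K_M": 42.60}

def generation_seconds(num_tokens, tokens_per_second):
    """Estimate seconds needed to generate `num_tokens` at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second

reply_tokens = 256  # hypothetical reply length
for name, tps in speeds.items():
    print(f"{name}: {generation_seconds(reply_tokens, tps):.1f} s")
```

On the benchmark machine this works out to roughly 16 s for F16 versus about 6 s for Q4_K_M for the same reply length.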
Which File Should I Download?
| Filename | Description |
|---|---|
| llama-3.2-1b-q4_k_m.gguf | Best Choice. High speed, low memory, negligible quality loss. |
| llama-3.2-1b-q8_0.gguf | Near-original quality. Use if you have 4GB+ RAM. |
| llama-3.2-1b-f16.gguf | Uncompressed weights. Use for further conversion or research. |
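Quick Usage Guide
1. Install llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp && cmake -B build && cmake --build build -j --target llama-cli
2. Run in CLI (Chat Mode)
# Assumes the .gguf file has been downloaded into the current directory.
./build/bin/llama-cli -m llama-3.2-1b-q4_k_m.gguf -cnv -p "You are a helpful assistant."
3. Python (using llama-cpp-python)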
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-q4_k_m.gguf",
    chat_format="llama-3",
    n_gpu_layers=-1  # Set to 0 if no GPU
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, explain Quantum Physics in simple terms."}]
)
print(response['choices'][0]['message']['content'])
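The `chat_format="llama-3"` argument tells llama-cpp-python to wrap the messages in Llama 3's special-token layout before they reach the model. As a rough illustration only, the sketch below renders that layout by hand. `format_llama3_prompt` is a hypothetical helper, not part of llama-cpp-python, and the chat template embedded in the GGUF file may differ in details such as extra system text.

```python
def format_llama3_prompt(messages):
    """Render chat messages in a simplified Llama 3 instruct layout."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content'].strip()}<|eot_id|>"
        )
    # Open an assistant header so the model continues with its reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = format_llama3_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
print(prompt)
```

In normal use there is no need to build this string yourself; `create_chat_completion` applies the chat format for you.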
Base model: meta-llama/Llama-3.2-1B-Instruct