Instructions for using Habibur2/Llama-3.2-1B-Instruct-GGUF with libraries and local apps.

Libraries

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Habibur2/Llama-3.2-1B-Instruct-GGUF",
	filename="llama-3.2-1b-f16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggml-org/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
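
Once `llama-server` is running, it can also be called from Python over its OpenAI-compatible HTTP API. The following is a minimal standard-library sketch, assuming the server listens on `http://localhost:8080` (the base URL used in the Pi and Hermes configurations on this page). The helper names `build_chat_request` and `ask` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumed address of a locally started llama-server.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, model="Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M"):
    """Build (url, body_bytes, headers) for an OpenAI-style chat completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return f"{BASE_URL}/chat/completions", body, headers


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    url, body, headers = build_chat_request(prompt)
    request = urllib.request.Request(url, data=body, headers=headers, method="POST")
    # Raises urllib.error.URLError if no server is listening at BASE_URL.
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"]


# Usage (requires a running llama-server):
# print(ask("What is the capital of France?"))
```

Keeping request construction separate from the network call makes the payload easy to inspect before anything is sent.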


vLLM

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Habibur2/Llama-3.2-1B-Instruct-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Habibur2/Llama-3.2-1B-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Ollama
How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Ollama:
```
ollama run hf.co/Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Habibur2/Llama-3.2-1B-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Habibur2/Llama-3.2-1B-Instruct-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Habibur2/Llama-3.2-1B-Instruct-GGUF to start chatting

Pi

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use Habibur2/Llama-3.2-1B-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Habibur2/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Llama-3.2-1B-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

🦙 Llama-3.2-1B-Instruct-GGUF [Optimized for Edge AI]

🦙 Llama-3.2-1B-Instruct-GGUF [Benchmarked & Verified]

📌 Model Description

This repository contains manually benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.

These models are optimized for Edge AI deployment (Mobile, Raspberry Pi, Laptops) using llama.cpp. Unlike auto-generated quants, each quantization here was evaluated on WikiText-2, so the trade-off between speed and accuracy is measured rather than assumed.

🌟 Exclusive Features

🚀 Hyper-Fast: The Q4_K_M version achieves 42+ tokens/sec generation speed on CPU.
📉 Ultra-Low Memory: The Q4_K_M version runs on devices with under 1 GB of RAM (measured: ~639 MiB).
✅ Verified Quality: Perplexity (PPL) measured on WikiText-2 for every quantization, so the quality loss is quantified.

📊 Benchmark Results (The Science)

Tests were conducted using llama.cpp on a standard CPU setup.

Model Version	Size	Perplexity (PPL)	Quality Loss	Gen Speed (CPU)	Memory Usage
F16 (Original)	2.30 GB	13.99	Baseline	15.73 t/s	~2.4 GB
Q8_0	1.22 GB	14.01	~0.1% (Negligible)	28.43 t/s	~1.3 GB
Q4_K_M	762 MB	14.49	~3.5% (Acceptable)	42.60 t/s 🚀	~640 MB

Conclusion: The Q4_K_M model offers the best trade-off. It generates about 2.7x faster than the F16 original (42.60 vs. 15.73 t/s) at the cost of roughly 3.5% higher perplexity.
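
The relative figures in this section can be reproduced from the table's raw numbers. The sketch below is only arithmetic on the published values. The name `compare_to_baseline` is illustrative, and "quality loss" is assumed here to mean the relative perplexity increase over F16.

```python
# Figures copied from the benchmark table above.
BENCHMARKS = {
    "F16": {"ppl": 13.99, "tokens_per_s": 15.73},
    "Q8_0": {"ppl": 14.01, "tokens_per_s": 28.43},
    "Q4_K_M": {"ppl": 14.49, "tokens_per_s": 42.60},
}


def compare_to_baseline(name, baseline="F16"):
    """Return (perplexity increase in percent, speedup factor) versus the baseline."""
    base = BENCHMARKS[baseline]
    quant = BENCHMARKS[name]
    ppl_increase_pct = (quant["ppl"] - base["ppl"]) / base["ppl"] * 100
    speedup = quant["tokens_per_s"] / base["tokens_per_s"]
    return ppl_increase_pct, speedup


for name in ("Q8_0", "Q4_K_M"):
    ppl_pct, speedup = compare_to_baseline(name)
    print(f"{name}: perplexity +{ppl_pct:.2f}%, {speedup:.2f}x the F16 speed")
```

With the table's values this gives about +0.14% and 1.81x for Q8_0, and about +3.57% and 2.71x for Q4_K_M.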

📥 Which File to Download?

Filename	Description	Use Case
`llama-3.2-1b-q4_k_m.gguf`	🏆 Recommended. Balanced speed & accuracy.	Chatbots, Android/iOS Apps, RAG
`llama-3.2-1b-q8_0.gguf`	High precision, larger size.	Research, Creative Writing
`llama-3.2-1b-f16.gguf`	Uncompressed weights.	Fine-Tuning, Conversion
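
The choice can also be made programmatically from a memory budget. This is a rough sketch, assuming the approximate memory figures from the benchmark table and a hypothetical 20% safety headroom. The function `pick_file` and the headroom value are illustrative, not part of this repository.

```python
# Approximate memory usage per file in MB, taken from the benchmark table
# ("~2.4 GB" and "~1.3 GB" converted at 1 GB = 1000 MB).
# Ordered from highest precision (largest) to lowest.
MEMORY_MB = [
    ("llama-3.2-1b-f16.gguf", 2400),
    ("llama-3.2-1b-q8_0.gguf", 1300),
    ("llama-3.2-1b-q4_k_m.gguf", 640),
]


def pick_file(available_mb, headroom=1.2):
    """Return the highest-precision file that fits, or None if nothing fits.

    headroom is an assumed safety factor for context buffers and the
    rest of the process; tune it for your device.
    """
    for filename, needed_mb in MEMORY_MB:
        if needed_mb * headroom <= available_mb:
            return filename
    return None


print(pick_file(1000))  # a device with about 1 GB free
print(pick_file(4000))  # a laptop with about 4 GB free
```

This picks by memory only. The table above recommends Q4_K_M for most uses because of its speed, so treat the result as an upper bound on precision, not a recommendation.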

💻 Quick Usage

Python (Google Colab / Local)

# pip install llama-cpp-python huggingface_hub

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="Habibur2/Llama-3.2-1B-Instruct-GGUF",
    filename="llama-3.2-1b-q4_k_m.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,
    verbose=False
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Explain AI in one sentence."}]
)
print(response['choices'][0]['message']['content'])
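
For chat-style interfaces it is often nicer to print tokens as they are generated. The sketch below assumes an `llm` object created as in the example above and uses the `stream=True` option of `create_chat_completion`. The helper name `stream_reply` is illustrative.

```python
def stream_reply(llm, prompt):
    """Print text pieces as they arrive and return the full reply."""
    pieces = []
    chunks = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # yields incremental chunks instead of one response dict
    )
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")
        if text:  # some chunks carry only the role or a finish marker
            pieces.append(text)
            print(text, end="", flush=True)
    print()
    return "".join(pieces)


# Usage (assumes `llm` from the example above):
# stream_reply(llm, "Hello! Explain AI in one sentence.")
```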

Uploaded by Habibur2 | Benchmarked with WikiText-2 & llama-bench

📊 Detailed Benchmark Results (WikiText-2)

Tests were conducted on llama.cpp (CPU Backend). The results show that Q8_0 quantization has a negligible impact on model quality while roughly halving memory usage.

Model Version	VRAM/RAM Usage	Perplexity (Lower is Better)	Perplexity Change
F16 (Original)	2,357 MB	13.99	Baseline (0%)
Q8_0	1,252 MB	14.01	+0.02 (Negligible)

Analysis: The Q8_0 version raises perplexity by only 0.02 (about 0.14%) relative to F16 while using 47% less memory.

📌 Model Description

This repository contains verified and benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.

These models are optimized for Edge AI deployment (Mobile, Raspberry Pi, Laptops) using llama.cpp. Unlike auto-generated quants, these weights have been manually benchmarked to ensure the best balance between speed and accuracy.

🌟 Why use this Repository?

🚀 Real-World Benchmarks: Performance data provided for informed decision-making.
⚡ Ultra-Fast Inference: The Q4_K_M version achieves 40+ tokens/sec on standard CPUs.
📉 Memory Efficient: Runs comfortably on devices with < 2GB RAM.

📊 Benchmark & Performance Data

Tests were conducted using llama.cpp on a standard CPU setup (8 threads).

Quantization	Size	Size Reduction	Perplexity (PPL)	Speed (CPU)	Recommended For
F16 (Original)	2,300 MB	0%	13.99 (Baseline)	15.73 t/s	Research / GPU
Q8_0	1,220 MB	47%	14.01 (Low Loss)	28.43 t/s	High Accuracy Needs
Q4_K_M	762 MB	67%	14.49 (Balanced)	42.60 t/s 🚀	Edge / Real-time Chat

Note: Speed may vary depending on your hardware. GPU offloading will significantly increase these numbers.
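
To check throughput on your own hardware, a rough measurement can be taken with llama-cpp-python. This sketch assumes an `llm` object like the ones in the usage examples on this page and reads the `usage` field of the response. The function name `measure_generation_speed` is illustrative. The timing includes prompt processing, so it will read somewhat lower than a dedicated benchmark tool such as llama-bench.

```python
import time


def measure_generation_speed(llm, prompt, max_tokens=128):
    """Return generated tokens per second for one chat completion."""
    start = time.perf_counter()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    generated = response["usage"]["completion_tokens"]
    return generated / elapsed if elapsed > 0 else 0.0


# Usage (assumes `llm` was created with llama_cpp.Llama):
# print(f"{measure_generation_speed(llm, 'Explain AI briefly.'):.2f} t/s")
```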

📥 Which File Should I Download?

Filename	Description
`llama-3.2-1b-q4_k_m.gguf`	🏆 Best Choice. High speed, low memory, small quality loss (~3.5% higher perplexity).
`llama-3.2-1b-q8_0.gguf`	Near-original quality. Use if you have 4GB+ RAM.
`llama-3.2-1b-f16.gguf`	Uncompressed weights. Use for further conversion or research.

💻 Quick Usage Guide

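1. Build llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git && cd llama.cpp
cmake -B build && cmake --build build -j --target llama-cli

2. Run in CLI (Chat Mode)

# Assumes the GGUF file has been downloaded into the current directory
./build/bin/llama-cli -m llama-3.2-1b-q4_k_m.gguf -cnv -p "You are a helpful assistant."

3. Python (using llama-cpp-python)

# pip install llama-cpp-python
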
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-q4_k_m.gguf",
    chat_format="llama-3",
    n_gpu_layers=-1 # Set to 0 if no GPU
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, explain Quantum Physics in simple terms."}]
)
print(response['choices'][0]['message']['content'])
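
The example above sends a single message. For a multi-turn conversation the full message history has to be passed on every call. A minimal sketch, assuming the same `llm` object; the helper name `chat_turn` is illustrative.

```python
def chat_turn(llm, history, user_text):
    """Append the user message, query the model, and record its reply in history."""
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history)
    reply = response["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply


# Usage (assumes `llm` from the example above):
# history = [{"role": "system", "content": "You are a helpful assistant."}]
# print(chat_turn(llm, history, "Hello, explain Quantum Physics in simple terms."))
# print(chat_turn(llm, history, "Now summarize that in one sentence."))
```

The history grows with every turn, so long conversations will eventually exceed the context window and need trimming.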

Format: GGUF
Model size: 1B params
Architecture: llama
Available quantizations: 4-bit, 8-bit, 16-bit
Base model: meta-llama/Llama-3.2-1B-Instruct