EEVE-VSS-SMH-BNB-4bit
4-bit Quantized Version
Model Description
This model is a BitsAndBytes NF4 4-bit quantized version of MyeongHo0621/eeve-vss-smh.
Key Features
- ✅ Low-VRAM Support: Works on GTX-series GPUs with 6GB VRAM
- ✅ 4-bit Quantization: NF4 (NormalFloat4) with minimal quality loss (~1-2%)
- ✅ High-Quality Korean: Maintains the original model's performance
Quick Start
Installation
pip install transformers torch bitsandbytes accelerate
Note: the bitsandbytes library is required.
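To confirm the environment before loading the model, a quick check along these lines can help (a minimal sketch using standard PyTorch/bitsandbytes calls; adapt to your setup):
import torch
import bitsandbytes  # raises ImportError if the library is missing

print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")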
Basic Usage
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
# 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """
# Generate
user_input = "Implement Fibonacci sequence in Python"
prompt = create_prompt(user_input)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
Alternative: Using torch.dtype Directly
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# The quantization config saved with this checkpoint is applied automatically;
# torch_dtype sets the compute dtype for the non-quantized operations
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
Simplified Method (Auto-load quantization config)
from transformers import AutoModelForCausalLM, AutoTokenizer
# Automatically loads saved quantization settings
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
System Requirements
Minimum Specifications
| Component | Minimum | Recommended |
|---|---|---|
| GPU | GTX 1660 (6GB) | RTX 3060 (12GB) |
| VRAM | 4GB | 6GB+ |
| RAM | 8GB | 16GB+ |
| CUDA | 11.0+ | 12.0+ |
Tested Environments
- ✅ GTX 1660 (6GB VRAM) - Works
- ✅ RTX 2060 (6GB VRAM) - Works
- ✅ RTX 3060 (12GB VRAM) - Good
- ✅ RTX 3090 (24GB VRAM) - Excellent
- ✅ H100 (80GB VRAM) - Overkill
Quantization Details
BitsAndBytes NF4
Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization
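As a sanity check, the quantization settings stored with the checkpoint can be inspected after loading (a minimal sketch; assumes the model object from the Basic Usage example above):
# Quantization settings recorded in the model config
print(model.config.quantization_config)
# Approximate size of the quantized weights in memory
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")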
Performance Comparison
| Version | Model Size | VRAM Usage | Quality Loss | Inference Speed |
|---|---|---|---|---|
| FP16 Original | ~21GB | ~21GB | 0% | ⚡⚡⚡⚡ |
| BNB 4-bit | ~5.5GB | ~3.5GB | 1-2% | ⚡⚡⚡ |
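To reproduce rough VRAM numbers on your own hardware, peak allocation can be measured around a generation call (a minimal sketch; assumes the model and inputs objects from the Basic Usage example, and actual figures vary with GPU, batch size, and sequence length):
import torch

torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs, max_new_tokens=128)
print(f"Peak VRAM during generation: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")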
Recommended Generation Parameters
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}
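The dictionary unpacks directly into generate. If the tokenizer defines no pad token (common for causal LMs), falling back to the EOS token id is a reasonable default; this fallback is an assumption, not part of the original configuration:
# Fall back to EOS as the pad token when none is defined (assumption)
if generation_config["pad_token_id"] is None:
    generation_config["pad_token_id"] = tokenizer.eos_token_id

outputs = model.generate(**inputs, **generation_config)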
Parameter Guide by Use Case
| Use Case | Temperature | Top P | Notes |
|---|---|---|---|
| Factual Answers | 0.1-0.3 | 0.8-0.9 | Fact-based questions |
| Balanced | 0.5-0.7 | 0.85-0.95 | General usage |
| Creative | 0.8-1.0 | 0.9-1.0 | Stories, poems |
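To switch between these presets programmatically, a small helper like the following mirrors the table; the preset names and midpoint values are illustrative assumptions, not part of the model card:
# Hypothetical presets derived from the table above
PRESETS = {
    "factual":  {"temperature": 0.2, "top_p": 0.85},   # fact-based questions
    "balanced": {"temperature": 0.6, "top_p": 0.9},    # general usage
    "creative": {"temperature": 0.9, "top_p": 0.95},   # stories, poems
}

def sampling_kwargs(use_case: str, max_new_tokens: int = 512) -> dict:
    return {"max_new_tokens": max_new_tokens, "do_sample": True, **PRESETS[use_case]}

outputs = model.generate(**inputs, **sampling_kwargs("factual"))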
Example Outputs
Informal → Formal Speech Conversion
Input:
What is WMS? (asked informally, in Korean)
Output:
WMS stands for Warehouse Management System. It is a software system that automates and optimizes warehouse processes such as inventory tracking, inbound/outbound management, picking, and packing. It is used to run warehouses efficiently and provides real-time inventory visibility and improved work productivity.
Code Generation
Input:
Implement a Python function to reverse a list
Output:
# High-quality code generation like the original model
my_list = [1, 2, 3, 4, 5]
# Method 1: reverse()
my_list.reverse()
# Method 2: slicing
reversed_list = my_list[::-1]
# Method 3: reversed()
reversed_list = list(reversed(my_list))
Original Model Information
This is a quantized version of:
- Original Model: MyeongHo0621/eeve-vss-smh
- Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
- Training Data: 100K+ high-quality Korean instruction examples
- LoRA Config: r=64, alpha=128, dropout=0.05
For the detailed training process, see the original model page.
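For reference, the LoRA settings listed above correspond roughly to the following peft configuration (a sketch only; target modules and other training details are not documented here and are left out):
from peft import LoraConfig

# Values taken from the original model card; everything else is left at defaults
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)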
Troubleshooting
CUDA Out of Memory
# Reduce max_new_tokens to lower peak memory during generation
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}
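If out-of-memory errors persist, releasing cached allocations between runs and watching current usage can also help (a minimal sketch using standard PyTorch utilities):
import gc
import torch

gc.collect()
torch.cuda.empty_cache()              # return cached blocks to the allocator
torch.cuda.reset_peak_memory_stats()  # start a fresh peak-memory measurement
print(f"Currently allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")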
bitsandbytes Installation Error
# Check CUDA version
nvidia-smi
# CUDA 11.x
pip install bitsandbytes
# CUDA 12.x
pip install bitsandbytes --upgrade
Slow Generation Speed
- 4-bit quantization may be slightly slower than FP16
- For faster speed, use FP16 original model
Use Cases
✅ Suitable For
- Low-end GPUs (GTX 1660, RTX 2060)
- VRAM-constrained environments
- Local development and testing
- Personal projects
- Research and education
⚠️ Not Recommended For
- Production requiring ultra-fast inference
- Environments with sufficient high-end GPUs → use the FP16 original
Limitations
- ~1-2% quality loss due to 4-bit quantization
- Slightly slower inference than FP16
- Requires the bitsandbytes library
- Windows may require additional setup for bitsandbytes
License
- Model License: CC-BY-NC-SA-4.0
- Base Model: EEVE-Korean-Instruct-10.8B-v1.0
- Commercial Use: Limited (see license)
Citation
@misc{eeve-vss-smh-bnb-4bit-2025,
author = {MyeongHo0621},
title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
note = {4-bit quantized version using BitsAndBytes NF4}
}
Acknowledgments
- Original Model: MyeongHo0621/eeve-vss-smh
- Base Model: Yanolja EEVE
- Quantization Library: BitsAndBytes
- Framework: Hugging Face Transformers
Related Models
| Model | Size | VRAM | Use Case |
|---|---|---|---|
| eeve-vss-smh | 21GB | 21GB | High-end GPUs |
| eeve-vss-smh-bnb-4bit | 5.5GB | 3.5GB | Low-end GPUs ⭐ |
Contact
- Original Model: eeve-vss-smh
- GitHub: tuned_solar
Quantization Date: 2025-10-11
Method: BitsAndBytes NF4 4-bit
Status: Ready for Low-VRAM Deployment 🚀