EEVE-VSS-SMH-BNB-4bit

4-bit Quantized Version


English

Model Description

This model is a BitsAndBytes NF4 4-bit quantized version of MyeongHo0621/eeve-vss-smh.

Key Features

  • โœ… Low-VRAM Support: Works on GTX series GPUs with 6GB VRAM
  • โœ… 4-bit Quantization: NF4 (NormalFloat4) with minimal quality loss (1-2%)
  • โœ… High-Quality Korean: Maintains original model performance

Quick Start

Installation

pip install transformers torch bitsandbytes accelerate

Required: bitsandbytes library is mandatory!
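
Before loading the model, it can help to confirm that a CUDA build of PyTorch and the bitsandbytes library are actually importable. A minimal sanity check (assumes a CUDA-enabled environment):

import torch
import bitsandbytes as bnb  # raises ImportError if bitsandbytes is missing

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("bitsandbytes version:", bnb.__version__)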

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# Generate
user_input = "Implement Fibonacci sequence in Python"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Alternative: Using torch.dtype Directly

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with explicit dtype (automatic 4-bit loading)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

Simplified Method (Auto-load quantization config)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Automatically loads saved quantization settings
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

System Requirements

Minimum Specifications

Component | Minimum | Recommended
GPU | GTX 1660 (6GB) | RTX 3060 (12GB)
VRAM | 4GB | 6GB+
RAM | 8GB | 16GB+
CUDA | 11.0+ | 12.0+

Tested Environments

  • โœ… GTX 1660 (6GB VRAM) - Works
  • โœ… RTX 2060 (6GB VRAM) - Works
  • โœ… RTX 3060 (12GB VRAM) - Good
  • โœ… RTX 3090 (24GB VRAM) - Excellent
  • โœ… H100 (80GB VRAM) - Overkill

Quantization Details

BitsAndBytes NF4

Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization
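
To verify that the 4-bit quantization is actually in effect after loading, one option is to check the model's memory footprint and count the quantized linear layers. A minimal sketch, assuming the `model` object from the Quick Start example above:

# Assumes `model` has already been loaded as in Basic Usage
from bitsandbytes.nn import Linear4bit

print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")

# Count linear layers that were replaced with 4-bit layers
n_4bit = sum(isinstance(m, Linear4bit) for m in model.modules())
print(f"4-bit linear layers: {n_4bit}")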

Performance Comparison

Version | Model Size | VRAM Usage | Quality Loss | Inference Speed
FP16 Original | ~21GB | ~21GB | 0% | ⚡⚡⚡⚡
BNB 4-bit | ~5.5GB | ~3.5GB | 1-2% | ⚡⚡⚡
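
The VRAM figures above vary with context length and generation settings. To measure a rough peak on your own GPU, a sketch like the following can be used (assumes `model` and `tokenizer` from the Quick Start, on a single CUDA device):

import torch

torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("์•ˆ๋…•ํ•˜์„ธ์š”", return_tensors="pt").to(model.device)
_ = model.generate(**inputs, max_new_tokens=64)
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")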

Recommended Generation Parameters

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}
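
The dictionary can be unpacked directly into generate(). A short usage sketch, assuming `model`, `tokenizer`, and `prompt` from the Quick Start:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, **generation_config)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))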

Parameter Guide by Use Case

Use Case | Temperature | Top P | Notes
Factual Answers | 0.1-0.3 | 0.8-0.9 | Fact-based questions
Balanced | 0.5-0.7 | 0.85-0.95 | General usage
Creative | 0.8-1.0 | 0.9-1.0 | Stories, poems
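
If you switch between these presets programmatically, a small helper along these lines can keep the values in one place (the helper name and the exact values picked from the ranges above are illustrative, not part of the model):

# Illustrative presets derived from the table above; names and values are examples
GENERATION_PRESETS = {
    "factual":  {"temperature": 0.2, "top_p": 0.85},
    "balanced": {"temperature": 0.6, "top_p": 0.9},
    "creative": {"temperature": 0.9, "top_p": 0.95},
}

def preset_kwargs(use_case: str, max_new_tokens: int = 512) -> dict:
    """Return generate() keyword arguments for a given use case."""
    return {"max_new_tokens": max_new_tokens, "do_sample": True, **GENERATION_PRESETS[use_case]}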

Example Outputs

Code Generation

Input:

Implement a Python function to reverse a list

Output:

# High-quality code generation like the original model
my_list = [1, 2, 3, 4, 5]

# Method 1: reverse()
my_list.reverse()

# Method 2: slicing
reversed_list = my_list[::-1]

# Method 3: reversed()
reversed_list = list(reversed(my_list))

Original Model Information

This is a quantized version of:

  • Original model: MyeongHo0621/eeve-vss-smh
  • Base model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  • Training data: 100K+ high-quality Korean instruction examples
  • LoRA settings: r=64, alpha=128, dropout=0.05

For the detailed training process, see the original model page.
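
For reference, the reported LoRA settings correspond roughly to a PEFT configuration like the one below (a sketch only; target modules and other training details are not documented here and are omitted):

# Approximate PEFT configuration matching the reported LoRA settings
# (target_modules omitted; see the original model page for the full recipe)
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)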

Troubleshooting

CUDA Out of Memory

# Generate fewer new tokens per request
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}
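
If reducing max_new_tokens is not enough, freeing cached GPU memory between generations can also help. A minimal sketch:

import gc
import torch

gc.collect()
torch.cuda.empty_cache()  # release cached blocks back to the driver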

bitsandbytes Installation Error

# Check CUDA version
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

Slow Generation Speed

  • 4-bit quantization may be slightly slower than FP16
  • If you need faster generation, use the original FP16 model

Use Cases

โœ… Suitable For

  • Low-end GPUs (GTX 1660, RTX 2060)
  • VRAM-constrained environments
  • Local development and testing
  • Personal projects
  • Research and education

โš ๏ธ Not Recommended For

  • Production requiring ultra-fast inference
  • Environments with sufficient high-end GPUs โ†’ Use FP16 original

Limitations

  • ~1-2% quality loss due to 4-bit quantization
  • Slightly slower inference than FP16
  • Requires bitsandbytes library
  • Windows may require additional setup for bitsandbytes

License

  • Model license: CC-BY-NC-SA-4.0
  • Base model: EEVE-Korean-Instruct-10.8B-v1.0
  • Commercial use: Restricted (see license)

Citation

@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}

Acknowledgments

  • Original model: MyeongHo0621/eeve-vss-smh
  • Base model: Yanolja EEVE
  • Quantization library: BitsAndBytes
  • Framework: Hugging Face Transformers

Related Models

Model | Size | VRAM | Use Case
eeve-vss-smh | 21GB | 21GB | High-end GPUs
eeve-vss-smh-bnb-4bit | 5.5GB | 3.5GB | Low-end GPUs ⭐

Contact


Quantization Date: 2025-10-11
Method: BitsAndBytes NF4 4-bit
Status: Ready for Low-VRAM Deployment ๐Ÿš€


Korean

Model Description

This model is a BitsAndBytes NF4 4-bit quantized version of MyeongHo0621/eeve-vss-smh.

Key Features

  • ✅ Low-VRAM Support: Runs on GTX-series GPUs with 6GB VRAM
  • ✅ 4-bit Quantization: NF4 (NormalFloat4) with minimal quality loss (1-2%)
  • ✅ High-Quality Korean: Maintains the original model's performance

Quick Start

Installation

pip install transformers torch bitsandbytes accelerate

Required: the bitsandbytes library is mandatory!

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# ๋ชจ๋ธ ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# Generate
user_input = "ํŒŒ์ด์ฌ์œผ๋กœ ํ”ผ๋ณด๋‚˜์น˜ ์ˆ˜์—ด ๊ตฌํ˜„ํ•ด์ค˜"  # "Implement the Fibonacci sequence in Python"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Alternative: Using torch.dtype Directly

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with explicit dtype (4-bit loading happens automatically)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

Simplified Method (Auto-load quantization config)

from transformers import AutoModelForCausalLM, AutoTokenizer

# ์ €์žฅ๋œ ์–‘์žํ™” ์„ค์ •์„ ์ž๋™์œผ๋กœ ๋กœ๋“œ
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

System Requirements

Minimum Specifications

Component | Minimum | Recommended
GPU | GTX 1660 (6GB) | RTX 3060 (12GB)
VRAM | 4GB | 6GB+
RAM | 8GB | 16GB+
CUDA | 11.0+ | 12.0+

ํ…Œ์ŠคํŠธ๋œ ํ™˜๊ฒฝ

  • โœ… GTX 1660 (6GB VRAM) - ์‹คํ–‰ ๊ฐ€๋Šฅ
  • โœ… RTX 2060 (6GB VRAM) - ์‹คํ–‰ ๊ฐ€๋Šฅ
  • โœ… RTX 3060 (12GB VRAM) - ์—ฌ์œ ์žˆ์Œ
  • โœ… RTX 3090 (24GB VRAM) - ๋งค์šฐ ์—ฌ์œ ์žˆ์Œ
  • โœ… H100 (80GB VRAM) - ์˜ค๋ฒ„ํ‚ฌ

์–‘์žํ™” ์„ธ๋ถ€์‚ฌํ•ญ

BitsAndBytes NF4

Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization

Performance Comparison

Version | Model Size | VRAM Usage | Quality Loss | Inference Speed
FP16 Original | ~21GB | ~21GB | 0% | ⚡⚡⚡⚡
BNB 4-bit | ~5.5GB | ~3.5GB | 1-2% | ⚡⚡⚡

์ถ”์ฒœ ์ƒ์„ฑ ํŒŒ๋ผ๋ฏธํ„ฐ

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}

์šฉ๋„๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ

์šฉ๋„ Temperature Top P ์„ค๋ช…
์ •ํ™•ํ•œ ๋‹ต๋ณ€ 0.1-0.3 0.8-0.9 ์‚ฌ์‹ค ๊ธฐ๋ฐ˜ ์งˆ๋ฌธ
๊ท ํ˜• ๋‹ต๋ณ€ 0.5-0.7 0.85-0.95 ์ผ๋ฐ˜์  ์‚ฌ์šฉ
์ฐฝ์˜์  ๋‹ต๋ณ€ 0.8-1.0 0.9-1.0 ์Šคํ† ๋ฆฌ, ์‹œ ๋“ฑ

Example Outputs

Casual Question → Polite Answer

Input:

WMS๊ฐ€ ๋ญ์•ผ? ("What is WMS?")

Output:

WMS๋Š” Warehouse Management System์˜ ์•ฝ์ž๋กœ, ์ฐฝ๊ณ  ๊ด€๋ฆฌ ์‹œ์Šคํ…œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
์žฌ๊ณ  ์ถ”์ , ์ž…์ถœ๊ณ  ๊ด€๋ฆฌ, ํ”ผํ‚น, ํŒจํ‚น ๋“ฑ์˜ ๋ฌผ๋ฅ˜ ํ”„๋กœ์„ธ์Šค๋ฅผ ์ž๋™ํ™”ํ•˜๊ณ  ์ตœ์ ํ™”ํ•˜๋Š”
์†Œํ”„ํŠธ์›จ์–ด ์‹œ์Šคํ…œ์ž…๋‹ˆ๋‹ค. ํšจ์œจ์ ์ธ ์ฐฝ๊ณ  ์šด์˜์„ ์œ„ํ•ด ์‚ฌ์šฉ๋˜๋ฉฐ, ์‹ค์‹œ๊ฐ„ ์žฌ๊ณ  ๊ฐ€์‹œ์„ฑ๊ณผ
์ž‘์—… ์ƒ์‚ฐ์„ฑ ํ–ฅ์ƒ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

(Translation: "WMS stands for Warehouse Management System. It is a software system that automates and optimizes logistics processes such as inventory tracking, inbound/outbound management, picking, and packing. It is used for efficient warehouse operation and provides real-time inventory visibility and improved work productivity.")

์ฝ”๋“œ ์ƒ์„ฑ

์ž…๋ ฅ:

ํŒŒ์ด์ฌ์œผ๋กœ ๋ฆฌ์ŠคํŠธ๋ฅผ ์—ญ์ˆœ์œผ๋กœ ๋งŒ๋“ค์–ด์ค˜

์ถœ๋ ฅ:

# ์›๋ณธ ๋ชจ๋ธ๊ณผ ๋™์ผํ•œ ๊ณ ํ’ˆ์งˆ ์ฝ”๋“œ ์ƒ์„ฑ
my_list = [1, 2, 3, 4, 5]

# ๋ฐฉ๋ฒ• 1: reverse()
my_list.reverse()

# ๋ฐฉ๋ฒ• 2: ์Šฌ๋ผ์ด์‹ฑ
reversed_list = my_list[::-1]

# ๋ฐฉ๋ฒ• 3: reversed()
reversed_list = list(reversed(my_list))

์›๋ณธ ๋ชจ๋ธ ์ •๋ณด

์ด ๋ชจ๋ธ์€ ๋‹ค์Œ ๋ชจ๋ธ์˜ ์–‘์žํ™” ๋ฒ„์ „์ž…๋‹ˆ๋‹ค:

  • ์›๋ณธ ๋ชจ๋ธ: MyeongHo0621/eeve-vss-smh
  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ: 100K+ ๊ณ ํ’ˆ์งˆ ํ•œ๊ตญ์–ด instruction ๋ฐ์ดํ„ฐ
  • LoRA ์„ค์ •: r=64, alpha=128, dropout=0.05

์ž์„ธํ•œ ํ›ˆ๋ จ ๊ณผ์ •์€ ์›๋ณธ ๋ชจ๋ธ ํŽ˜์ด์ง€๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

Troubleshooting

CUDA Out of Memory

# Generate fewer new tokens per request
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}

bitsandbytes Installation Error

# Check CUDA version
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

๋А๋ฆฐ ์ƒ์„ฑ ์†๋„

  • 4-bit ์–‘์žํ™”๋Š” FP16๋ณด๋‹ค ์•ฝ๊ฐ„ ๋А๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋” ๋น ๋ฅธ ์†๋„๊ฐ€ ํ•„์š”ํ•˜๋ฉด ์›๋ณธ FP16 ๋ชจ๋ธ ์‚ฌ์šฉ ๊ถŒ์žฅ

Use Cases

✅ Suitable For

  • Low-end GPUs (GTX 1660, RTX 2060)
  • VRAM-constrained environments
  • Local development and testing
  • Personal projects
  • Research and education

⚠️ Not Recommended For

  • Production requiring ultra-fast inference
  • Environments with plenty of high-end GPUs → use the FP16 original

Limitations

  • ~1-2% quality loss due to 4-bit quantization
  • Slightly slower inference than FP16
  • Requires the bitsandbytes library
  • Windows may require additional setup for bitsandbytes

๋ผ์ด์„ ์Šค

  • ๋ชจ๋ธ ๋ผ์ด์„ ์Šค: CC-BY-NC-SA-4.0
  • ๋ฒ ์ด์Šค ๋ชจ๋ธ: EEVE-Korean-Instruct-10.8B-v1.0
  • ์ƒ์—…์  ์‚ฌ์šฉ: ์ œํ•œ์  (๋ผ์ด์„ ์Šค ์ฐธ์กฐ)

Citation

@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}

Acknowledgments

  • Original model: MyeongHo0621/eeve-vss-smh
  • Base model: Yanolja EEVE
  • Quantization library: BitsAndBytes
  • Framework: Hugging Face Transformers

๊ด€๋ จ ๋ชจ๋ธ

๋ชจ๋ธ ํฌ๊ธฐ VRAM ์šฉ๋„
eeve-vss-smh 21GB 21GB ๊ณ ์‚ฌ์–‘ GPU
eeve-vss-smh-bnb-4bit 5.5GB 3.5GB ์ €์‚ฌ์–‘ GPU โญ

Contact


์–‘์žํ™” ์ผ์ž: 2025-10-11
๋ฐฉ๋ฒ•: BitsAndBytes NF4 4-bit
์ƒํƒœ: ์ €์‚ฌ์–‘ GPU ๋ฐฐํฌ ์ค€๋น„ ์™„๋ฃŒ ๐Ÿš€
