EEVE-VSS-SMH-BNB-4bit

4-bit Quantized Version | 4-bit 양자화 버전

English

Model Description

This model is a BitsAndBytes NF4 4-bit quantized version of MyeongHo0621/eeve-vss-smh.

Key Features

✅ Low-VRAM Support: Works on GTX series GPUs with 6GB VRAM
✅ 4-bit Quantization: NF4 (NormalFloat4) with minimal quality loss (1-2%)
✅ High-Quality Korean: Maintains original model performance

Quick Start

Installation

pip install transformers torch bitsandbytes accelerate

Required: bitsandbytes library is mandatory!

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# Generate
user_input = "Implement Fibonacci sequence in Python"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

Alternative: Using torch.dtype Directly

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with explicit dtype (automatic 4-bit loading)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

Simplified Method (Auto-load quantization config)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Automatically loads saved quantization settings
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

System Requirements

Minimum Specifications

Component	Minimum	Recommended
GPU	GTX 1660 (6GB)	RTX 3060 (12GB)
VRAM	4GB	6GB+
RAM	8GB	16GB+
CUDA	11.0+	12.0+

Tested Environments

✅ GTX 1660 (6GB VRAM) - Works
✅ RTX 2060 (6GB VRAM) - Works
✅ RTX 3060 (12GB VRAM) - Good
✅ RTX 3090 (24GB VRAM) - Excellent
✅ H100 (80GB VRAM) - Overkill

Quantization Details

BitsAndBytes NF4

Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization

Performance Comparison

Version	Model Size	VRAM Usage	Quality Loss	Inference Speed
FP16 Original	~21GB	~21GB	0%	⚡⚡⚡⚡
BNB 4-bit	~5.5GB	~3.5GB	1-2%	⚡⚡⚡

Recommended Generation Parameters

generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}

Parameter Guide by Use Case

Use Case	Temperature	Top P	Notes
Factual Answers	0.1-0.3	0.8-0.9	Fact-based questions
Balanced	0.5-0.7	0.85-0.95	General usage
Creative	0.8-1.0	0.9-1.0	Stories, poems

Example Outputs

Code Generation

Input:

Implement a Python function to reverse a list

Output:

# High-quality code generation like the original model
my_list = [1, 2, 3, 4, 5]

# Method 1: reverse()
my_list.reverse()

# Method 2: slicing
reversed_list = my_list[::-1]

# Method 3: reversed()
reversed_list = list(reversed(my_list))

Original Model Information

This is a quantized version of:

Original Model: MyeongHo0621/eeve-vss-smh
Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
Training Data: 100K+ high-quality Korean instruction data
LoRA Config: r=64, alpha=128, dropout=0.05

For detailed training process, see original model page.

Troubleshooting

CUDA Out of Memory

# Use lower batch size
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}

bitsandbytes Installation Error

# Check CUDA version
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

Slow Generation Speed

4-bit quantization may be slightly slower than FP16
For faster speed, use FP16 original model

Use Cases

✅ Suitable For

Low-end GPUs (GTX 1660, RTX 2060)
VRAM-constrained environments
Local development and testing
Personal projects
Research and education

⚠️ Not Recommended For

Production requiring ultra-fast inference
Environments with sufficient high-end GPUs → Use FP16 original

Limitations

~1-2% quality loss due to 4-bit quantization
Slightly slower inference than FP16
Requires bitsandbytes library
Windows may require additional setup for bitsandbytes

License

Model License: CC-BY-NC-SA-4.0
Base Model: EEVE-Korean-Instruct-10.8B-v1.0
Commercial Use: Limited (see license)

Citation

@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}

Acknowledgments

Original Model: MyeongHo0621/eeve-vss-smh
Base Model: Yanolja EEVE
Quantization Library: BitsAndBytes
Framework: Hugging Face Transformers

Related Models

Model	Size	VRAM	Use Case
eeve-vss-smh	21GB	21GB	High-end GPUs
eeve-vss-smh-bnb-4bit	5.5GB	3.5GB	Low-end GPUs ⭐

Contact

Original Model: eeve-vss-smh

Quantization Date: 2025-10-11
Method: BitsAndBytes NF4 4-bit
Status: Ready for Low-VRAM Deployment 🚀

한국어

모델 소개

이 모델은 MyeongHo0621/eeve-vss-smh를 BitsAndBytes NF4 4-bit로 양자화한 버전입니다.

주요 특징

✅ 저사양 GPU 지원: GTX 시리즈, 6GB VRAM에서도 실행 가능
✅ 4-bit 양자화: NF4 (NormalFloat4) - 품질 손실 최소 (1-2%)
✅ 고품질 한국어: 원본 모델의 성능 유지

빠른 시작

설치

pip install transformers torch bitsandbytes accelerate

필수: bitsandbytes 라이브러리가 반드시 필요합니다!

기본 사용

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit 설정
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# 모델 로드
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# 프롬프트 템플릿
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant: """

# 대화
user_input = "파이썬으로 피보나치 수열 구현해줘"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

대안: torch.dtype 직접 사용

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# dtype 명시적 지정 (자동 4-bit 로딩)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # 또는 torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

간단한 방법 (저장된 설정 자동 로드)

from transformers import AutoModelForCausalLM, AutoTokenizer

# 저장된 양자화 설정을 자동으로 로드
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

시스템 요구사항

최소 사양

구성 요소	최소 사양	추천 사양
GPU	GTX 1660 (6GB)	RTX 3060 (12GB)
VRAM	4GB	6GB+
RAM	8GB	16GB+
CUDA	11.0+	12.0+

테스트된 환경

✅ GTX 1660 (6GB VRAM) - 실행 가능
✅ RTX 2060 (6GB VRAM) - 실행 가능
✅ RTX 3060 (12GB VRAM) - 여유있음
✅ RTX 3090 (24GB VRAM) - 매우 여유있음
✅ H100 (80GB VRAM) - 오버킬

양자화 세부사항

BitsAndBytes NF4

Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization

성능 비교

버전	모델 크기	VRAM 사용	품질 손실	추론 속도
FP16 원본	~21GB	~21GB	0%	⚡⚡⚡⚡
BNB 4-bit	~5.5GB	~3.5GB	1-2%	⚡⚡⚡

용도	Temperature	Top P	설명
정확한 답변	0.1-0.3	0.8-0.9	사실 기반 질문
균형 답변	0.5-0.7	0.85-0.95	일반적 사용
창의적 답변	0.8-1.0	0.9-1.0	스토리, 시 등

성능 예시

반말 → 존댓말 변환

입력:

WMS가 뭐야?

출력:

WMS는 Warehouse Management System의 약자로, 창고 관리 시스템을 의미합니다.
재고 추적, 입출고 관리, 피킹, 패킹 등의 물류 프로세스를 자동화하고 최적화하는
소프트웨어 시스템입니다. 효율적인 창고 운영을 위해 사용되며, 실시간 재고 가시성과
작업 생산성 향상을 제공합니다.

코드 생성

입력:

파이썬으로 리스트를 역순으로 만들어줘

출력:

# 원본 모델과 동일한 고품질 코드 생성
my_list = [1, 2, 3, 4, 5]

# 방법 1: reverse()
my_list.reverse()

# 방법 2: 슬라이싱
reversed_list = my_list[::-1]

# 방법 3: reversed()
reversed_list = list(reversed(my_list))

원본 모델 정보

이 모델은 다음 모델의 양자화 버전입니다:

원본 모델: MyeongHo0621/eeve-vss-smh
베이스 모델: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
훈련 데이터: 100K+ 고품질 한국어 instruction 데이터
LoRA 설정: r=64, alpha=128, dropout=0.05

자세한 훈련 과정은 원본 모델 페이지를 참조하세요.

문제 해결

CUDA Out of Memory

# 더 낮은 배치 크기 사용
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}

bitsandbytes 설치 오류

# CUDA 버전 확인
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade

느린 생성 속도

4-bit 양자화는 FP16보다 약간 느릴 수 있습니다
더 빠른 속도가 필요하면 원본 FP16 모델 사용 권장

사용 사례

✅ 적합한 경우

저사양 GPU (GTX 1660, RTX 2060)
VRAM 제약이 있는 환경
로컬 개발 및 테스트
개인 프로젝트
연구 및 교육

⚠️ 권장하지 않는 경우

초고속 추론이 필요한 프로덕션
고사양 GPU가 충분한 환경 → FP16 원본 사용

제한사항

4-bit 양자화로 인해 약 1-2% 품질 손실 가능
추론 속도가 FP16보다 약간 느림
bitsandbytes 라이브러리 필수
Windows에서 bitsandbytes 설치 시 추가 설정 필요할 수 있음

라이선스

모델 라이선스: CC-BY-NC-SA-4.0
베이스 모델: EEVE-Korean-Instruct-10.8B-v1.0
상업적 사용: 제한적 (라이선스 참조)

Citation

@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}

Acknowledgments

원본 모델: MyeongHo0621/eeve-vss-smh
베이스 모델: Yanolja EEVE
양자화 라이브러리: BitsAndBytes
프레임워크: Hugging Face Transformers

모델	크기	VRAM	용도
eeve-vss-smh	21GB	21GB	고사양 GPU
eeve-vss-smh-bnb-4bit	5.5GB	3.5GB	저사양 GPU ⭐

Contact

원본 모델: eeve-vss-smh
Github : tuned_solar

양자화 일자: 2025-10-11
방법: BitsAndBytes NF4 4-bit
상태: 저사양 GPU 배포 준비 완료 🚀

Downloads last month: 3

Safetensors

Model size

11B params

Tensor type

F32

F16

Model tree for MyeongHo0621/eeve-vss-smh-bnb-4bit

Base model

upstage/SOLAR-10.7B-v1.0

Finetuned

yanolja/YanoljaNEXT-EEVE-10.8B

Finetuned

yanolja/YanoljaNEXT-EEVE-Instruct-10.8B

Adapter

(26)

this model

EEVE-VSS-SMH-BNB-4bit

English

Model Description

Key Features

Quick Start

Installation

Basic Usage

Alternative: Using torch.dtype Directly

Simplified Method (Auto-load quantization config)

System Requirements

Minimum Specifications

Tested Environments

Quantization Details

BitsAndBytes NF4

Performance Comparison

Recommended Generation Parameters

Parameter Guide by Use Case

Example Outputs

Code Generation

Original Model Information

Troubleshooting

CUDA Out of Memory

bitsandbytes Installation Error

Slow Generation Speed

Use Cases

✅ Suitable For

⚠️ Not Recommended For

Limitations

License

Citation

Acknowledgments

Related Models

Contact

한국어

모델 소개

주요 특징

빠른 시작

설치

기본 사용

대안: torch.dtype 직접 사용

간단한 방법 (저장된 설정 자동 로드)

시스템 요구사항

최소 사양

테스트된 환경

양자화 세부사항

BitsAndBytes NF4

성능 비교

추천 생성 파라미터

용도별 파라미터

성능 예시

반말 → 존댓말 변환

코드 생성

원본 모델 정보

문제 해결

CUDA Out of Memory

bitsandbytes 설치 오류

느린 생성 속도

사용 사례

✅ 적합한 경우

⚠️ 권장하지 않는 경우

제한사항

라이선스

Citation

Acknowledgments

관련 모델

Contact

Model tree for MyeongHo0621/eeve-vss-smh-bnb-4bit