---
language:
- ko
- en
license: cc-by-nc-sa-4.0
base_model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
tags:
- korean
- instruction-tuning
- lora
- merged
- quantized
- 4-bit
- bitsandbytes
- low-vram
library_name: transformers
pipeline_tag: text-generation
---

# EEVE-VSS-SMH-BNB-4bit

> **4-bit Quantized Version** | **4-bit 양자화 버전**

---

## English

### Model Description

This model is a **BitsAndBytes NF4 4-bit quantized** version of [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh).

#### Key Features

- ✅ **Low-VRAM Support**: Works on GTX series GPUs with 6GB VRAM
- ✅ **4-bit Quantization**: NF4 (NormalFloat4) with minimal quality loss (1-2%)
- ✅ **High-Quality Korean**: Maintains original model performance

### Quick Start

#### Installation

```bash
pip install transformers torch bitsandbytes accelerate
```

**Required**: the `bitsandbytes` library is mandatory!

#### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# Prompt template
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant:
"""

# Generate
user_input = "Implement Fibonacci sequence in Python"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

#### Alternative: Using torch.dtype Directly

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load with explicit dtype (automatic 4-bit loading)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
```

#### Simplified Method (Auto-load quantization config)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Automatically loads saved quantization settings
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
```

### System Requirements

#### Minimum Specifications

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **GPU** | GTX 1660 (6GB) | RTX 3060 (12GB) |
| **VRAM** | 4GB | 6GB+ |
| **RAM** | 8GB | 16GB+ |
| **CUDA** | 11.0+ | 12.0+ |

#### Tested Environments

- ✅ GTX 1660 (6GB VRAM) - Works
- ✅ RTX 2060 (6GB VRAM) - Works
- ✅ RTX 3060 (12GB VRAM) - Good
- ✅ RTX 3090 (24GB VRAM) - Excellent
- ✅ H100 (80GB VRAM) - Overkill

### Quantization Details

#### BitsAndBytes NF4

```yaml
Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization
```
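For reference, a checkpoint with these settings can be produced from the FP16 original in a few lines. The sketch below is illustrative rather than the exact script used for this repository; it assumes `transformers`/`bitsandbytes` versions recent enough to serialize 4-bit weights, and the output directory name is arbitrary:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Same NF4 settings as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load the FP16 original directly into 4-bit ...
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh")

# ... and write the quantized weights back out (requires 4-bit serialization support)
model.save_pretrained("./eeve-vss-smh-bnb-4bit")
tokenizer.save_pretrained("./eeve-vss-smh-bnb-4bit")
```

Loading the serialized folder later only needs `from_pretrained`, as shown in the Quick Start above.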
#### Performance Comparison

| Version | Model Size | VRAM Usage | Quality Loss | Inference Speed |
|---------|-----------|------------|--------------|-----------------|
| **FP16 Original** | ~21GB | ~21GB | 0% | ⚡⚡⚡⚡ |
| **BNB 4-bit** | ~5.5GB | ~3.5GB | 1-2% | ⚡⚡⚡ |

### Recommended Generation Parameters

```python
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}
```

#### Parameter Guide by Use Case

| Use Case | Temperature | Top P | Notes |
|----------|-------------|-------|-------|
| **Factual Answers** | 0.1-0.3 | 0.8-0.9 | Fact-based questions |
| **Balanced** | 0.5-0.7 | 0.85-0.95 | General usage |
| **Creative** | 0.8-1.0 | 0.9-1.0 | Stories, poems |

### Example Outputs

#### Code Generation

**Input**:
```
Implement a Python function to reverse a list
```

**Output**:
```python
# High-quality code generation like the original model
my_list = [1, 2, 3, 4, 5]

# Method 1: reverse()
my_list.reverse()

# Method 2: slicing
reversed_list = my_list[::-1]

# Method 3: reversed()
reversed_list = list(reversed(my_list))
```

### Original Model Information

This is a quantized version of:

- **Original Model**: [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)
- **Base Model**: [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **Training Data**: 100K+ high-quality Korean instruction data
- **LoRA Config**: r=64, alpha=128, dropout=0.05

For the detailed training process, see the [original model page](https://huggingface.co/MyeongHo0621/eeve-vss-smh).

### Troubleshooting

#### CUDA Out of Memory

```python
# Generate fewer tokens per request
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}
```
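To check whether a smaller `max_new_tokens` actually keeps you inside your VRAM budget, you can track peak GPU memory around a single generation call. A minimal sketch, assuming the `model`, `tokenizer`, and `inputs` from the Basic Usage example above:

```python
import torch

# Reset the peak-memory counter, then run one generation
torch.cuda.reset_peak_memory_stats()

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.85,
    pad_token_id=tokenizer.eos_token_id,
)

# Peak VRAM actually allocated during this call
peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"Peak VRAM during generation: {peak_gib:.2f} GiB")
```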
#### bitsandbytes Installation Error

```bash
# Check CUDA version
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade
```

#### Slow Generation Speed

- 4-bit quantization may be slightly slower than FP16
- For faster speed, use the [FP16 original model](https://huggingface.co/MyeongHo0621/eeve-vss-smh)

### Use Cases

#### ✅ Suitable For

- Low-end GPUs (GTX 1660, RTX 2060)
- VRAM-constrained environments
- Local development and testing
- Personal projects
- Research and education

#### ⚠️ Not Recommended For

- Production requiring ultra-fast inference
- Environments with sufficient high-end GPUs → use the [FP16 original](https://huggingface.co/MyeongHo0621/eeve-vss-smh)

### Limitations

- **~1-2% quality loss** due to 4-bit quantization
- Slightly slower inference than FP16
- Requires the `bitsandbytes` library
- Windows may require additional setup for bitsandbytes

### License

- **Model License**: CC-BY-NC-SA-4.0
- **Base Model**: [EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **Commercial Use**: Limited (see license)

### Citation

```bibtex
@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}
```

### Acknowledgments

- **Original Model**: [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)
- **Base Model**: [Yanolja EEVE](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **Quantization Library**: [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes)
- **Framework**: Hugging Face Transformers

### Related Models

| Model | Size | VRAM | Use Case |
|-------|------|------|----------|
| [eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh) | 21GB | 21GB | High-end GPUs |
| **eeve-vss-smh-bnb-4bit** | 5.5GB | 3.5GB | Low-end GPUs ⭐ |

### Contact

- **Original Model**: [eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)

---

**Quantization Date**: 2025-10-11

**Method**: BitsAndBytes NF4 4-bit

**Status**: Ready for Low-VRAM Deployment 🚀

---

## 한국어

### 모델 소개

이 모델은 [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)를 **BitsAndBytes NF4 4-bit**로 양자화한 버전입니다.

#### 주요 특징

- ✅ **저사양 GPU 지원**: GTX 시리즈, 6GB VRAM에서도 실행 가능
- ✅ **4-bit 양자화**: NF4 (NormalFloat4) - 품질 손실 최소 (1-2%)
- ✅ **고품질 한국어**: 원본 모델의 성능 유지

### 빠른 시작

#### 설치

```bash
pip install transformers torch bitsandbytes accelerate
```

**필수**: `bitsandbytes` 라이브러리가 반드시 필요합니다!

#### 기본 사용

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit 설정
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# 모델 로드
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")

# 프롬프트 템플릿
def create_prompt(user_input):
    return f"""A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: {user_input}
Assistant:
"""

# 대화
user_input = "파이썬으로 피보나치 수열 구현해줘"
prompt = create_prompt(user_input)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.85,
    repetition_penalty=1.0,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```
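프롬프트 생성과 디코딩을 매번 반복하지 않으려면 위 예제를 간단한 헬퍼 함수로 묶어 쓸 수도 있습니다. 아래는 위에서 로드한 `model`, `tokenizer`, `create_prompt`를 그대로 사용한다고 가정한 예시 스케치이며, 함수 이름 `chat`은 설명을 위한 것입니다:

```python
def chat(user_input: str, max_new_tokens: int = 512) -> str:
    """프롬프트 생성 → 토큰화 → 생성 → 디코딩을 한 번에 수행하는 헬퍼 (예시용)."""
    prompt = create_prompt(user_input)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.3,
        top_p=0.85,
        repetition_penalty=1.0,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # 입력 프롬프트 부분을 제외하고 새로 생성된 토큰만 디코딩
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

print(chat("파이썬으로 피보나치 수열 구현해줘"))
```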
#### 대안: torch.dtype 직접 사용

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# dtype 명시적 지정 (자동 4-bit 로딩)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    torch_dtype=torch.float16,  # 또는 torch.bfloat16
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
```

#### 간단한 방법 (저장된 설정 자동 로드)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 저장된 양자화 설정을 자동으로 로드
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/eeve-vss-smh-bnb-4bit",
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/eeve-vss-smh-bnb-4bit")
```

### 시스템 요구사항

#### 최소 사양

| 구성 요소 | 최소 사양 | 추천 사양 |
|---------|---------|---------|
| **GPU** | GTX 1660 (6GB) | RTX 3060 (12GB) |
| **VRAM** | 4GB | 6GB+ |
| **RAM** | 8GB | 16GB+ |
| **CUDA** | 11.0+ | 12.0+ |

#### 테스트된 환경

- ✅ GTX 1660 (6GB VRAM) - 실행 가능
- ✅ RTX 2060 (6GB VRAM) - 실행 가능
- ✅ RTX 3060 (12GB VRAM) - 여유있음
- ✅ RTX 3090 (24GB VRAM) - 매우 여유있음
- ✅ H100 (80GB VRAM) - 오버킬

### 양자화 세부사항

#### BitsAndBytes NF4

```yaml
Quantization Type: NF4 (NormalFloat4)
Bits: 4-bit
Compute dtype: float16
Double Quantization: True
Method: Weight-only quantization
```

#### 성능 비교

| 버전 | 모델 크기 | VRAM 사용 | 품질 손실 | 추론 속도 |
|------|----------|-----------|----------|----------|
| **FP16 원본** | ~21GB | ~21GB | 0% | ⚡⚡⚡⚡ |
| **BNB 4-bit** | ~5.5GB | ~3.5GB | 1-2% | ⚡⚡⚡ |

### 추천 생성 파라미터

```python
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.3,
    "top_p": 0.85,
    "repetition_penalty": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
}
```

#### 용도별 파라미터

| 용도 | Temperature | Top P | 설명 |
|------|-------------|-------|------|
| **정확한 답변** | 0.1-0.3 | 0.8-0.9 | 사실 기반 질문 |
| **균형 답변** | 0.5-0.7 | 0.85-0.95 | 일반적 사용 |
| **창의적 답변** | 0.8-1.0 | 0.9-1.0 | 스토리, 시 등 |

### 성능 예시

#### 도메인 질의응답

**입력**:
```
WMS가 뭐야?
```

**출력**:
```
WMS는 Warehouse Management System의 약자로, 창고 관리 시스템을 의미합니다.
재고 추적, 입출고 관리, 피킹, 패킹 등의 물류 프로세스를 자동화하고 최적화하는
소프트웨어 시스템입니다. 효율적인 창고 운영을 위해 사용되며, 실시간 재고 가시성과
작업 생산성 향상을 제공합니다.
```

#### 코드 생성

**입력**:
```
파이썬으로 리스트를 역순으로 만들어줘
```

**출력**:
```python
# 원본 모델과 동일한 고품질 코드 생성
my_list = [1, 2, 3, 4, 5]

# 방법 1: reverse()
my_list.reverse()

# 방법 2: 슬라이싱
reversed_list = my_list[::-1]

# 방법 3: reversed()
reversed_list = list(reversed(my_list))
```

### 원본 모델 정보

이 모델은 다음 모델의 양자화 버전입니다:

- **원본 모델**: [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)
- **베이스 모델**: [yanolja/EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **훈련 데이터**: 100K+ 고품질 한국어 instruction 데이터
- **LoRA 설정**: r=64, alpha=128, dropout=0.05

자세한 훈련 과정은 [원본 모델 페이지](https://huggingface.co/MyeongHo0621/eeve-vss-smh)를 참조하세요.

### 문제 해결

#### CUDA Out of Memory

```python
# 생성 토큰 수 줄이기
generation_config = {
    "max_new_tokens": 256,  # 512 → 256
    ...
}
```
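`max_new_tokens`를 줄인 뒤 실제로 VRAM 한도 안에 들어오는지 확인하려면, 생성 호출 동안의 최대 VRAM 사용량을 측정해 볼 수 있습니다. 위 기본 사용 예제의 `model`, `tokenizer`, `inputs`를 그대로 사용한다고 가정한 간단한 예시입니다:

```python
import torch

# 최대 메모리 카운터를 초기화한 뒤 생성 1회 실행
torch.cuda.reset_peak_memory_stats()

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.3,
    top_p=0.85,
    pad_token_id=tokenizer.eos_token_id,
)

# 이번 호출 동안 실제로 할당된 최대 VRAM
peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
print(f"생성 중 최대 VRAM 사용량: {peak_gib:.2f} GiB")
```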
#### bitsandbytes 설치 오류

```bash
# CUDA 버전 확인
nvidia-smi

# CUDA 11.x
pip install bitsandbytes

# CUDA 12.x
pip install bitsandbytes --upgrade
```

#### 느린 생성 속도

- 4-bit 양자화는 FP16보다 약간 느릴 수 있습니다
- 더 빠른 속도가 필요하면 [원본 FP16 모델](https://huggingface.co/MyeongHo0621/eeve-vss-smh) 사용 권장

### 사용 사례

#### ✅ 적합한 경우

- 저사양 GPU (GTX 1660, RTX 2060)
- VRAM 제약이 있는 환경
- 로컬 개발 및 테스트
- 개인 프로젝트
- 연구 및 교육

#### ⚠️ 권장하지 않는 경우

- 초고속 추론이 필요한 프로덕션
- 고사양 GPU가 충분한 환경 → [FP16 원본](https://huggingface.co/MyeongHo0621/eeve-vss-smh) 사용

### 제한사항

- 4-bit 양자화로 인해 **약 1-2% 품질 손실** 가능
- 추론 속도가 FP16보다 약간 느림
- `bitsandbytes` 라이브러리 필수
- Windows에서 bitsandbytes 설치 시 추가 설정 필요할 수 있음

### 라이선스

- **모델 라이선스**: CC-BY-NC-SA-4.0
- **베이스 모델**: [EEVE-Korean-Instruct-10.8B-v1.0](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **상업적 사용**: 제한적 (라이선스 참조)

### Citation

```bibtex
@misc{eeve-vss-smh-bnb-4bit-2025,
  author = {MyeongHo0621},
  title = {EEVE-VSS-SMH-BNB-4bit: 4-bit Quantized Korean Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/eeve-vss-smh-bnb-4bit}},
  note = {4-bit quantized version using BitsAndBytes NF4}
}
```

### Acknowledgments

- **원본 모델**: [MyeongHo0621/eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)
- **베이스 모델**: [Yanolja EEVE](https://huggingface.co/yanolja/EEVE-Korean-Instruct-10.8B-v1.0)
- **양자화 라이브러리**: [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes)
- **프레임워크**: Hugging Face Transformers

### 관련 모델

| 모델 | 크기 | VRAM | 용도 |
|------|------|------|------|
| [eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh) | 21GB | 21GB | 고사양 GPU |
| **eeve-vss-smh-bnb-4bit** | 5.5GB | 3.5GB | 저사양 GPU ⭐ |

### Contact

- **원본 모델**: [eeve-vss-smh](https://huggingface.co/MyeongHo0621/eeve-vss-smh)
- **Github**: [tuned_solar](https://github.com/EnzoMH/tuned_solar/tree/main/eeve)

---

**양자화 일자**: 2025-10-11

**방법**: BitsAndBytes NF4 4-bit

**상태**: 저사양 GPU 배포 준비 완료 🚀