+---
+language:
+- ko
+license: mit
+tags:
+- korean
+- gpt
+- causal-lm
+- from-scratch
+datasets:
+- oz1115/korean-pretraining-corpus
+widget:
+- text: "인공지능의 미래는"
+- text: "한국의 전통 문화는"
+---
+# korean-gpt-150m
+Korean foundation model (GPT-style, trained from scratch)
+## Model Description
+This is a Korean text-generation model trained from scratch.
+- **Language:** Korean (한국어)
+- **Model Type:** Autoregressive Language Model
+- **Architecture:** Custom GPT (Transformer Decoder)
+- **Training:** Self-supervised causal language modeling
+- **Dataset:** Korean text corpus
+## Intended Uses
+### Primary Use
+Korean text generation:
+- Sentence completion
+- Creative writing
+- Dialogue generation
+### Out-of-Scope Uses
+- Tasks where factual accuracy matters (the model may hallucinate)
+- Professional advice, such as medical or legal guidance
+- Generating text for malicious purposes
+## How to Use
+### Installation
+```bash
+pip install torch transformers tokenizers
+```
+### Usage
+```python
+import torch
+from tokenizers import Tokenizer
+# Load the tokenizer
+tokenizer = Tokenizer.from_pretrained("oz1115/korean-gpt-150m")
+# Load the model (requires a custom implementation)
+# model = CustomGPTModel.from_pretrained("oz1115/korean-gpt-150m")
+# Generate text
+prompt = "인공지능의 미래는"
+input_ids = tokenizer.encode(prompt).ids
+# output = model.generate(input_ids, max_length=50)
+# generated_text = tokenizer.decode(output)
+```
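Because the model class is a custom implementation and the generation call above is commented out, the following is a minimal sketch of what a greedy decoding loop could look like. It is not the author's `generate` method. The names `greedy_generate`, `next_token_logits`, and `toy_logits` are hypothetical, and the toy function stands in for a real forward pass. The sketch counts only newly generated tokens (`max_new_tokens`), which may differ from how the commented `max_length=50` argument is interpreted.

```python
from typing import Callable, List, Optional, Sequence


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    max_new_tokens: int = 50,
    max_context: int = 512,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Append up to max_new_tokens ids, always taking the highest-scoring token."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        # Feed only the most recent max_context tokens to the model.
        context = ids[-max_context:]
        logits = next_token_logits(context)
        # Greedy choice: index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        # Stop early if an end-of-sequence id was given and just produced.
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_logits(context: Sequence[int], vocab_size: int = 10) -> List[float]:
    """Hypothetical stand-in for a model: prefers (last id + 1) mod vocab_size."""
    target = (context[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


print(greedy_generate(toy_logits, [3], max_new_tokens=4))  # [3, 4, 5, 6, 7]
```

With a real model, `next_token_logits` would wrap a forward pass and return the logits for the last position. Sampling strategies such as top-k or temperature would replace the greedy `max` step.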
+## Training Data
+Korean Wikipedia and publicly available Korean text corpora.
+## Training Procedure
+### Preprocessing
+- **Tokenizer:** BPE (Byte-Pair Encoding)
+- **Vocabulary Size:** 32,000
+- **Max Sequence Length:** 512
+### Training
+- **Optimizer:** AdamW
+- **Learning Rate:** 5e-4
+- **Batch Size:** 32 (effective)
+- **Training Steps:** ~10,000
+- **Hardware:** Google Colab (T4 GPU)
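The training figures above imply a rough upper bound on the number of tokens seen during training. The sketch below works through that arithmetic. It assumes every step consumes one full effective batch of maximum-length sequences, which the card does not confirm. The micro-batch size of 8 is a hypothetical value chosen only to illustrate how an effective batch of 32 could be reached through gradient accumulation on a single T4.

```python
# Figures taken from the card.
EFFECTIVE_BATCH_SIZE = 32
MAX_SEQ_LEN = 512
TRAINING_STEPS = 10_000  # the card states "~10,000"

# Hypothetical value, not stated in the card.
MICRO_BATCH_SIZE = 8


def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Number of micro-batches accumulated before each optimizer step."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of the micro-batch")
    return effective_batch // micro_batch


def max_tokens_seen(effective_batch: int, seq_len: int, steps: int) -> int:
    """Upper bound on tokens processed, assuming every sequence is full length."""
    return effective_batch * seq_len * steps


print(accumulation_steps(EFFECTIVE_BATCH_SIZE, MICRO_BATCH_SIZE))  # 4
print(max_tokens_seen(EFFECTIVE_BATCH_SIZE, MAX_SEQ_LEN, TRAINING_STEPS))  # 163840000
```

Under these assumptions the model saw at most about 164 million tokens. The real count is lower if sequences were shorter than 512 tokens or padded.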
+## Evaluation
+For an evaluation of Korean text generation quality, see the `korean_evaluation.json` file.
+## Limitations
+- The training data is limited, so the model may generalize poorly.
+- Context is capped at 512 tokens, so long inputs cannot be processed in full.
+- Factual accuracy is not guaranteed.
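One practical consequence of the 512-token limit is that a prompt plus the tokens to be generated must fit inside the context window. The following is a minimal sketch of one way to handle that: it keeps the most recent prompt tokens and drops the oldest. The function name `fit_prompt` and the default of 50 new tokens are assumptions for illustration, not part of the model's code.

```python
from typing import List, Sequence

MAX_CONTEXT = 512  # maximum sequence length stated in the card


def fit_prompt(
    input_ids: Sequence[int],
    max_new_tokens: int = 50,
    max_context: int = MAX_CONTEXT,
) -> List[int]:
    """Trim a prompt from the left so prompt + new tokens fit in the context."""
    budget = max_context - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Keep the last `budget` ids; shorter prompts are returned unchanged.
    return list(input_ids[-budget:])


long_prompt = list(range(1000))
trimmed = fit_prompt(long_prompt)
print(len(trimmed))  # 462
```

Dropping the oldest tokens favors the text nearest the generation point. Whether that is the right choice depends on the task.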
+## Bias and Risks
+The model may reflect biases present in its training data. Review generated text critically.
+## Citation
+```bibtex
+@misc{korean-gpt-150m,
+  author = {oz1115},
+  title = {korean-gpt-150m: Korean Foundation Model},
+  year = {2025},
+  publisher = {HuggingFace},
+  url = {https://huggingface.co/oz1115/korean-gpt-150m}
+}
+```
+## Contact
+- HuggingFace: @oz1115
+## License
+MIT