---
language:
- ko
license: mit
tags:
- korean
- gpt
- causal-lm
- from-scratch
datasets:
- oz1115/korean-pretraining-corpus
widget:
- text: "인공지능의 미래는"
- text: "한국의 전통 문화는"
---

# korean-gpt-150m

Korean foundation model (GPT-style, trained from scratch)

## Model Description

This is a Korean text generation model trained from scratch.

- **Language:** Korean (한국어)
- **Model Type:** Autoregressive Language Model
- **Architecture:** Custom GPT (Transformer Decoder)
- **Training:** Self-supervised causal language modeling
- **Dataset:** Korean text corpus

## Intended Uses

### Primary Use

Korean text generation:
- Sentence completion
- Creative writing
- Dialogue generation

### Out-of-Scope Uses

- Tasks where factual accuracy matters (the model can hallucinate)
- Professional advice, such as medical or legal guidance
- Generating text for malicious purposes

## How to Use

### Installation
```bash
pip install torch transformers tokenizers
```

### Usage
```python
import torch
from tokenizers import Tokenizer

# Load the tokenizer
tokenizer = Tokenizer.from_pretrained("oz1115/korean-gpt-150m")

# Load the model (requires a custom implementation)
# model = CustomGPTModel.from_pretrained("oz1115/korean-gpt-150m")

# Generate text
prompt = "인공지능의 미래는"  # "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt).ids
# output = model.generate(input_ids, max_length=50)
# generated_text = tokenizer.decode(output)
```
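
Because the model class is a custom implementation, the `model.generate` call is left commented out. A minimal sketch of what a greedy decoding loop could look like follows, assuming a hypothetical `next_token_logits(ids)` callable that returns one score per vocabulary entry. The names `greedy_generate`, `next_token_logits`, and `toy_logits` are illustrative and not part of this repository; the 512-token window is the maximum sequence length stated in this card.

```python
from typing import Callable, List, Optional, Sequence

MAX_CONTEXT = 512  # max sequence length stated in this model card


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    max_new_tokens: int = 50,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Append the highest-scoring token repeatedly (greedy decoding).

    `next_token_logits` is a hypothetical stand-in for the custom model's
    forward pass: it takes token ids and returns one score per vocab entry.
    """
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        context = ids[-MAX_CONTEXT:]  # keep only the most recent tokens
        logits = next_token_logits(context)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_logits(context: Sequence[int]) -> List[float]:
    """Toy model over a 5-token vocabulary: always prefers (last id + 1) % 5."""
    scores = [0.0] * 5
    scores[(context[-1] + 1) % 5] = 1.0
    return scores


print(greedy_generate(toy_logits, [0], max_new_tokens=4))  # [0, 1, 2, 3, 4]
```

With the real model, `next_token_logits` would wrap a forward pass and return the logits for the last position. Sampling strategies such as top-k or temperature would replace the `max(...)` step.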

## Training Data

Korean Wikipedia and publicly available Korean text corpora.

## Training Procedure

### Preprocessing

- **Tokenizer:** BPE (Byte-Pair Encoding)
- **Vocabulary Size:** 32,000
- **Max Sequence Length:** 512
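
To illustrate how a 512-token maximum sequence length is typically applied during preprocessing, the sketch below splits a long stream of token ids into fixed-size training blocks. This card does not describe the actual packing strategy, so the function name `chunk_token_ids` and the choice to drop a short final block are assumptions.

```python
from typing import List, Sequence

MAX_SEQ_LEN = 512  # max sequence length from this model card


def chunk_token_ids(
    token_ids: Sequence[int],
    block_size: int = MAX_SEQ_LEN,
    drop_last: bool = True,
) -> List[List[int]]:
    """Split a token id stream into consecutive blocks of `block_size`.

    Assumption: a trailing block shorter than `block_size` is dropped
    unless `drop_last` is False. The real pipeline may pad instead.
    """
    blocks = [
        list(token_ids[start:start + block_size])
        for start in range(0, len(token_ids), block_size)
    ]
    if drop_last and blocks and len(blocks[-1]) < block_size:
        blocks.pop()
    return blocks


stream = list(range(1200))  # stand-in for tokenized corpus text
blocks = chunk_token_ids(stream)
print(len(blocks), len(blocks[0]))  # 2 512
```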

### Training

- **Optimizer:** AdamW
- **Learning Rate:** 5e-4
- **Batch Size:** 32 (effective)
- **Training Steps:** ~10,000
- **Hardware:** Google Colab (T4 GPU)
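
As a rough sanity check on these settings, the number of tokens processed during training can be bounded from the effective batch size, the 512-token maximum sequence length, and the approximate step count. The sketch below does that arithmetic. It is an upper bound because it assumes every sequence is filled to the maximum length, and the micro-batch size of 8 used to illustrate gradient accumulation is a hypothetical value that this card does not state.

```python
EFFECTIVE_BATCH_SIZE = 32   # effective batch size from this model card
MAX_SEQ_LEN = 512           # max sequence length from this model card
TRAINING_STEPS = 10_000     # approximate ("~10,000" in this card)


def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences full)."""
    return batch_size * seq_len


def accumulation_steps(effective_batch: int, micro_batch: int) -> int:
    """Micro-batches to accumulate so their sum equals the effective batch."""
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of micro batch")
    return effective_batch // micro_batch


per_step = tokens_per_step(EFFECTIVE_BATCH_SIZE, MAX_SEQ_LEN)
total = per_step * TRAINING_STEPS
print(per_step)  # 16384
print(total)     # 163840000

# Hypothetical micro-batch size; the card only gives the effective size.
print(accumulation_steps(EFFECTIVE_BATCH_SIZE, micro_batch=8))  # 4
```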

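## Evaluation

For an evaluation of Korean text generation quality, see the `korean_evaluation.json` file.

## Limitations

- The training data is limited, so the model may generalize poorly.
- Long-context handling is limited (max 512 tokens).
- Factual accuracy is not guaranteed.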

## Bias and Risks

The model may reflect biases present in its training data. Review generated text critically.

## Citation
```bibtex
@misc{korean-gpt-150m,
  author = {oz1115},
  title = {korean-gpt-150m: Korean Foundation Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/oz1115/korean-gpt-150m}
}
```

## Contact

- HuggingFace: @oz1115

## License

MIT