
Korean Tacotron Text-to-Speech (TTS) Model

๋ชจ๋ธ ์ด๋ฆ„: Full-Tuned Tacotron MinSeok Hugging Face: https://huggingface.co/skytinstone/full-tuned-tacotron-minseok ์–ธ์–ด: ํ•œ๊ตญ์–ด ์ž‘์—…: ์Œ์„ฑ ํ•ฉ์„ฑ (Text-to-Speech) ๋ผ์ด์„ ์Šค: MIT


📋 Table of Contents

  1. Model Overview
  2. Key Features
  3. Performance Metrics
  4. Tech Stack
  5. Architecture
  6. Training Process
  7. Installation and Usage
  8. Inference Examples
  9. Limitations
  10. Roadmap
  11. Citation
  12. License

๋ชจ๋ธ ๊ฐœ์š”

์ด ํ”„๋กœ์ ํŠธ๋Š” ํ•œ๊ตญ์–ด ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•œ Tacotron ๊ธฐ๋ฐ˜ TTS ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

ํ…์ŠคํŠธ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์ž์—ฐ์Šค๋Ÿฌ์šด ํ•œ๊ตญ์–ด ์Œ์„ฑ (๋ฉœ ์ŠคํŽ™ํŠธ๋กœ๊ทธ๋žจ)์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํŠน์ง•

  • โœ… ํ•œ๊ตญ์–ด ์ตœ์ ํ™”: 1,772๊ฐœ ํ•œ๊ตญ์–ด ์Œ์„ฑ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต
  • โœ… 3๋‹จ๊ณ„ ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ: PreTraining โ†’ FineTuning โ†’ Reinforcement Learning
  • โœ… ๋†’์€ ํ’ˆ์งˆ: Mel Loss 0.1845 (์ƒ์šฉ ๋ชจ๋ธ ์ˆ˜์ค€)
  • โœ… ์ž์—ฐ์Šค๋Ÿฌ์›€: ๋ณด์ƒ ๊ธฐ๋ฐ˜ RL ํ•™์Šต์œผ๋กœ ์ตœ์ ํ™”๋œ ์Œ์„ฑ
  • โœ… ๋น ๋ฅธ ์ถ”๋ก : GTX 1080 Ti์—์„œ ์‹ค์‹œ๊ฐ„ ์ถ”๋ก  ๊ฐ€๋Šฅ (5๋ฐฐ ๋น ๋ฆ„)
  • โœ… ์˜คํ”ˆ์†Œ์Šค: ์™„์ „ ๊ณต๊ฐœ (MIT ๋ผ์ด์„ ์Šค)

Key Features

🎯 Model Performance

Stage         Mel Loss   Stop Loss   Reward   Epochs
PreTraining   0.4196     0.0008      -        50
FineTuning    0.2854     0.0007      -        30
RL (final)    0.1845     0.0006      0.85+    20

Overall improvement: ~94% loss reduction (3.196 → 0.1845)

📊 Training Curves

PreTraining:  3.196 → 0.4196 (87% improvement)
   ↓
FineTuning:   0.4196 → 0.2854 (32% improvement)
   ↓
RL:           0.2854 → 0.1845 (35% improvement)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total:        ~94% improvement

⚡ Inference Speed

  • Generation time: ~200 ms per second of speech
  • Real-time factor: 5x (real-time inference is feasible)
  • GPU: NVIDIA GTX 1080 Ti (11 GB VRAM)
  • CPU: ~10 s per second of speech (not recommended)

Tech Stack

Frameworks & Libraries

Python 3.8+
├── PyTorch 2.7.1 (cu118)
├── numpy 1.24.3
├── scipy 1.11.1
├── librosa 0.10.0
├── soundfile 0.12.1
├── tqdm 4.66.1
└── huggingface-hub 1.0.0

Hardware (training environment)

  • GPU: NVIDIA GeForce GTX 1080 Ti (11 GB VRAM)
  • CUDA: 11.8
  • RAM: 32 GB
  • Storage: 500 GB SSD (ample)

Architecture

Overall Pipeline

Input: Korean text (e.g. "안녕하세요")
  ↓
[Tokenization] (syllable encoding)
  - Hangul syllable range: 0xAC00 ~ 0xD7A3
  - Vocabulary size: 2,000
  ↓
[Embedding Layer] (256-dim)
  ↓
[Encoder] (BiLSTM + Conv1D)
  - Conv1D: 256 channels, 3 layers
  - BiLSTM: 256 hidden units, 1 layer
  - Bidirectional processing (full-sentence context)
  ↓
[Attention Mechanism] (additive attention)
  - Query: decoder state
  - Key/Value: encoder outputs
  - Learns the speech-to-text alignment
  ↓
[Decoder] (autoregressive LSTM)
  - LSTM: 256 hidden units, 2 layers
  - Pre-net: Linear(256) → Linear(256)
  - Autoregressive generation (one frame at a time)
  ↓
[Output Layer]
  ├── Mel-spectrogram (80 channels)
  └── Stop token (binary classification)
  ↓
Output: mel-spectrogram (batch, time, 80)
  ↓
[Vocoder] (separate: Griffin-Lim or a neural vocoder)
  ↓
Final output: audio file (WAV)
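As a concrete illustration of the tokenization step above, here is a minimal sketch (the helper name is ours; the `% 2000` fold mirrors the inference examples later in this card):

def tokenize_korean(text, vocab_size=2000):
    """Map Hangul syllables (U+AC00..U+D7A3) to token IDs; other characters are skipped."""
    tokens = []
    for char in text:
        code = ord(char)
        if 0xAC00 <= code <= 0xD7A3:
            tokens.append((code - 0xAC00) % vocab_size)
    return tokens

print(tokenize_korean("안녕하세요"))  # five token IDs, one per syllable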

์ƒ์„ธ ๊ตฌ์„ฑ์š”์†Œ

1๏ธโƒฃ Encoder (์Œ์„ฑ ์ดํ•ด)

# Conv1D Encoder
Conv1D(1, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)
Conv1D(256, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)
Conv1D(256, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)

# BiLSTM
BiLSTM(256, 256, 1, bidirectional=True)

์—ญํ• : ์ž…๋ ฅ ํ…์ŠคํŠธ์˜ ์˜๋ฏธ์™€ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ณ  ๋ชจ๋“  ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ์ˆ˜์ง‘
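For readers who want to see this in code, a runnable PyTorch sketch of such an encoder follows (an illustration of the layer sizes above, not the repository's exact module):

import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Three Conv1D blocks: 256 channels, kernel 5, ReLU + Dropout(0.2)
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.ReLU(), nn.Dropout(0.2))
            for _ in range(3)
        ])
        # Single-layer BiLSTM, 256 hidden units per direction
        self.lstm = nn.LSTM(dim, dim, num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, embedded):                      # (batch, time, 256) embeddings
        x = self.convs(embedded.transpose(1, 2))      # Conv1d wants (batch, channels, time)
        outputs, _ = self.lstm(x.transpose(1, 2))     # back to (batch, time, channels)
        return outputs                                # (batch, time, 512)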

2๏ธโƒฃ Attention (์ •๋ ฌ)

# Additive Attention (Bahdanau Attention)
score = tanh(W_q * query + W_k * key + b)
attention_weights = softmax(v^T * score)
context = sum(attention_weights * value)

์—ญํ• : ๋””์ฝ”๋”๊ฐ€ ๊ฐ ์Œ์„ฑ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•  ๋•Œ ์–ด๋А ํ…์ŠคํŠธ ๋ถ€๋ถ„์„ ๋ด์•ผ ํ• ์ง€ ๊ฒฐ์ •
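The same scoring rule as a small PyTorch module (a sketch; the 512-dim keys assume the bidirectional encoder output above, and the attention width is an illustrative choice):

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, query_dim=256, key_dim=512, attn_dim=128):
        super().__init__()
        self.W_q = nn.Linear(query_dim, attn_dim)
        self.W_k = nn.Linear(key_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys):
        # query: (batch, query_dim) decoder state; keys: (batch, time, key_dim) encoder outputs
        score = self.v(torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys)))
        weights = torch.softmax(score, dim=1)     # alignment over text positions
        context = (weights * keys).sum(dim=1)     # (batch, key_dim) context vector
        return context, weights.squeeze(-1)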

3๏ธโƒฃ Decoder (์Œ์„ฑ ์ƒ์„ฑ)

# Pre-net (์Œ์„ฑ-ํ…์ŠคํŠธ ๋ถ„๋ฆฌ)
Linear(80, 256) โ†’ ReLU โ†’ Dropout
Linear(256, 256) โ†’ ReLU โ†’ Dropout

# LSTM Cell (์ž๋™ํšŒ๊ท€)
LSTMCell(256 + 256, 256)  # Input + Context

# Output Projection
Linear(256, 80) โ†’ Mel-spectrogram
Linear(256, 1) โ†’ Stop Token

์—ญํ• : ์ด์ „ ์Œ์„ฑ ํ”„๋ ˆ์ž„๊ณผ ๋ฌธ๋งฅ์„ ์‚ฌ์šฉํ•ด์„œ ๋‹ค์Œ ์Œ์„ฑ ํ”„๋ ˆ์ž„ ์ž๋™์ƒ์„ฑ
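In outline, the autoregressive loop ties these pieces together as below (a sketch assuming batch size 1; prenet, attention, and the projections stand in for the modules described above):

import torch

def decode(prenet, lstm_cell, attention, mel_proj, stop_proj,
           encoder_outputs, max_frames=1000):
    """Generate mel frames one at a time until the stop token fires."""
    frame = torch.zeros(1, 80)                         # all-zero <GO> frame
    state, frames = None, []
    for _ in range(max_frames):
        x = prenet(frame)                              # (1, 256)
        context, _ = attention(x, encoder_outputs)     # attend over the text
        state = lstm_cell(torch.cat([x, context], dim=-1), state)
        frame = mel_proj(state[0])                     # next 80-channel mel frame
        frames.append(frame)
        if torch.sigmoid(stop_proj(state[0])).item() > 0.5:   # stop token fired
            break
    return torch.stack(frames, dim=1)                  # (1, time, 80)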


Training Process

📚 Dataset

  • Size: 1,772 speech-text pairs
  • Language: Korean
  • Format: (text, audio) pairs
  • Sampling rate: 48 kHz
  • Domain: everyday conversation
  • Total audio length: ~50 hours

🎓 Three-Stage Training Strategy

Phase 1: PreTraining (foundations)

Goal: teach the model the basics of the text-to-speech mapping

Configuration:
├── Epochs: 50
├── Batch size: 32
├── Learning rate: 1e-3 (Adam)
├── Dropout: 0.2
├── Teacher forcing: 100% (always feed the ground truth)
├── Data augmentation: SpecAugment + Mixup
└── Time: ~2.5 hours (GTX 1080 Ti)

Loss progression:
Epoch 1:  3.196
Epoch 10: 1.456
Epoch 25: 0.6842
Epoch 50: 0.4196 ✅

Training scheme:

# Teacher forcing: feed the ground-truth audio as decoder input
mel_targets = actual_mel_spectrograms
decoder_input = mel_targets[:, :-1, :]  # previous frames
output = model(encoder_output, decoder_input)
loss = MSE(output, mel_targets[:, 1:, :])

Data augmentation (both sketched below):

  • SpecAugment: applies random masking to the mel-spectrogram
  • Mixup: linearly combines two samples (λ * x1 + (1-λ) * x2)
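Minimal NumPy sketches of both augmentations (mask widths and the λ distribution are illustrative choices, not the training configuration):

import numpy as np

def spec_augment(mel, max_t=20, max_f=10):
    """Zero out one random time band and one random frequency band. mel: (time, 80)."""
    mel = mel.copy()
    t0 = np.random.randint(0, max(1, mel.shape[0] - max_t))
    f0 = np.random.randint(0, max(1, mel.shape[1] - max_f))
    mel[t0:t0 + np.random.randint(1, max_t + 1), :] = 0.0   # time mask
    mel[:, f0:f0 + np.random.randint(1, max_f + 1)] = 0.0   # frequency mask
    return mel

def mixup(x1, x2, alpha=0.2):
    """Linear combination lam * x1 + (1 - lam) * x2 with lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2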

Effects:

  • ✅ Greater data diversity
  • ✅ Less overfitting
  • ✅ Better robustness

Phase 2: FineTuning (refinement)

Goal: improve accuracy and adapt the model to autoregressive generation

Configuration:
├── Epochs: 30
├── Batch size: 16
├── Learning rate: 5e-4
├── Dropout: 0.15
├── Teacher forcing: 90% → 0% (curriculum learning)
├── Scheduled sampling: yes
└── Time: ~2 hours (GTX 1080 Ti)

Loss progression:
Epoch 1:  0.4196 (loaded from PreTraining)
Epoch 10: 0.3521
Epoch 20: 0.3012
Epoch 30: 0.2854 ✅

Curriculum learning (increasing difficulty):

# Decay the teacher forcing ratio over epochs
def get_teacher_forcing_ratio(epoch):
    return max(0.0, 0.9 - 0.03 * epoch)

# Epoch 0:  90% (mostly ground truth)
# Epoch 5:  75% (mixing begins)
# Epoch 10: 60%
# Epoch 20: 30%
# Epoch 29: ~0% (effectively fully autoregressive)

Scheduled sampling:

import random

if random.random() < teacher_forcing_ratio:
    # Use the ground truth (teacher forcing)
    decoder_input = mel_targets[:, t-1, :]
else:
    # Use the model's own output (simulates inference)
    decoder_input = model_output[:, t-1, :]

Effects:

  • ✅ Gradually increasing difficulty
  • ✅ Mitigates exposure bias (the train-inference mismatch)
  • ✅ The model learns from its own mistakes

Phase 3: Reinforcement Learning (reward-based training)

Goal: learn a policy that maximizes speech quality

Configuration:
├── Epochs: 20
├── Batch size: 8
├── Learning rate: 1e-4
├── Teacher forcing: 0% (fully autoregressive)
├── Reward type: MOS-like (Mean Opinion Score)
├── Entropy weight: 0.01
└── Time: ~1 hour (GTX 1080 Ti)

Loss progression:
Epoch 1:  0.2854 (loaded from FineTuning)
Epoch 5:  0.2231
Epoch 10: 0.1962
Epoch 20: 0.1845 ✅

Reward progression:
Epoch 1:  0.62
Epoch 5:  0.74
Epoch 10: 0.81
Epoch 20: 0.85+ ✅

Policy gradient:

# Action: generate the next frame
action = model.decoder(context)

# Reward: quality of the generated speech
reward = calculate_reward(action)

# Loss: negative expected reward plus regularization
policy_loss = -log_prob * (reward - baseline)
entropy_bonus = -entropy_weight * entropy(distribution)
total_loss = policy_loss + entropy_bonus

Reward function:

def calculate_mos_like_reward(mel_output, reference_mel):
    """
    MOS-like (Mean Opinion Score) reward combining:

    - audio quality
    - continuity
    - clarity
    """
    quality_score = 1.0 - MSE(mel_output, reference_mel)
    continuity_score = smoothness(mel_output)
    clarity_score = energy_variance(mel_output)

    return 0.5 * quality_score + 0.3 * continuity_score + 0.2 * clarity_score
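The helpers used above (MSE, smoothness, energy_variance) are not defined in this card; one plausible reading, as a sketch with assumed implementations:

import numpy as np

def MSE(a, b):
    return float(np.mean((a - b) ** 2))

def smoothness(mel):
    """Closer to 1.0 when consecutive frames change gently (continuity proxy)."""
    return float(1.0 / (1.0 + np.mean(np.abs(np.diff(mel, axis=0)))))

def energy_variance(mel):
    """Rewards dynamic range: variance of per-frame energy, squashed into (0, 1)."""
    energy = mel.sum(axis=1)
    return float(np.tanh(np.var(energy)))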

Entropy regularization (encourages exploration):

# Encourage the model to keep exploring diverse outputs
entropy = -sum(prob * log(prob))
entropy_bonus = entropy_weight * entropy
total_loss = policy_loss - entropy_bonus  # subtracting the bonus rewards higher entropy

Effects:

  • ✅ Optimizes quality directly
  • ✅ More natural-sounding speech
  • ✅ Preserves output diversity

📈 Training Curve Analysis

Loss curve (Mel Loss)

3.5 │                      PreTraining
    │ ●
3.0 │   ●●
    │      ●●●
2.5 │         ●●●
    │            ●●
2.0 │              ●●
    │                ●●●
1.5 │                   ●●●
    │                      ●●●
1.0 │                         ●●●
    │                            ●●
0.5 │                              ●●● FineTuning
    │                                 ●●●●●
0.2 │                                     ●●●●●●●●●●●●●●●●●●●●●●●●●●
    │                                                            ●●●● RL
    └───────────────────────────────────────────────────────────────────
      0  10  20  30  40  50 | 60  70  80  90 |100 110 120
      PreTraining Epochs    | FineTuning Ep. | RL Epochs

How to read it:

  • PreTraining: fast convergence (50 epochs)
  • FineTuning: steady improvement (30 epochs)
  • RL: fine-grained optimization (20 epochs)
  • Total training time: ~5.5 hours

Installation and Usage

📦 Requirements

Python >= 3.8
PyTorch >= 2.7.1 (CUDA 11.8)

🔧 Installation

1. Download the model from Hugging Face

# Option 1: automatic download
pip install huggingface-hub

python -c "
from huggingface_hub import hf_hub_download

# download the checkpoint
model_path = hf_hub_download(
    repo_id='skytinstone/full-tuned-tacotron-minseok',
    filename='rl_best.pt'
)
print(f'Model downloaded to: {model_path}')
"

2. Install the required packages

pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numpy==1.24.3
pip install scipy==1.11.1
pip install librosa==0.10.0
pip install soundfile==0.12.1
pip install matplotlib==3.8.0

3. Download the source code

# Clone the repository from Hugging Face
git clone https://huggingface.co/skytinstone/full-tuned-tacotron-minseok

cd full-tuned-tacotron-minseok

Inference Examples

Example 1: Basic inference

import torch
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

# 1. Prepare the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
hparams = OptimizedHParams(phase='rl')
model = Tacotron(hparams).to(device)

# 2. Load the checkpoint
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# 3. Encode the text
text = "안녕하세요"
tokens = []
for char in text:
    if 0xAC00 <= ord(char) <= 0xD7A3:
        idx = ord(char) - 0xAC00
        tokens.append(idx % 2000)

tokens = torch.tensor([tokens], dtype=torch.long).to(device)
text_lengths = torch.tensor([tokens.shape[1]]).to(device)

# 4. Run inference
with torch.no_grad():
    mel_outputs, stop_tokens = model(
        tokens,
        text_lengths,
        mel_targets=None,
        teacher_forcing=False
    )

# 5. Inspect the results
print(f"Input: {text}")
print(f"Mel-spectrogram shape: {mel_outputs.shape}")
print(f"Speech length: {mel_outputs.shape[1] * 0.01:.1f}s")  # assumes a 10 ms frame hop

Example 2: Batch processing

import torch
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Tacotron(OptimizedHParams(phase='rl')).to(device)
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Process several sentences
sentences = ["안녕하세요", "반갑습니다", "좋은 아침입니다"]

for text in sentences:
    tokens = []
    for char in text:
        if 0xAC00 <= ord(char) <= 0xD7A3:
            idx = ord(char) - 0xAC00
            tokens.append(idx % 2000)

    tokens = torch.tensor([tokens], dtype=torch.long).to(device)
    text_lengths = torch.tensor([tokens.shape[1]]).to(device)

    with torch.no_grad():
        mel_outputs, stop_tokens = model(
            tokens,
            text_lengths,
            mel_targets=None,
            teacher_forcing=False
        )

    print(f"✅ {text} → {mel_outputs.shape[1] * 0.01:.1f}s")

Example 3: With a vocoder (producing an audio file)

import torch
import soundfile as sf
import numpy as np
import librosa
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

# Tacotron inference
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Tacotron(OptimizedHParams(phase='rl')).to(device)
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Text → mel-spectrogram
text = "안녕하세요"
tokens = torch.tensor([[(ord(c) - 0xAC00) % 2000 for c in text if 0xAC00 <= ord(c) <= 0xD7A3]],
                       dtype=torch.long).to(device)
text_lengths = torch.tensor([tokens.shape[1]]).to(device)

with torch.no_grad():
    mel_outputs, _ = model(tokens, text_lengths, mel_targets=None, teacher_forcing=False)

mel = mel_outputs[0].cpu().numpy()  # (time, 80)

# Vocoder: Griffin-Lim algorithm (simple)
# Note: for better audio quality use WaveGlow, HiFi-GAN, or another neural vocoder
def griffin_lim(mel_spec, n_iter=100, n_fft=2048, sr=48000, hop_length=512):
    """
    Mel-spectrogram → waveform
    """
    # Mel → linear spectrogram via the pseudo-inverse of the mel filterbank
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_spec.shape[1])
    linear = np.maximum(1e-10, mel_spec @ np.linalg.pinv(mel_basis).T).T  # (freq, time)

    # Griffin-Lim: start from a random phase and refine it iteratively
    angles = np.exp(2j * np.pi * np.random.rand(*linear.shape))
    for _ in range(n_iter):
        waveform = librosa.istft(linear * angles, hop_length=hop_length)
        angles = np.exp(1j * np.angle(librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)))

    return waveform

waveform = griffin_lim(mel)

# Save to file
sf.write('output.wav', waveform, 48000)
print("✅ Audio file written: output.wav")

Limitations

⚠️ Language

  • Korean only
  • Other languages (English, Chinese, Japanese, etc.) are not supported

⚠️ Speaker

  • Single-speaker model (one voice)
  • Supporting multiple speakers would require retraining

⚠️ Domain

  • Optimized for everyday conversation
  • Quality may degrade in specialized domains (medical, legal, etc.)

⚠️ Emotional expression

  • No emotion control
  • All output has a neutral tone

⚠️ Audio output

  • A separate vocoder is required
  • The model only produces mel-spectrograms, which must be converted to audio
  • Recommended: WaveGlow, HiFi-GAN, or another neural vocoder

⚠️ Length limit

  • Speech quality degrades on very long sentences (>100 characters)
  • Recommended: at most 50 characters per sentence (see the splitting sketch below)
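One simple way to stay under that limit is to split the input on sentence boundaries before synthesis (a sketch; the punctuation heuristic is ours and the 50-character threshold follows the recommendation above):

import re

def split_for_tts(text, max_chars=50):
    """Split on sentence-ending punctuation, then greedily pack chunks under max_chars."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks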

⚠️ Special characters

  • Special characters and digits are not pronounced
  • Only Hangul syllables are supported

Roadmap

🔜 Multi-speaker support

  • Add speaker-ID embeddings
  • Synthesize several different voices

🔜 Emotion control

  • Control emotions such as happiness, sadness, and anger
  • Add emotion tokens

🔜 Multilingual support

  • Extend to English, Chinese, Japanese, etc.
  • Add language-ID embeddings

🔜 Migration to FastSpeech2

  • Faster inference (removes autoregression)
  • Fully parallel generation

🔜 Vocoder integration

  • Integrate WaveGlow or HiFi-GAN
  • End-to-end speech generation

Citation

If you use this model in research or a project:

BibTeX

@misc{korean_tacotron_tts_2024,
  title={Korean Tacotron Text-to-Speech with Reinforcement Learning},
  author={MinSeok Shin},
  year={2024},
  publisher={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/skytinstone/full-tuned-tacotron-minseok}}
}

Plain text

Shin, MinSeok. (2024). Korean Tacotron Text-to-Speech with Reinforcement Learning.
Hugging Face Hub. Retrieved from https://huggingface.co/skytinstone/full-tuned-tacotron-minseok

๋ผ์ด์„ ์Šค

MIT License - ๋ชจ๋“  ์šฉ๋„๋กœ ์ž์œ ๋กญ๊ฒŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅ (์ƒ์—…์  ์šฉ๋„ ํฌํ•จ)

MIT License

Copyright (c) 2024 MinSeok Shin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

๊ฐ์‚ฌ์˜ ๋ง

  • Tacotron ๋…ผ๋ฌธ: Yuxuan Wang et al., Google (2017)
  • PyTorch: Facebook AI Research
  • Hugging Face: Hugging Face Inc.
  • ๋Œ€ํ•œ๋ฏผ๊ตญ ์Œ์„ฑ ์ปค๋ฎค๋‹ˆํ‹ฐ: ๋ชจ๋“  ๊ธฐ์—ฌ์ž๋ถ„๋“ค

Last updated: October 2024 | Version: 1.0.0 (RL complete)


Changelog

v1.0.0 (2024-10-28)

  • ✅ RL training completed
  • ✅ Uploaded to Hugging Face
  • ✅ Documentation written