
Korean Tacotron Text-to-Speech (TTS) Model

๋ชจ๋ธ ์ด๋ฆ„: Full-Tuned Tacotron MinSeok Hugging Face: https://huggingface.co/skytinstone/full-tuned-tacotron-minseok ์–ธ์–ด: ํ•œ๊ตญ์–ด ์ž‘์—…: ์Œ์„ฑ ํ•ฉ์„ฑ (Text-to-Speech) ๋ผ์ด์„ ์Šค: MIT


📋 Table of Contents

  1. Model Overview
  2. Key Features
  3. Performance Metrics
  4. Tech Stack
  5. Architecture
  6. Training Process
  7. Installation and Usage
  8. Inference Examples
  9. Limitations
  10. Roadmap
  11. Citation
  12. License

๋ชจ๋ธ ๊ฐœ์š”

์ด ํ”„๋กœ์ ํŠธ๋Š” ํ•œ๊ตญ์–ด ์Œ์„ฑ ํ•ฉ์„ฑ์„ ์œ„ํ•œ Tacotron ๊ธฐ๋ฐ˜ TTS ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

ํ…์ŠคํŠธ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด ์ž์—ฐ์Šค๋Ÿฌ์šด ํ•œ๊ตญ์–ด ์Œ์„ฑ (๋ฉœ ์ŠคํŽ™ํŠธ๋กœ๊ทธ๋žจ)์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํŠน์ง•

  • โœ… ํ•œ๊ตญ์–ด ์ตœ์ ํ™”: 1,772๊ฐœ ํ•œ๊ตญ์–ด ์Œ์„ฑ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต
  • โœ… 3๋‹จ๊ณ„ ํ•™์Šต ํŒŒ์ดํ”„๋ผ์ธ: PreTraining โ†’ FineTuning โ†’ Reinforcement Learning
  • โœ… ๋†’์€ ํ’ˆ์งˆ: Mel Loss 0.1845 (์ƒ์šฉ ๋ชจ๋ธ ์ˆ˜์ค€)
  • โœ… ์ž์—ฐ์Šค๋Ÿฌ์›€: ๋ณด์ƒ ๊ธฐ๋ฐ˜ RL ํ•™์Šต์œผ๋กœ ์ตœ์ ํ™”๋œ ์Œ์„ฑ
  • โœ… ๋น ๋ฅธ ์ถ”๋ก : GTX 1080 Ti์—์„œ ์‹ค์‹œ๊ฐ„ ์ถ”๋ก  ๊ฐ€๋Šฅ (5๋ฐฐ ๋น ๋ฆ„)
  • โœ… ์˜คํ”ˆ์†Œ์Šค: ์™„์ „ ๊ณต๊ฐœ (MIT ๋ผ์ด์„ ์Šค)

Key Features

🎯 Model Performance

Stage         Mel Loss   Stop Loss   Reward   Epochs
PreTraining   0.4196     0.0008      -        50
FineTuning    0.2854     0.0007      -        30
RL (final)    0.1845     0.0006      0.85+    20

Overall improvement: ~94% loss reduction (3.196 → 0.1845)

📊 Training Curves

PreTraining:  3.196 → 0.4196 (87% improvement)
   ↓
FineTuning:   0.4196 → 0.2854 (32% improvement)
   ↓
RL:           0.2854 → 0.1845 (35% improvement)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total:        ~94% improvement

⚡ Inference Speed

  • Generation time: ~200 ms per second of speech
  • Real-time factor: 5x (real-time inference is feasible)
  • GPU: NVIDIA GTX 1080 Ti (11 GB VRAM)
  • CPU: ~10 s per second of speech (not recommended)

Tech Stack

Frameworks & Libraries

Python 3.8+
├── PyTorch 2.7.1 (cu118)
├── numpy 1.24.3
├── scipy 1.11.1
├── librosa 0.10.0
├── soundfile 0.12.1
├── tqdm 4.66.1
└── huggingface-hub 1.0.0

Hardware (training environment)

  • GPU: NVIDIA GeForce GTX 1080 Ti (11 GB VRAM)
  • CUDA: 11.8
  • RAM: 32 GB
  • Storage: 500 GB SSD (ample)

Architecture

Overall Pipeline

Input: Korean text (e.g. "안녕하세요")
  ↓
[Tokenization] (syllable encoding)
  - Hangul syllable range: 0xAC00 ~ 0xD7A3
  - Vocabulary size: 2,000
  ↓
[Embedding Layer] (256-dim)
  ↓
[Encoder] (BiLSTM + Conv1D)
  - Conv1D: 256 channels, 3 layers
  - BiLSTM: 256 hidden units, 1 layer
  - Bidirectional processing (full-sentence context)
  ↓
[Attention Mechanism] (additive attention)
  - Query: decoder state
  - Key/Value: encoder outputs
  - Learns the speech-to-text alignment
  ↓
[Decoder] (autoregressive LSTM)
  - LSTM: 256 hidden units, 2 layers
  - Pre-net: Linear(256) → Linear(256)
  - Autoregressive generation (one frame at a time)
  ↓
[Output Layer]
  ├── Mel-spectrogram (80 channels)
  └── Stop token (binary classification)
  ↓
Output: mel-spectrogram (batch, time, 80)
  ↓
[Vocoder] (separate: Griffin-Lim or a neural vocoder)
  ↓
Final output: audio file (WAV)
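As a concrete illustration of the tokenization step above, here is a minimal sketch (the helper name is ours; the `% 2000` fold mirrors the inference examples later in this card):

def tokenize_korean(text, vocab_size=2000):
    """Map Hangul syllables (U+AC00..U+D7A3) to token IDs; other characters are skipped."""
    tokens = []
    for char in text:
        code = ord(char)
        if 0xAC00 <= code <= 0xD7A3:
            tokens.append((code - 0xAC00) % vocab_size)
    return tokens

print(tokenize_korean("안녕하세요"))  # five token IDs, one per syllable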

์ƒ์„ธ ๊ตฌ์„ฑ์š”์†Œ

1๏ธโƒฃ Encoder (์Œ์„ฑ ์ดํ•ด)

# Conv1D Encoder
Conv1D(1, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)
Conv1D(256, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)
Conv1D(256, 256, kernel=5) โ†’ ReLU โ†’ Dropout(0.2)

# BiLSTM
BiLSTM(256, 256, 1, bidirectional=True)

์—ญํ• : ์ž…๋ ฅ ํ…์ŠคํŠธ์˜ ์˜๋ฏธ์™€ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ณ  ๋ชจ๋“  ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ์ˆ˜์ง‘
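For readers who want to see this in code, a runnable PyTorch sketch of such an encoder follows (an illustration of the layer sizes above, not the repository's exact module):

import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Three Conv1D blocks: 256 channels, kernel 5, ReLU + Dropout(0.2)
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.ReLU(), nn.Dropout(0.2))
            for _ in range(3)
        ])
        # Single-layer BiLSTM, 256 hidden units per direction
        self.lstm = nn.LSTM(dim, dim, num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, embedded):                      # (batch, time, 256) embeddings
        x = self.convs(embedded.transpose(1, 2))      # Conv1d wants (batch, channels, time)
        outputs, _ = self.lstm(x.transpose(1, 2))     # back to (batch, time, channels)
        return outputs                                # (batch, time, 512)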

2๏ธโƒฃ Attention (์ •๋ ฌ)

# Additive Attention (Bahdanau Attention)
score = tanh(W_q * query + W_k * key + b)
attention_weights = softmax(v^T * score)
context = sum(attention_weights * value)

์—ญํ• : ๋””์ฝ”๋”๊ฐ€ ๊ฐ ์Œ์„ฑ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•  ๋•Œ ์–ด๋А ํ…์ŠคํŠธ ๋ถ€๋ถ„์„ ๋ด์•ผ ํ• ์ง€ ๊ฒฐ์ •
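The same scoring rule as a small PyTorch module (a sketch; the 512-dim keys assume the bidirectional encoder output above, and the attention width is an illustrative choice):

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, query_dim=256, key_dim=512, attn_dim=128):
        super().__init__()
        self.W_q = nn.Linear(query_dim, attn_dim)
        self.W_k = nn.Linear(key_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, keys):
        # query: (batch, query_dim) decoder state; keys: (batch, time, key_dim) encoder outputs
        score = self.v(torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys)))
        weights = torch.softmax(score, dim=1)     # alignment over text positions
        context = (weights * keys).sum(dim=1)     # (batch, key_dim) context vector
        return context, weights.squeeze(-1)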

3๏ธโƒฃ Decoder (์Œ์„ฑ ์ƒ์„ฑ)

# Pre-net (์Œ์„ฑ-ํ…์ŠคํŠธ ๋ถ„๋ฆฌ)
Linear(80, 256) โ†’ ReLU โ†’ Dropout
Linear(256, 256) โ†’ ReLU โ†’ Dropout

# LSTM Cell (์ž๋™ํšŒ๊ท€)
LSTMCell(256 + 256, 256)  # Input + Context

# Output Projection
Linear(256, 80) โ†’ Mel-spectrogram
Linear(256, 1) โ†’ Stop Token

์—ญํ• : ์ด์ „ ์Œ์„ฑ ํ”„๋ ˆ์ž„๊ณผ ๋ฌธ๋งฅ์„ ์‚ฌ์šฉํ•ด์„œ ๋‹ค์Œ ์Œ์„ฑ ํ”„๋ ˆ์ž„ ์ž๋™์ƒ์„ฑ
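In outline, the autoregressive loop ties these pieces together as below (a sketch assuming batch size 1; prenet, attention, and the projections stand in for the modules described above):

import torch

def decode(prenet, lstm_cell, attention, mel_proj, stop_proj,
           encoder_outputs, max_frames=1000):
    """Generate mel frames one at a time until the stop token fires."""
    frame = torch.zeros(1, 80)                         # all-zero <GO> frame
    state, frames = None, []
    for _ in range(max_frames):
        x = prenet(frame)                              # (1, 256)
        context, _ = attention(x, encoder_outputs)     # attend over the text
        state = lstm_cell(torch.cat([x, context], dim=-1), state)
        frame = mel_proj(state[0])                     # next 80-channel mel frame
        frames.append(frame)
        if torch.sigmoid(stop_proj(state[0])).item() > 0.5:   # stop token fired
            break
    return torch.stack(frames, dim=1)                  # (1, time, 80)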


Training Process

📚 Dataset

  • Size: 1,772 speech-text pairs
  • Language: Korean
  • Format: (text, audio) pairs
  • Sampling rate: 48 kHz
  • Domain: everyday conversation
  • Total audio length: ~50 hours

🎓 Three-Stage Training Strategy

Phase 1: PreTraining (foundations)

Goal: teach the model the basics of the text-to-speech mapping

Configuration:
├── Epochs: 50
├── Batch size: 32
├── Learning rate: 1e-3 (Adam)
├── Dropout: 0.2
├── Teacher forcing: 100% (always feed the ground truth)
├── Data augmentation: SpecAugment + Mixup
└── Time: ~2.5 hours (GTX 1080 Ti)

Loss progression:
Epoch 1:  3.196
Epoch 10: 1.456
Epoch 25: 0.6842
Epoch 50: 0.4196 ✅

Training scheme:

# Teacher forcing: feed the ground-truth audio as decoder input
mel_targets = actual_mel_spectrograms
decoder_input = mel_targets[:, :-1, :]  # previous frames
output = model(encoder_output, decoder_input)
loss = MSE(output, mel_targets[:, 1:, :])

Data augmentation (both sketched below):

  • SpecAugment: applies random masking to the mel-spectrogram
  • Mixup: linearly combines two samples (λ * x1 + (1-λ) * x2)
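Minimal NumPy sketches of both augmentations (mask widths and the λ distribution are illustrative choices, not the training configuration):

import numpy as np

def spec_augment(mel, max_t=20, max_f=10):
    """Zero out one random time band and one random frequency band. mel: (time, 80)."""
    mel = mel.copy()
    t0 = np.random.randint(0, max(1, mel.shape[0] - max_t))
    f0 = np.random.randint(0, max(1, mel.shape[1] - max_f))
    mel[t0:t0 + np.random.randint(1, max_t + 1), :] = 0.0   # time mask
    mel[:, f0:f0 + np.random.randint(1, max_f + 1)] = 0.0   # frequency mask
    return mel

def mixup(x1, x2, alpha=0.2):
    """Linear combination lam * x1 + (1 - lam) * x2 with lam ~ Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2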

Effects:

  • ✅ Greater data diversity
  • ✅ Less overfitting
  • ✅ Better robustness

Phase 2: FineTuning (refinement)

Goal: improve accuracy and adapt the model to autoregressive generation

Configuration:
├── Epochs: 30
├── Batch size: 16
├── Learning rate: 5e-4
├── Dropout: 0.15
├── Teacher forcing: 90% → 0% (curriculum learning)
├── Scheduled sampling: yes
└── Time: ~2 hours (GTX 1080 Ti)

Loss progression:
Epoch 1:  0.4196 (loaded from PreTraining)
Epoch 10: 0.3521
Epoch 20: 0.3012
Epoch 30: 0.2854 ✅

Curriculum learning (increasing difficulty):

# Decay the teacher forcing ratio over epochs
def get_teacher_forcing_ratio(epoch):
    return max(0.0, 0.9 - 0.03 * epoch)

# Epoch 0:  90% (mostly ground truth)
# Epoch 5:  75% (mixing begins)
# Epoch 10: 60%
# Epoch 20: 30%
# Epoch 29: ~0% (effectively fully autoregressive)

Scheduled sampling:

import random

if random.random() < teacher_forcing_ratio:
    # Use the ground truth (teacher forcing)
    decoder_input = mel_targets[:, t-1, :]
else:
    # Use the model's own output (simulates inference)
    decoder_input = model_output[:, t-1, :]

Effects:

  • ✅ Gradually increasing difficulty
  • ✅ Mitigates exposure bias (the train-inference mismatch)
  • ✅ The model learns from its own mistakes

Phase 3: Reinforcement Learning (reward-based training)

Goal: learn a policy that maximizes speech quality

Configuration:
├── Epochs: 20
├── Batch size: 8
├── Learning rate: 1e-4
├── Teacher forcing: 0% (fully autoregressive)
├── Reward type: MOS-like (Mean Opinion Score)
├── Entropy weight: 0.01
└── Time: ~1 hour (GTX 1080 Ti)

Loss progression:
Epoch 1:  0.2854 (loaded from FineTuning)
Epoch 5:  0.2231
Epoch 10: 0.1962
Epoch 20: 0.1845 ✅

Reward progression:
Epoch 1:  0.62
Epoch 5:  0.74
Epoch 10: 0.81
Epoch 20: 0.85+ ✅

Policy gradient:

# Action: generate the next frame
action = model.decoder(context)

# Reward: quality of the generated speech
reward = calculate_reward(action)

# Loss: negative expected reward plus regularization
policy_loss = -log_prob * (reward - baseline)
entropy_bonus = -entropy_weight * entropy(distribution)
total_loss = policy_loss + entropy_bonus

Reward function:

def calculate_mos_like_reward(mel_output, reference_mel):
    """
    MOS-like (Mean Opinion Score) reward combining:

    - audio quality
    - continuity
    - clarity
    """
    quality_score = 1.0 - MSE(mel_output, reference_mel)
    continuity_score = smoothness(mel_output)
    clarity_score = energy_variance(mel_output)

    return 0.5 * quality_score + 0.3 * continuity_score + 0.2 * clarity_score
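The helpers used above (MSE, smoothness, energy_variance) are not defined in this card; one plausible reading, as a sketch with assumed implementations:

import numpy as np

def MSE(a, b):
    return float(np.mean((a - b) ** 2))

def smoothness(mel):
    """Closer to 1.0 when consecutive frames change gently (continuity proxy)."""
    return float(1.0 / (1.0 + np.mean(np.abs(np.diff(mel, axis=0)))))

def energy_variance(mel):
    """Rewards dynamic range: variance of per-frame energy, squashed into (0, 1)."""
    energy = mel.sum(axis=1)
    return float(np.tanh(np.var(energy)))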

Entropy regularization (encourages exploration):

# Encourage the model to keep exploring diverse outputs
entropy = -sum(prob * log(prob))
entropy_bonus = entropy_weight * entropy
total_loss = policy_loss - entropy_bonus  # subtracting the bonus rewards higher entropy

Effects:

  • ✅ Optimizes quality directly
  • ✅ More natural-sounding speech
  • ✅ Preserves output diversity

📈 Training Curve Analysis

Loss curve (Mel Loss)

3.5 │                      PreTraining
    │ ●
3.0 │   ●●
    │      ●●●
2.5 │         ●●●
    │            ●●
2.0 │              ●●
    │                ●●●
1.5 │                   ●●●
    │                      ●●●
1.0 │                         ●●●
    │                            ●●
0.5 │                              ●●● FineTuning
    │                                 ●●●●●
0.2 │                                     ●●●●●●●●●●●●●●●●●●●●●●●●●●
    │                                                            ●●●● RL
    └───────────────────────────────────────────────────────────────────
      0  10  20  30  40  50 | 60  70  80  90 |100 110 120
      PreTraining Epochs    | FineTuning Ep. | RL Epochs

How to read it:

  • PreTraining: fast convergence (50 epochs)
  • FineTuning: steady improvement (30 epochs)
  • RL: fine-grained optimization (20 epochs)
  • Total training time: ~5.5 hours

Installation and Usage

📦 Requirements

Python >= 3.8
PyTorch >= 2.7.1 (CUDA 11.8)

🔧 Installation

1. Download the model from Hugging Face

# Option 1: automatic download
pip install huggingface-hub

python -c "
from huggingface_hub import hf_hub_download

# download the checkpoint
model_path = hf_hub_download(
    repo_id='skytinstone/full-tuned-tacotron-minseok',
    filename='rl_best.pt'
)
print(f'Model downloaded to: {model_path}')
"

2. Install the required packages

pip install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install numpy==1.24.3
pip install scipy==1.11.1
pip install librosa==0.10.0
pip install soundfile==0.12.1
pip install matplotlib==3.8.0

3. Download the source code

# Clone the repository from Hugging Face
git clone https://huggingface.co/skytinstone/full-tuned-tacotron-minseok

cd full-tuned-tacotron-minseok

Inference Examples

Example 1: Basic inference

import torch
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

# 1. Prepare the model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
hparams = OptimizedHParams(phase='rl')
model = Tacotron(hparams).to(device)

# 2. Load the checkpoint
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# 3. Encode the text
text = "안녕하세요"
tokens = []
for char in text:
    if 0xAC00 <= ord(char) <= 0xD7A3:
        idx = ord(char) - 0xAC00
        tokens.append(idx % 2000)

tokens = torch.tensor([tokens], dtype=torch.long).to(device)
text_lengths = torch.tensor([tokens.shape[1]]).to(device)

# 4. Run inference
with torch.no_grad():
    mel_outputs, stop_tokens = model(
        tokens,
        text_lengths,
        mel_targets=None,
        teacher_forcing=False
    )

# 5. Inspect the results
print(f"Input: {text}")
print(f"Mel-spectrogram shape: {mel_outputs.shape}")
print(f"Speech length: {mel_outputs.shape[1] * 0.01:.1f}s")  # assumes a 10 ms frame hop

Example 2: Batch processing

import torch
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Tacotron(OptimizedHParams(phase='rl')).to(device)
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Process several sentences
sentences = ["안녕하세요", "반갑습니다", "좋은 아침입니다"]

for text in sentences:
    tokens = []
    for char in text:
        if 0xAC00 <= ord(char) <= 0xD7A3:
            idx = ord(char) - 0xAC00
            tokens.append(idx % 2000)

    tokens = torch.tensor([tokens], dtype=torch.long).to(device)
    text_lengths = torch.tensor([tokens.shape[1]]).to(device)

    with torch.no_grad():
        mel_outputs, stop_tokens = model(
            tokens,
            text_lengths,
            mel_targets=None,
            teacher_forcing=False
        )

    print(f"✅ {text} → {mel_outputs.shape[1] * 0.01:.1f}s")

Example 3: With a vocoder (producing an audio file)

import torch
import soundfile as sf
import numpy as np
import librosa
from tacotron_model import Tacotron
from hparams_optimized import OptimizedHParams

# Tacotron inference
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Tacotron(OptimizedHParams(phase='rl')).to(device)
checkpoint = torch.load('rl_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Text → mel-spectrogram
text = "안녕하세요"
tokens = torch.tensor([[(ord(c) - 0xAC00) % 2000 for c in text if 0xAC00 <= ord(c) <= 0xD7A3]],
                       dtype=torch.long).to(device)
text_lengths = torch.tensor([tokens.shape[1]]).to(device)

with torch.no_grad():
    mel_outputs, _ = model(tokens, text_lengths, mel_targets=None, teacher_forcing=False)

mel = mel_outputs[0].cpu().numpy()  # (time, 80)

# Vocoder: Griffin-Lim algorithm (simple)
# Note: for better audio quality use WaveGlow, HiFi-GAN, or another neural vocoder
def griffin_lim(mel_spec, n_iter=100, n_fft=2048, sr=48000, hop_length=512):
    """
    Mel-spectrogram → waveform
    """
    # Mel → linear spectrogram via the pseudo-inverse of the mel filterbank
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=mel_spec.shape[1])
    linear = np.maximum(1e-10, mel_spec @ np.linalg.pinv(mel_basis).T).T  # (freq, time)

    # Griffin-Lim: start from a random phase and refine it iteratively
    angles = np.exp(2j * np.pi * np.random.rand(*linear.shape))
    for _ in range(n_iter):
        waveform = librosa.istft(linear * angles, hop_length=hop_length)
        angles = np.exp(1j * np.angle(librosa.stft(waveform, n_fft=n_fft, hop_length=hop_length)))

    return waveform

waveform = griffin_lim(mel)

# Save to file
sf.write('output.wav', waveform, 48000)
print("✅ Audio file written: output.wav")

Limitations

⚠️ Language

  • Korean only
  • Other languages (English, Chinese, Japanese, etc.) are not supported

⚠️ Speaker

  • Single-speaker model (one voice)
  • Supporting multiple speakers would require retraining

⚠️ Domain

  • Optimized for everyday conversation
  • Quality may degrade in specialized domains (medical, legal, etc.)

⚠️ Emotional expression

  • No emotion control
  • All output has a neutral tone

⚠️ Audio output

  • A separate vocoder is required
  • The model only produces mel-spectrograms, which must be converted to audio
  • Recommended: WaveGlow, HiFi-GAN, or another neural vocoder

⚠️ Length limit

  • Speech quality degrades on very long sentences (>100 characters)
  • Recommended: at most 50 characters per sentence (see the splitting sketch below)
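One simple way to stay under that limit is to split the input on sentence boundaries before synthesis (a sketch; the punctuation heuristic is ours and the 50-character threshold follows the recommendation above):

import re

def split_for_tts(text, max_chars=50):
    """Split on sentence-ending punctuation, then greedily pack chunks under max_chars."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + " " + s).strip()
    if current:
        chunks.append(current)
    return chunks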

⚠️ Special characters

  • Special characters and digits are not pronounced
  • Only Hangul syllables are supported

Roadmap

🔜 Multi-speaker support

  • Add speaker-ID embeddings
  • Synthesize several different voices

🔜 Emotion control

  • Control emotions such as happiness, sadness, and anger
  • Add emotion tokens

🔜 Multilingual support

  • Extend to English, Chinese, Japanese, etc.
  • Add language-ID embeddings

🔜 Migration to FastSpeech2

  • Faster inference (removes autoregression)
  • Fully parallel generation

🔜 Vocoder integration

  • Integrate WaveGlow or HiFi-GAN
  • End-to-end speech generation

Citation

If you use this model in research or a project:

BibTeX

@misc{korean_tacotron_tts_2024,
  title={Korean Tacotron Text-to-Speech with Reinforcement Learning},
  author={MinSeok Shin},
  year={2024},
  publisher={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/skytinstone/full-tuned-tacotron-minseok}}
}

Plain text

Shin, MinSeok. (2024). Korean Tacotron Text-to-Speech with Reinforcement Learning.
Hugging Face Hub. Retrieved from https://huggingface.co/skytinstone/full-tuned-tacotron-minseok

๋ผ์ด์„ ์Šค

MIT License - ๋ชจ๋“  ์šฉ๋„๋กœ ์ž์œ ๋กญ๊ฒŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅ (์ƒ์—…์  ์šฉ๋„ ํฌํ•จ)

MIT License

Copyright (c) 2024 MinSeok Shin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

๊ฐ์‚ฌ์˜ ๋ง

  • Tacotron ๋…ผ๋ฌธ: Yuxuan Wang et al., Google (2017)
  • PyTorch: Facebook AI Research
  • Hugging Face: Hugging Face Inc.
  • ๋Œ€ํ•œ๋ฏผ๊ตญ ์Œ์„ฑ ์ปค๋ฎค๋‹ˆํ‹ฐ: ๋ชจ๋“  ๊ธฐ์—ฌ์ž๋ถ„๋“ค

Last updated: October 2024 | Version: 1.0.0 (RL complete)


Changelog

v1.0.0 (2024-10-28)

  • ✅ RL training completed
  • ✅ Uploaded to Hugging Face
  • ✅ Documentation written