Post
				
				
							2080
					Smol TTS models are here! OuteTTS-0.1-350M - Zero shot voice cloning, built on LLaMa architecture, CC-BY license! π₯ 
> Pure language modeling approach to TTS
> Zero-shot voice cloning
> LLaMa architecture w/ Audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp β‘
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tok per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens
The model is extremely impressive for 350M parameters! Kudos to the
OuteAI team on such a brilliant feat - I'd love to see this be applied on larger data and smarter backbones like SmolLM π€
Check out the models here: OuteAI/outetts-6728aa71a53a076e4ba4817c
	> Pure language modeling approach to TTS
> Zero-shot voice cloning
> LLaMa architecture w/ Audio tokens (WavTokenizer)
> BONUS: Works on-device w/ llama.cpp β‘
Three-step approach to TTS:
> Audio tokenization using WavTokenizer (75 tok per second)
> CTC forced alignment for word-to-audio token mapping
> Structured prompt creation w/ transcription, duration, audio tokens
The model is extremely impressive for 350M parameters! Kudos to the
OuteAI team on such a brilliant feat - I'd love to see this be applied on larger data and smarter backbones like SmolLM π€
Check out the models here: OuteAI/outetts-6728aa71a53a076e4ba4817c
