---
base_model: unsloth/Llasa-1B
tags:
- text-to-speech
- transformers
- llama
- trl
- tts
- unsloth
license: apache-2.0
language:
- pl
pipeline_tag: text-to-speech
datasets:
- czyzi0/the-mc-speech-dataset
---

# VoxPolska Auralis: Bringing Polish Speech to Life with Cutting-Edge TTS

![image/png](https://cdn-uploads.huggingface.co/production/uploads/66a8afaf4bbd71186602585e/l3BO6tLJ-1h_qFdBkRNJF.png)

## 📌 Model Highlights

- **Context-Aware Voice:** Generates speech that captures the nuances and tone of the Polish language.
- **Natural Output:** Converts written Polish text into natural, fluent, and expressive speech.
- **Advanced Deep Learning:** Built with modern deep learning techniques for speech synthesis and Polish language processing.

## 🔧 Technical Details

- **Base Model:** Llasa TTS (`unsloth/Llasa-1B`)
- **Optimized Fine-Tuning:** LoRA (Low-Rank Adaptation) with a high-rank setting for efficient and scalable adaptation.
- **High-Fidelity Audio:** Produces clear audio at a 16 kHz sample rate, delivering crisp and realistic voice output.
- **Dataset:** Trained on 24,000+ Polish transcript–audio pairs (`czyzi0/the-mc-speech-dataset`).
- **Merging:** LoRA adapters merged into the base model in 16-bit precision.
- **Audio Decoding:** Customized layer-wise audio generation pipeline built on the XCodec2 codec.
- **Repetition Penalty:** 1.1, to avoid repetitive phrases.
- **Training Optimizations:** Gradient checkpointing and mixed-precision training.
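For readers curious how a LoRA fine-tune like this is typically set up with Unsloth, the sketch below shows one plausible configuration. The base model name comes from the card metadata; the rank, alpha, target modules, and sequence length are illustrative assumptions, not the published training recipe.

```py
from unsloth import FastLanguageModel

# Load the base model listed in the card metadata.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llasa-1B",
    max_seq_length=2048,   # assumption: matches the generation length used below
    load_in_4bit=False,    # assumption: full-precision weights for a 16-bit merge
)

# Attach LoRA adapters. r=128 is an illustrative "high-rank" value;
# the actual rank used for VoxPolska-Auralis is not published.
model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # matches the gradient-checkpointing note above
)
```

After training, the adapters would be merged and saved in 16-bit precision, e.g. with Unsloth's `model.save_pretrained_merged(..., save_method="merged_16bit")`.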
## 🎧 Example Usage (Pipeline)

Here is an example code cell to run the model in a notebook:

```py
!pip install transformers ipython

from transformers import pipeline
from IPython.display import Audio

pipe = pipeline("text-to-speech", model="salihfurkaan/VoxPolska-Auralis")
output = pipe("Cześć, jestem modelem sztucznej inteligencji mówiącym po polsku")
Audio(output["audio"], rate=output["sampling_rate"])
```

## 🎧 Example Usage (Directly)

Here is an example code cell to run the model in a notebook:

```py
!pip install --no-deps unsloth==2025.4.1 bitsandbytes unsloth_zoo trl==0.15.2
!pip install xcodec2==0.1.5 --no-deps
!pip install vector_quantize_pytorch

import torch
import soundfile as sf
from IPython.display import display, Audio
from transformers import AutoTokenizer, AutoModelForCausalLM
from unsloth import FastLanguageModel
from xcodec2.modeling_xcodec2 import XCodec2Model

input_text = "Cześć, jestem modelem sztucznej inteligencji mówiącym po polsku."

XCODEC2_MODEL_NAME = "HKUST-Audio/xcodec2"
SAMPLE_RATE = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the XCodec2 codec used to decode speech tokens back into a waveform.
codec_model = XCodec2Model.from_pretrained(XCODEC2_MODEL_NAME).to(device).eval()

tokenizer = AutoTokenizer.from_pretrained("salihfurkaan/VoxPolska-Auralis")
model = AutoModelForCausalLM.from_pretrained("salihfurkaan/VoxPolska-Auralis").to(device)
FastLanguageModel.for_inference(model)

def ids_to_speech_tokens(speech_ids):
    """Wrap integer speech ids in the <|s_N|> token format."""
    return [f"<|s_{speech_id}|>" for speech_id in speech_ids]

def extract_speech_ids(speech_tokens_str):
    """Parse <|s_N|> tokens back into integer speech ids."""
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            speech_ids.append(int(token_str[4:-2]))
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

with torch.inference_mode():
    with torch.amp.autocast(device, dtype=model.dtype):
        formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"
        chat = [
            {"role": "user", "content": "Convert the text to speech:" + formatted_text},
            {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
        ]
        input_ids = tokenizer.apply_chat_template(
            chat,
            tokenize=True,
            return_tensors='pt',
            continue_final_message=True
        ).to(device)
        speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

        # Generate the speech tokens autoregressively.
        outputs = model.generate(
            input_ids,
            max_length=2048,
            eos_token_id=speech_end_id,
            do_sample=True,
            top_p=0.95,              # nucleus sampling; must be <= 1.0
            temperature=1.2,         # controls randomness in the output
            repetition_penalty=1.1,  # avoids repetitive phrases
        )

        # Drop the prompt and the trailing <|SPEECH_GENERATION_END|> token.
        generated_ids = outputs[0][input_ids.shape[1]:-1]
        speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        speech_ids = extract_speech_ids(speech_tokens)
        speech_tokens = torch.tensor(speech_ids).unsqueeze(0).unsqueeze(0).to(device)

        # Decode the speech tokens into a 16 kHz waveform.
        gen_wav = codec_model.decode_code(speech_tokens)

sf.write("output.wav", gen_wav[0, 0, :].cpu().numpy(), SAMPLE_RATE)
display(Audio(gen_wav[0, 0, :].cpu().numpy(), rate=SAMPLE_RATE))
```

If the model or codec requires authentication, you can create a Hugging Face token [here](https://huggingface.co/settings/tokens).

## 📫 Contact and Support

For questions, suggestions, and feedback, please open an issue on Hugging Face. You can also reach the author via LinkedIn.

## Model Misuse

Do not use this model for impersonation without consent, misinformation or deception (including fake news or fraudulent calls), or any illegal or harmful activity. By using this model, you agree to follow all applicable laws and ethical guidelines.

## Citation

```none
@misc{voxpolska_auralis,
  title={salihfurkaan/VoxPolska-Auralis},
  author={Salih Furkan Erik},
  year={2025},
  url={https://huggingface.co/salihfurkaan/VoxPolska-Auralis/}
}
```