Fine-tuning openaudio-s1-mini for Persian (Farsi) – Feasibility & Guidance

#4
by arshambz - opened

Hi and thanks for sharing this great model!

I'm interested in fine-tuning fishaudio/openaudio-s1-mini for Persian (Farsi) audio tasks, such as speech recognition or audio classification. I’d appreciate your guidance on a few key points:

Is fine-tuning for a new language like Persian supported or practical with this model?
Given that it's trained in English, I’d like to know how transferable the learned representations are to a different language, especially a low-resource one.

Roughly how much audio data would be needed for a meaningful fine-tuning on Persian?
I understand it depends on the task and setup, but a ballpark estimate would help a lot (e.g., hours of audio, number of samples, etc.).

Are there any recommended training settings or constraints (batch size, LR, augmentation, etc.) that you found important when fine-tuning this architecture?

Does the model architecture support freezing early layers, or is end-to-end fine-tuning preferable?
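For reference on the freezing question, here is a minimal sketch of how partial fine-tuning is commonly done in PyTorch: disable gradients for the first N blocks and give the optimizer only the parameters that remain trainable. The toy model, the block count, `freeze_first_blocks`, and the learning rate are hypothetical placeholders. They are not the openaudio-s1-mini architecture and not settings recommended by Fish Audio. End-to-end fine-tuning corresponds to `n_frozen=0`.

```python
import torch
from torch import nn


# Hypothetical toy stand-in for a stack of blocks; NOT the openaudio-s1-mini architecture.
def build_toy_model(num_blocks: int = 6, dim: int = 32) -> nn.Sequential:
    blocks = [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_blocks)]
    return nn.Sequential(*blocks)


def freeze_first_blocks(model: nn.Sequential, n_frozen: int) -> None:
    """Turn off gradients for the first n_frozen child blocks; keep the rest trainable."""
    for i, block in enumerate(model):
        trainable = i >= n_frozen
        for p in block.parameters():
            p.requires_grad = trainable


def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total


model = build_toy_model()
freeze_first_blocks(model, n_frozen=4)  # freeze 4 of the 6 toy blocks

# Only trainable parameters go to the optimizer; lr is an illustrative placeholder.
optimizer = torch.optim.AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-4)

# One dummy step: frozen blocks receive no gradients, later blocks do.
x = torch.randn(8, 32)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()

trainable, total = count_params(model)
print(f"trainable {trainable} / total {total}")
```

Whether freezing early layers or full end-to-end training works better for a new language would need to be tested on the actual model.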

Finally, do you provide or suggest any starter scripts, notebooks, or best practices for fine-tuning?

I’d really appreciate any help or pointers. Thank you in advance for your work and time!

Fish Audio org
edited Jun 10

WIP. We may update the fine-tuning section in July.

Any update on how to fine-tune the s1-mini model?
The previous fine-tuning docs are no longer available. I tried to adapt the code (the fine-tuning code there was written for fish-speech 1.5) but couldn't get it working. It would be great to be able to fine-tune it, especially for languages where the accent is incorrect (Russian, Japanese). To be honest, it's a great model for its size, and I'd love to push it further.
