|
|
--- |
|
|
language: ca |
|
|
datasets: |
|
|
- projecte-aina/3catparla_asr |
|
|
- mozilla-foundation/common_voice_17_0 |
|
|
- projecte-aina/corts_valencianes_asr_a |
|
|
- projecte-aina/parlament_parla_v3_asr |
|
|
- projecte-aina/ib3_ca_asr |
|
|
- softcatala/catalan-youtube-speech |
|
|
- projecte-aina/annotated_catalan_common_voice_v17 |
|
|
tags: |
|
|
- hubert |
|
|
- catalan |
|
|
- audio |
|
|
- speech |
|
|
- projecte-aina |
|
|
- barcelona-supercomputing-center |
|
|
- bsc |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- facebook/hubert-base-ls960 |
|
|
metrics: |
|
|
- wer |
|
|
- f1 |
|
|
--- |
|
|
|
|
|
# Table of Contents |
|
|
|
|
|
<details> |
|
|
<summary>Click to expand</summary> |
|
|
|
|
|
- [Model Description](#model-description) |
|
|
- [Intended Uses and Limitations](#intended-uses-and-limitations) |
|
|
- [Pre-training Details](#pre-training-details) |
|
|
- [Indirect evaluation results](#indirect-evaluation-results) |
|
|
- [How to use the model](#how-to-use-the-model) |
|
|
- [Citation](#citation) |
|
|
- [Additional Information](#additional-information) |
|
|
|
|
|
</details> |
|
|
|
|
|
# Model Description |
|
|
|
|
|
This is a HuBERT Base model pre-trained using 1,778 hours of Catalan speech data. |
|
|
The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers. |
|
|
Pre-training was done by [Barcelona Supercomputing Center](https://bsc.es/). |
|
|
|
|
|
# Intended Uses and Limitations |
|
|
|
|
|
This pre-trained model generates Speech Representations that can be used for a wide range of Catalan speech-related tasks. |
|
|
This model does not have a tokenizer, as it was pre-trained on audio alone. |
|
|
|
|
|
In order to use this model for Automatic Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labelled speech data. |
|
|
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition. |
|
|
For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification). |
|
|
|
|
|
# Pre-training Details |
|
|
|
|
|
This model was pre-trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401). |
|
|
|
|
|
For pre-training, a 1,778-hour dataset was created using subsets of the training splits of the following datasets: |
|
|
- [3CatParla (500 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr) (This dataset is currently private; a public release is planned soon.) |
|
|
- [commonvoice 17 (250 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) |
|
|
- [corts_valencianes (250 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public; we trained the model with the non-anonymized version.) |
|
|
- [parlament_parla_v3 (250 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr) |
|
|
- [IB3 (28 hours)](https://huggingface.co/datasets/projecte-aina/ib3_ca_asr) |
|
|
- [Catalan YouTube Speech (500 hours)](https://huggingface.co/datasets/softcatala/catalan-youtube-speech) |
|
|
|
|
|
# Indirect evaluation results |
|
|
|
|
|
To assess the pre-trained Catalan Speech Representations' quality, we evaluated them using two indirect tasks: Catalan Automatic Speech Recognition (ASR) and Catalan Accent Classification. |
|
|
|
|
|
## Catalan Automatic Speech Recognition |
|
|
|
|
|
We created ASR-labelled train and validation datasets using a 100-hour subsample of the pre-training data. |
|
|
For testing, we created a test split by concatenating the test splits from: |
|
|
- [3CatParla (4.5 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr) |
|
|
- [commonvoice 17 (28 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) |
|
|
- [corts_valencianes (6 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public; we used the non-anonymized version.) |
|
|
- [parlament_parla_v3 (27 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr) |
|
|
|
|
|
We fine-tuned the following models on this 100-hour ASR-labelled training split: |
|
|
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model) |
|
|
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) |
|
|
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147) |
|
|
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) |
|
|
|
|
|
All of these models were fine-tuned using exactly the same configuration. |
|
|
We fine-tuned them for 20 epochs, except wav2vec2-large-xlsr-53, which was trained for 10 epochs to keep the compute comparable, since it takes roughly twice as long to train. |
|
|
For fine-tuning, we froze each model's convolutional feature encoder using the `freeze_feature_encoder()` method. |
|
|
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 95% were fine-tuned; wav2vec2-large-xlsr-53 has 311M parameters, of which 98% were fine-tuned. |
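For reference, here is a minimal sketch of the freezing step and of how the fine-tuned parameter fraction can be computed (the `vocab_size` value is a placeholder that must match your own tokenizer; the rest of the fine-tuning setup is omitted):

```python
from transformers import HubertForCTC

# Load the pre-trained encoder with a freshly initialized CTC head
# (vocab_size=39 is a placeholder; it must match your own tokenizer)
model = HubertForCTC.from_pretrained("BSC-LT/hubert-base-ca-2k", vocab_size=39)

# Freeze the convolutional feature encoder; the transformer layers stay trainable
model.freeze_feature_encoder()

# Fraction of parameters that will be updated during fine-tuning
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Fine-tuning {trainable / total:.1%} of {total / 1e6:.0f}M parameters")
```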
|
|
|
|
|
The results were the following: |
|
|
|
|
|
| Model | Train WER | Validation WER | Test WER (lower is better) | |
|
|
|------------------------|--------|-------|-------| |
|
|
| **hubert-base-ca-2k** | 5.1% | 9.6% | 12.1% | |
|
|
| mHuBERT-147 | 9.4% | 14.7% | 18.1% | |
|
|
| wav2vec2-large-xlsr-53 | 10.4% | 12.6% | 21.3% | |
|
|
| hubert-base-ls960 | 15.8% | 21.8% | 26.5% | |
|
|
|
|
|
## Catalan Accent Classification |
|
|
|
|
|
We created train, validation and test datasets labelled for Catalan Accent Classification using an 800-minute (~13-hour) subsample of the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset. |
|
|
Across partitions and accents, there is a substantial imbalance in the number of speakers and in the number of hours available. |
|
|
We created new (smaller) splits, ensuring that: |
|
|
- Every accent has the same number of speakers. |
|
|
- Every speaker has at most 10 sentences (to avoid over-represented speakers). |
|
|
As a result, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits. |
|
|
We used the field `assigned_accent` as the target label. |
|
|
This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic". |
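The exact subsampling code is not published; the sketch below illustrates the balancing logic described above, assuming a list of examples with `speaker_id` and `assigned_accent` fields (the actual column names in the dataset may differ):

```python
import random
from collections import defaultdict

def balance_split(rows, max_sentences_per_speaker=10, seed=42):
    """Cap each speaker's sentences, then equalize the number of speakers per accent."""
    random.seed(seed)

    # Group sentences by speaker and cap each speaker's contribution
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker_id"]].append(row)
    capped = {spk: random.sample(sents, min(len(sents), max_sentences_per_speaker))
              for spk, sents in by_speaker.items()}

    # Group speakers by accent and keep the same number of speakers per accent
    speakers_by_accent = defaultdict(list)
    for spk, sents in capped.items():
        speakers_by_accent[sents[0]["assigned_accent"]].append(spk)
    n_speakers = min(len(spks) for spks in speakers_by_accent.values())

    balanced = []
    for spks in speakers_by_accent.values():
        for spk in random.sample(spks, n_speakers):
            balanced.extend(capped[spk])
    return balanced
```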
|
|
|
|
|
We fine-tuned the following models on this 800-minute accent-labelled training split: |
|
|
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model) |
|
|
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960) |
|
|
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147) |
|
|
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) |
|
|
|
|
|
All of these models were fine-tuned using exactly the same configuration. |
|
|
We fine-tuned them for 10 epochs, except wav2vec2-large-xlsr-53, which was trained for 5 epochs to keep the compute comparable, since it takes roughly twice as long to train. |
|
|
For fine-tuning, we froze the entire base model using the `freeze_base_model()` method, so only the classification head was trained. |
|
|
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 0.2% were fine-tuned; wav2vec2-large-xlsr-53 has 311M parameters, of which 0.1% were fine-tuned. |
|
|
|
|
|
The results were the following: |
|
|
|
|
|
| Model | Train f1-macro | Validation f1-macro | Test f1-macro (higher is better) | |
|
|
|------------------------|--------|-------|-------| |
|
|
| hubert-base-ca-2k | 58.3% | 55.3% | 56.5% | |
|
|
| mHuBERT-147 | 40.7% | 36.6% | 34.0% | |
|
|
| hubert-base-ls960 | 40.6% | 34.2% | 33.6% | |
|
|
| wav2vec2-large-xlsr-53 | 6.7% | 6.6% | 6.7% | |
|
|
|
|
|
# How to use the model |
|
|
|
|
|
## Speech Representations |
|
|
|
|
|
To obtain Speech Representations (HuBERT outputs) from audio in Catalan using this model, you can follow this example: |
|
|
|
|
|
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.) |
|
|
|
|
|
```python |
|
|
|
|
|
from datasets import load_dataset, Audio |
|
|
import torch |
|
|
from transformers import AutoFeatureExtractor, AutoModel |
|
|
|
|
|
# Load the dataset |
|
|
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True) |
|
|
|
|
|
# Resample the audio to 16 kHz |
|
|
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) |
|
|
|
|
|
# Hugging Face pre-trained model path |
|
|
MODEL_NAME = "BSC-LT/hubert-base-ca-2k" |
|
|
|
|
|
# Set device |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
print(f"Using {device} device.") |
|
|
|
|
|
# Load feature extractor |
|
|
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME) |
|
|
|
|
|
# Load model |
|
|
model = AutoModel.from_pretrained(MODEL_NAME) |
|
|
model = model.to(device) |
|
|
|
|
|
def map_to_speech_representations(batch): |
|
|
|
|
|
    # Extract input features from the raw audio |
|
|
audio = batch["audio"] |
|
|
input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values |
|
|
input_features = input_features.to(device) |
|
|
|
|
|
# Extract HuBERT's Speech Representations |
|
|
with torch.no_grad(): |
|
|
outputs = model( |
|
|
input_features, |
|
|
output_hidden_states = True, |
|
|
) |
|
|
speech_representations = outputs.last_hidden_state |
|
|
hidden_states = outputs.hidden_states |
|
|
|
|
|
batch["speech_representations"] = speech_representations |
|
|
batch["hidden_states"] = hidden_states |
|
|
|
|
|
return batch |
|
|
|
|
|
dataset = dataset.map(map_to_speech_representations) |
|
|
|
|
|
print(dataset) |
|
|
``` |
|
|
|
|
|
## Discrete Speech Representations |
|
|
|
|
|
Important remark: the k-means model available in this repo, which is used for extracting Discrete Speech Representations, was trained on features from HuBERT's 6th layer. |
|
|
|
|
|
To obtain Discrete Speech Representations (HuBERT's k-means centroids) from audio in Catalan using this model, you can follow this example: |
|
|
|
|
|
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.) |
|
|
|
|
|
```python |
|
|
|
|
|
from datasets import load_dataset, Audio |
|
|
import torch |
|
|
from transformers import AutoFeatureExtractor, AutoModel |
|
|
import joblib |
|
|
import numpy as np |
|
|
from huggingface_hub import hf_hub_download |
|
|
|
|
|
# Load the dataset |
|
|
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True) |
|
|
|
|
|
# Resample the audio to 16 kHz |
|
|
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000)) |
|
|
|
|
|
# Hugging Face pre-trained model path |
|
|
MODEL_NAME = "BSC-LT/hubert-base-ca-2k" |
|
|
|
|
|
# Set device |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
print(f"Using {device} device.") |
|
|
|
|
|
# Load feature extractor |
|
|
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME) |
|
|
|
|
|
# Load model |
|
|
model = AutoModel.from_pretrained(MODEL_NAME) |
|
|
model = model.to(device) |
|
|
|
|
|
# Load k-means |
|
|
km_path = hf_hub_download(repo_id="BSC-LT/hubert-base-ca-2k", filename="k_means.km") |
|
|
km_model = joblib.load(km_path) |
|
|
clusters = km_model.cluster_centers_ |
|
|
|
|
|
def map_to_discrete_units(batch): |
|
|
|
|
|
    # Extract input features from the raw audio |
|
|
audio = batch["audio"] |
|
|
input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values |
|
|
input_features = input_features.to(device) |
|
|
|
|
|
|
|
|
with torch.no_grad(): |
|
|
outputs = model( |
|
|
input_features, |
|
|
output_hidden_states = True, |
|
|
) |
|
|
|
|
|
# Extract HuBERT's Speech Representations |
|
|
hidden_states = outputs.hidden_states |
|
|
|
|
|
    # Extract 6th-layer features (the layer used to train the k-means model) |
|
|
k_means_input = hidden_states[5].squeeze() |
|
|
k_means_input = k_means_input.cpu() |
|
|
k_means_input = np.array(k_means_input, dtype='f') |
|
|
|
|
|
    # Assign each frame to its nearest k-means cluster
    labels = km_model.predict(k_means_input)

    # Replace each frame with its centroid vector
    batch["discrete_units"] = clusters[labels] |
|
|
|
|
|
return batch |
|
|
|
|
|
dataset = dataset.map(map_to_discrete_units) |
|
|
|
|
|
print(dataset) |
|
|
|
|
|
``` |
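Note that `km_model.predict` returns the discrete cluster IDs themselves, while the example above stores the corresponding centroid vectors. If you need the unit IDs (e.g., as pseudo-labels for unit-based language modelling), store `labels` instead of `clusters[labels]`.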
|
|
|
|
|
## Automatic Speech Recognition |
|
|
|
|
|
In order to use this model for Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labelled speech data. |
|
|
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition. |
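As a starting point, the sketch below follows the recipe from the linked blog post, adapted to this checkpoint; the `vocab.json` file is an assumption and must be built from your own transcriptions:

```python
from transformers import (HubertForCTC, Wav2Vec2CTCTokenizer,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Processor)

# Character-level CTC tokenizer built from your own transcriptions
# ("vocab.json" is assumed to exist; the linked blog post shows how to create it)
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("BSC-LT/hubert-base-ca-2k")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Pre-trained encoder with a freshly initialized CTC head sized to the tokenizer
model = HubertForCTC.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
model.freeze_feature_encoder()
# Then fine-tune with transformers.Trainer on your labelled data, as in the blog post
```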
|
|
|
|
|
## Audio Classification |
|
|
|
|
|
For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification). |
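A minimal sketch of the classification setup used in the accent-classification evaluation above (label names come from the annotated Common Voice dataset; the training loop from the linked tutorial is omitted):

```python
from transformers import HubertForSequenceClassification

accents = ["central", "northern", "northwestern", "valencian", "balearic"]
model = HubertForSequenceClassification.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    num_labels=len(accents),
    label2id={a: i for i, a in enumerate(accents)},
    id2label={i: a for i, a in enumerate(accents)},
)

# Freeze the whole base model so that only the classification head is trained,
# mirroring the accent-classification experiments reported above
model.freeze_base_model()
```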
|
|
|
|
|
# Citation |
|
|
|
|
|
If this model contributes to your research, please cite the work: |
|
|
|
|
|
```bibtex |
|
|
@misc{costa2025hubertbaseca2k, |
|
|
title={CaHuBERT: the first full Catalan pre-trained HuBERT}, |
|
|
author={Costa, Federico and Messaoudi, Abir and Peiró-Lilja, Alex and Casals-Salvador, Marc and España-Bonet, Cristina}, |
|
|
organization={Barcelona Supercomputing Center}, |
|
|
url={https://huggingface.co/BSC-LT/hubert-base-ca-2k}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
# Additional Information |
|
|
|
|
|
### Author |
|
|
|
|
|
The pre-training process was performed during 2025 at the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/). |
|
|
|
|
|
### Contact |
|
|
For further information, please send an email to <[email protected]>. |
|
|
|
|
|
### Copyright |
|
|
Copyright (c) 2025 by the Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
|
|
### License |
|
|
|
|
|
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
|
### Funding |
|
|
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/). |
|
|
|
|
|
The training of the model was possible thanks to the computing time provided by [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5. |
|
|
We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5, hosted at BSC, Spain. |
|
|
|
|
|
### Disclaimer |
|
|
|
|
|
<details> |
|
|
<summary>Click to expand</summary> |
|
|
|
|
|
The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. |
|
|
|
|
|
Be aware that the model may have biases and/or any other undesirable distortions. |
|
|
|
|
|
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) |
|
|
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, |
|
|
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence. |
|
|
|
|
|
In no event shall the owner and creator of the model (Barcelona Supercomputing Center) |
|
|
be liable for any results arising from the use made by third parties. |
|
|
|
|
|
</details> |