---
language: ca
datasets:
- projecte-aina/3catparla_asr
- mozilla-foundation/common_voice_17_0
- projecte-aina/corts_valencianes_asr_a
- projecte-aina/parlament_parla_v3_asr
- projecte-aina/ib3_ca_asr
- softcatala/catalan-youtube-speech
- projecte-aina/annotated_catalan_common_voice_v17
tags:
- hubert
- catalan
- audio
- speech
- projecte-aina
- barcelona-supercomputing-center
- bsc
license: apache-2.0
base_model:
- facebook/hubert-base-ls960
metrics:
- wer
- f1
---
# Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [Pre-training Details](#pre-training-details)
- [Indirect evaluation results](#indirect-evaluation-results)
- [How to use the model](#how-to-use-the-model)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>
# Model Description
This is a HuBERT Base model pre-trained using 1,778 hours of Catalan speech data.
The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers.
Pre-training was done by [Barcelona Supercomputing Center](https://bsc.es/).
# Intended Uses and Limitations
This pre-trained model generates Speech Representations that can be used for any Catalan speech-related task.
This model does not have a tokenizer as it was pretrained on audio alone.
In order to use this model for Automatic Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data.
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition.
For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification).
# Pre-training Details
This model was pre-trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).
For pre-training, a 1,778-hour dataset was created using subsets of the training splits of the following datasets:
- [3CatParla (500 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr) (this dataset is currently private; a public release is planned)
- [commonvoice 17 (250 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- [corts_valencianes (250 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
- [parlament_parla_v3 (250 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr)
- [IB3 (28 hours)](https://huggingface.co/datasets/projecte-aina/ib3_ca_asr)
- [Catalan YouTube Speech (500 hours)](https://huggingface.co/datasets/softcatala/catalan-youtube-speech)
# Indirect Evaluation Results
To assess the pre-trained Catalan Speech Representations' quality, we evaluated them using two indirect tasks: Catalan Automatic Speech Recognition (ASR) and Catalan Accent Classification.
## Catalan Automatic Speech Recognition
We created train and validation ASR-labelled datasets using a 100-hour subsample of the pre-training dataset.
For testing, we created a test split concatenating all the test splits from:
- [3CatParla (4.5 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr)
- [commonvoice 17 (28 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- [corts_valencianes (6 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
- [parlament_parla_v3 (27 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr)
We fine-tuned the following models on this 100-hour ASR-labelled training split:
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
All of these models were fine-tuned using exactly the same configuration.
We trained them for 20 epochs, except wav2vec2-large-xlsr-53, which was trained for 10 epochs to keep the training time comparable, since it takes roughly twice as long to train.
For the fine-tuning process, we froze each model's convolutional feature encoder using the `freeze_feature_encoder()` method.
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 95% were fine-tuned. wav2vec2-large-xlsr-53 has 311M parameters, of which 98% were fine-tuned.
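For reference, here is a minimal sketch of this fine-tuning setup; the `vocab_size` and `pad_token_id` values below are placeholders and should come from the CTC tokenizer you build for your labelled data:

```python
from transformers import HubertForCTC

# Load the pre-trained model with a randomly initialized CTC head.
# vocab_size and pad_token_id are assumptions: take them from your tokenizer.
model = HubertForCTC.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    ctc_loss_reduction="mean",
    pad_token_id=0,
    vocab_size=44,
)

# Freeze the convolutional feature encoder, as described above;
# the transformer layers and the CTC head remain trainable.
model.freeze_feature_encoder()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Fine-tuning {trainable / total:.0%} of {total / 1e6:.0f}M parameters")
```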
The results were the following:
| Model | Train WER | Validation WER | Test WER (lower is better) |
|------------------------|--------|-------|-------|
| **hubert-base-ca-2k** | 5.1% | 9.6% | 12.1% |
| mHuBERT-147 | 9.4% | 14.7% | 18.1% |
| wav2vec2-large-xlsr-53 | 10.4% | 12.6% | 21.3% |
| hubert-base-ls960 | 15.8% | 21.8% | 26.5% |
## Catalan Accent Classification
We created train, validation and test Catalan Accent Classification-labelled datasets using an 800-minute (~13-hour) subsample of the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
For each partition and accent, there is a significant imbalance in the number of speakers and in the number of hours available.
We created new (smaller) splits ensuring that:
- Every accent has the same number of speakers.
- Every speaker has at most 10 sentences (to avoid over-represented speakers), as sketched below.
As a result, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits.
We used the field “assigned_accent” as the target label.
This label can take the following values: "central", "northern", "northwestern", "valencian" or "balearic".
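A minimal sketch of the per-speaker capping step, assuming the speaker-id column is named `client_id` (the usual Common Voice convention; check the dataset's actual schema):

```python
from collections import defaultdict
from datasets import load_dataset

dataset = load_dataset("projecte-aina/annotated_catalan_common_voice_v17", split="train")

# Cap each speaker at 10 sentences to avoid over-represented speakers.
# "client_id" is an assumed column name; verify it against the dataset.
seen = defaultdict(int)

def keep_at_most_10(example):
    seen[example["client_id"]] += 1
    return seen[example["client_id"]] <= 10

dataset = dataset.filter(keep_at_most_10)
```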
We fine-tuned the following models on this 800-minute Catalan Accent Classification-labelled training split:
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
All of these models were fine-tuned using exactly the same configuration.
We trained them for 10 epochs, except wav2vec2-large-xlsr-53, which was trained for 5 epochs to keep the training time comparable, since it takes roughly twice as long to train.
For the fine-tuning process, we froze each model's backbone using the `freeze_base_model()` method.
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 0.2% were fine-tuned. wav2vec2-large-xlsr-53 has 311M parameters, of which 0.1% were fine-tuned.
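For reference, here is a minimal sketch of this classification setup, using the five accent labels listed above:

```python
from transformers import HubertForSequenceClassification

ACCENTS = ["central", "northern", "northwestern", "valencian", "balearic"]

model = HubertForSequenceClassification.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    num_labels=len(ACCENTS),
    label2id={a: i for i, a in enumerate(ACCENTS)},
    id2label={i: a for i, a in enumerate(ACCENTS)},
)

# Freeze the whole HuBERT backbone; only the small classification
# head on top is fine-tuned, as described above.
model.freeze_base_model()
```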
The results were the following:
| Model | Train F1-macro | Validation F1-macro | Test F1-macro (higher is better) |
|------------------------|--------|-------|-------|
| **hubert-base-ca-2k** | 58.3% | 55.3% | 56.5% |
| mHuBERT-147 | 40.7% | 36.6% | 34.0% |
| hubert-base-ls960 | 40.6% | 34.2% | 33.6% |
| wav2vec2-large-xlsr-53 | 6.7% | 6.6% | 6.7% |
# How to use the model
## Speech Representations
To obtain Speech Representations (HuBERT outputs) from audio in Catalan using this model, you can follow this example:
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)
```python
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel
# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True)

# Resample the audio to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-ca-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)

def map_to_speech_representations(batch):
    # Turn the raw waveform into model inputs
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    # Extract HuBERT's Speech Representations
    with torch.no_grad():
        outputs = model(
            input_features,
            output_hidden_states=True,
        )
    speech_representations = outputs.last_hidden_state
    hidden_states = outputs.hidden_states

    # Move tensors to CPU so they can be stored in the dataset
    batch["speech_representations"] = speech_representations.cpu().numpy()
    batch["hidden_states"] = [layer.cpu().numpy() for layer in hidden_states]
    return batch
dataset = dataset.map(map_to_speech_representations)
print(dataset)
```
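After the map has run, every example carries frame-level representations of shape (1, frames, 768). If a single fixed-size utterance embedding is needed (e.g. for retrieval or clustering), mean pooling over the time axis is a simple option:

```python
import torch

# Collapse the frame-level representations of the first utterance
# into one 768-dimensional utterance embedding by mean pooling over time.
reps = torch.tensor(dataset[0]["speech_representations"])  # (1, frames, 768)
utterance_embedding = reps.mean(dim=1).squeeze(0)          # (768,)
print(utterance_embedding.shape)
```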
## Discrete Speech Representations
Important remark: the k-means model available in this repo and used for extracting Discrete Speech Representations was trained using HuBERT's 6th layer.
To obtain Discrete Speech Representations (HuBERT's k-means centroids) from audio in Catalan using this model, you can follow this example:
(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)
```python
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split='train[:1%]', trust_remote_code=True)

# Resample the audio to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-ca-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)

# Load the k-means model from this repository
km_path = hf_hub_download(repo_id="BSC-LT/hubert-base-ca-2k", filename="k_means.km")
km_model = joblib.load(km_path)
clusters = km_model.cluster_centers_

def map_to_discrete_units(batch):
    # Turn the raw waveform into model inputs
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    with torch.no_grad():
        outputs = model(
            input_features,
            output_hidden_states=True,
        )

    # Extract HuBERT's Speech Representations
    hidden_states = outputs.hidden_states

    # Extract the 6th-layer features used by the k-means model
    k_means_input = hidden_states[5].squeeze()
    k_means_input = k_means_input.cpu()
    k_means_input = np.array(k_means_input, dtype='f')

    # Assign each frame to its nearest k-means centroid
    labels = km_model.predict(k_means_input)
    batch["discrete_units"] = clusters[labels]
    return batch
dataset = dataset.map(map_to_discrete_units)
print(dataset)
```
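Note that the map above stores the centroid vectors themselves. If you instead want a compact sequence of discrete unit ids (e.g. for unit language modeling), you can keep the predicted labels and collapse consecutive repeats; a sketch:

```python
import itertools
import numpy as np

def to_unit_sequence(layer6_features, km_model):
    """Map 6th-layer features to a deduplicated sequence of k-means unit ids."""
    labels = km_model.predict(np.asarray(layer6_features, dtype="f"))
    # Collapse runs of identical consecutive units
    return [int(unit) for unit, _ in itertools.groupby(labels)]
```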
## Automatic Speech Recognition
In order to use this model for Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labeled text data.
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition.
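As a starting point, a minimal sketch of the tokenizer-creation step is shown below; the vocabulary is a toy example and should be built from the full character set of your labelled transcriptions:

```python
import json
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor

# Toy character vocabulary; replace with the character set of your corpus.
vocab = {c: i for i, c in enumerate(["<pad>", "<unk>", "|", "a", "b", "c", "ç"])}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="<unk>", pad_token="<pad>", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("BSC-LT/hubert-base-ca-2k")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
```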
## Audio Classification
For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification).
# Citation
If this model contributes to your research, please cite the work:
```bibtex
@misc{costa2025hubertbaseca2k,
  title={CaHuBERT: the first full Catalan pre-trained HuBERT},
  author={Costa, Federico and Messaoudi, Abir and Peiró-Lilja, Alex and Casals-Salvador, Marc and España-Bonet, Cristina},
  organization={Barcelona Supercomputing Center},
  url={https://huggingface.co/BSC-LT/hubert-base-ca-2k},
  year={2025}
}
```
# Additional Information
### Author
The pre-training process was carried out during 2025 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/).
### Contact
For further information, please send an email to <[email protected]>.
### Copyright
Copyright(c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.
### License
[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).
The training of the model was possible thanks to the computing time provided by [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.
We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5, hosted by BSC, Spain.
### Disclaimer
<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.
Be aware that the model may have biases and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.
</details> |