---
language: ca
datasets:
- projecte-aina/3catparla_asr
- mozilla-foundation/common_voice_17_0
- projecte-aina/corts_valencianes_asr_a
- projecte-aina/parlament_parla_v3_asr
- projecte-aina/ib3_ca_asr
- softcatala/catalan-youtube-speech
- projecte-aina/annotated_catalan_common_voice_v17
tags:
- hubert
- catalan
- audio
- speech
- projecte-aina
- barcelona-supercomputing-center
- bsc
license: apache-2.0
base_model:
- facebook/hubert-base-ls960
metrics:
- wer
- f1
---

# Table of Contents

<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [Pre-training Details](#pre-training-details)
- [Indirect evaluation results](#indirect-evaluation-results)
- [How to use the model](#how-to-use-the-model)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

# Model Description

This is a HuBERT Base model pre-trained using 1,778 hours of Catalan speech data.
The model architecture is the same as the [original HuBERT Base model](https://huggingface.co/facebook/hubert-base-ls960), which contains 12 transformer layers.
Pre-training was done by [Barcelona Supercomputing Center](https://bsc.es/).

# Intended Uses and Limitations

This pre-trained model generates Speech Representations that can be used for any Catalan speech-related task.
The model does not include a tokenizer, as it was pre-trained on audio alone. 

In order to use this model for Automatic Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labelled text data. 
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition.
For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification).

# Pre-training Details

This model was pre-trained using code from the [official repository](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert), and the detailed training configuration can be found in the same repository and the [original paper](https://ieeexplore.ieee.org/document/9585401).

For pre-training, a 1,778-hour dataset was assembled from subsets of the training splits of the following datasets:
- [3CatParla (500 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr) (this dataset is currently private, but a public release is planned)
- [commonvoice 17 (250 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- [corts_valencianes (250 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
- [parlament_parla_v3 (250 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr) 
- [IB3 (28 hours)](https://huggingface.co/datasets/projecte-aina/ib3_ca_asr)
- [Catalan YouTube Speech (500 hours)](https://huggingface.co/datasets/softcatala/catalan-youtube-speech)
  
# Indirect evaluation results

To assess the quality of the pre-trained Catalan Speech Representations, we evaluated them on two indirect tasks: Catalan Automatic Speech Recognition (ASR) and Catalan Accent Classification.

## Catalan Automatic Speech Recognition

We created ASR-labelled train and validation datasets using a 100-hour subsample of the pre-training data.
For testing, we created a test split by concatenating the test splits of the following datasets: 
- [3CatParla (4.5 hours)](https://huggingface.co/datasets/projecte-aina/3catparla_asr).
- [commonvoice 17 (28 hours)](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- [corts_valencianes (6 hours)](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (Only the anonymized version of the dataset is public. We trained the model with the non-anonymized version.)
- [parlament_parla_v3 (27 hours)](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3_asr)

We fine-tuned the following models on this 100-hour ASR-labelled training split: 
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)

All of these models were fine-tuned using exactly the same configuration.
We trained them for 20 epochs, except wav2vec2-large-xlsr-53, which was trained for 10 epochs to keep its total training time comparable, since it takes roughly twice as long per epoch.
For the fine-tuning process, we froze the convolutional feature encoder using the freeze_feature_encoder() method; a minimal sketch of this setup follows below.
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 95% were fine-tuned; wav2vec2-large-xlsr-53 has 311M parameters, of which 98% were fine-tuned.
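
As a rough illustration of this setup, here is a minimal sketch (not the exact fine-tuning script; the CTC head is newly initialized and the training configuration is omitted):

```python
from transformers import HubertForCTC

# Load the pre-trained encoder with a randomly initialized CTC head
# (fine-tuning on labelled Catalan data is still required afterwards)
model = HubertForCTC.from_pretrained("BSC-LT/hubert-base-ca-2k")

# Freeze the convolutional feature encoder, as in the evaluation above
model.freeze_feature_encoder()

# Fraction of parameters that remain trainable (roughly 95% for the Base models)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable / total:.1%}")
```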

The results were the following:

| Model | Train WER | Validation WER | Test WER (lower is better) | 
|------------------------|--------|-------|-------|
| **hubert-base-ca-2k**  | 5.1%   | 9.6%  | 12.1% |
| mHuBERT-147            | 9.4%   | 14.7% | 18.1% |
| wav2vec2-large-xlsr-53 | 10.4%  | 12.6% | 21.3% |
| hubert-base-ls960      | 15.8%  | 21.8% | 26.5% |

## Catalan Accent Classification

We created train, validation and test Catalan Accent Classification datasets using an 800-minute (≈13-hour) subsample of the [projecte-aina/annotated_catalan_common_voice_v17](https://huggingface.co/datasets/projecte-aina/annotated_catalan_common_voice_v17) dataset.
For each partition and accent, there is a significant imbalance in the number of speakers and in the amount of hours available.
We therefore created new (smaller) splits, ensuring that:
- every accent has the same number of speakers, and
- every speaker has at most 10 sentences (to avoid over-represented speakers); see the sketch below.

As a result, we obtained balanced train (730 minutes), validation (30 minutes) and test (37 minutes) splits.
We used the field “assigned_accent” as the target label, which can take the following values: "central", "northern", "northwestern", "valencian" or "balearic". 
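
The following is a minimal sketch of the per-speaker cap (not the authors' exact script; the speaker column is assumed to be Common Voice's `client_id`, and balancing the number of speakers per accent is omitted):

```python
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("projecte-aina/annotated_catalan_common_voice_v17", split="train")

sentence_counts = defaultdict(int)

def keep_at_most_10(example):
    # Keep a sentence only while its speaker has fewer than 10 kept sentences
    sentence_counts[example["client_id"]] += 1
    return sentence_counts[example["client_id"]] <= 10

balanced = ds.filter(keep_at_most_10)
```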

We fine-tuned the following models on this 800-minute accent-classification training split: 
- Catalan pre-trained HuBERT: [BSC-LT/hubert-base-ca-2k](https://huggingface.co/BSC-LT/hubert-base-ca-2k) (our model)
- English pre-trained HuBERT: [facebook/hubert-base-ls960](https://huggingface.co/facebook/hubert-base-ls960)
- Multi-lingual pre-trained HuBERT: [utter-project/mHuBERT-147](https://huggingface.co/utter-project/mHuBERT-147)
- Multi-lingual pre-trained wav2vec2: [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)

All of these models were fine-tuned using exactly the same configuration.
We trained them for 10 epochs, except wav2vec2-large-xlsr-53, which was trained for 5 epochs to keep its total training time comparable, since it takes roughly twice as long per epoch.
For the fine-tuning process, we froze the base models' parameters using the freeze_base_model() method, so only the classification head was trained; a minimal sketch follows below.
hubert-base-ca-2k, hubert-base-ls960 and mHuBERT-147 have 94M parameters, of which 0.2% were fine-tuned; wav2vec2-large-xlsr-53 has 311M parameters, of which 0.1% were fine-tuned.
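
As a rough illustration of this setup, here is a minimal sketch (not the exact fine-tuning script; the classification head is newly initialized and the training configuration is omitted):

```python
from transformers import HubertForSequenceClassification

# The five accent values of the "assigned_accent" field
labels = ["central", "northern", "northwestern", "valencian", "balearic"]

model = HubertForSequenceClassification.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    num_labels=len(labels),
    label2id={label: i for i, label in enumerate(labels)},
    id2label={i: label for i, label in enumerate(labels)},
)

# Freeze everything except the small classification head,
# leaving only a fraction of a percent of the parameters trainable
model.freeze_base_model()
```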

The results were the following:

| Model | Train f1-macro | Validation f1-macro | Test f1-macro (higher is better) | 
|------------------------|--------|-------|-------|
| hubert-base-ca-2k      | 58.3%  | 55.3% | 56.5% |
| mHuBERT-147            | 40.7%  | 36.6% | 34.0% |
| hubert-base-ls960      | 40.6%  | 34.2% | 33.6% |
| wav2vec2-large-xlsr-53 | 6.7%   | 6.6%  | 6.7%  |

# How to use the model

## Speech Representations

To obtain Speech Representations (HuBERT outputs) from audio in Catalan using this model, you can follow this example:

(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)

```python
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split="train[:1%]", trust_remote_code=True)

# Resample the audio to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-ca-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)

def map_to_speech_representations(batch):
    # Process the audio
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    # Extract HuBERT's Speech Representations
    with torch.no_grad():
        outputs = model(input_features, output_hidden_states=True)

    batch["speech_representations"] = outputs.last_hidden_state
    batch["hidden_states"] = outputs.hidden_states

    return batch

dataset = dataset.map(map_to_speech_representations)

print(dataset)
```
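
For this Base model, `last_hidden_state` has shape `(1, num_frames, 768)`, with one frame roughly every 20 ms of audio (about 50 frames per second), and `hidden_states` is a tuple of 13 tensors: the encoder input embeddings followed by the outputs of the 12 transformer layers.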

## Discrete Speech Representations

Important remark: the k-means model available in this repo and used for extracting Discrete Speech Representations was trained using HuBERT's 6th layer.

To obtain Discrete Speech Representations (HuBERT's k-means centroids) from audio in Catalan using this model, you can follow this example:

(Using fsspec==2025.3.0, datasets==3.6.0 and transformers==4.52.2 is recommended.)

```python
from datasets import load_dataset, Audio
import torch
from transformers import AutoFeatureExtractor, AutoModel
import joblib
import numpy as np
from huggingface_hub import hf_hub_download

# Load the dataset
dataset = load_dataset("projecte-aina/ib3_ca_asr", split="train[:1%]", trust_remote_code=True)

# Resample the audio to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Hugging Face pre-trained model path
MODEL_NAME = "BSC-LT/hubert-base-ca-2k"

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using {device} device.")

# Load feature extractor
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME)

# Load model
model = AutoModel.from_pretrained(MODEL_NAME)
model = model.to(device)

# Load the k-means model (trained on HuBERT's 6th layer)
km_path = hf_hub_download(repo_id="BSC-LT/hubert-base-ca-2k", filename="k_means.km")
km_model = joblib.load(km_path)
clusters = km_model.cluster_centers_

def map_to_discrete_units(batch):
    # Process the audio
    audio = batch["audio"]
    input_features = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_values
    input_features = input_features.to(device)

    # Extract HuBERT's Speech Representations
    with torch.no_grad():
        outputs = model(input_features, output_hidden_states=True)
        hidden_states = outputs.hidden_states

        # Extract the 6th-layer features expected by the k-means model
        k_means_input = hidden_states[5].squeeze()
        k_means_input = k_means_input.cpu()
        k_means_input = np.array(k_means_input, dtype="f")

        # Assign each frame to its nearest cluster centroid
        labels = km_model.predict(k_means_input)
        batch["discrete_units"] = clusters[labels]

    return batch

dataset = dataset.map(map_to_discrete_units)

print(dataset)
```
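
In this example, `labels` contains one integer cluster ID per frame, and `batch["discrete_units"]` stores the corresponding centroid vectors; if you only need the discrete unit IDs, store `labels` instead.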

## Automatic Speech Recognition

In order to use this model for Speech Recognition, a tokenizer should be created and the model should be fine-tuned on labelled text data. 
Check out [this blog](https://huggingface.co/blog/fine-tune-wav2vec2-english) for a more detailed explanation of how to fine-tune the model for Speech Recognition.
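
The following is a minimal sketch of that process, following the approach in the blog linked above. The character vocabulary here is a hypothetical toy example; in practice it should be built from the characters of your labelled Catalan training text.

```python
import json
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    HubertForCTC,
)

# Hypothetical toy vocabulary; "|" stands for the word boundary
# (spaces are replaced with "|" in the transcriptions before training)
chars = list("abcdefghijklmnopqrstuvwxyzàçèéíïòóúü·'-") + ["|"]
vocab = {c: i for i, c in enumerate(chars)}
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("BSC-LT/hubert-base-ca-2k")
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder with a new, randomly initialized CTC head
model = HubertForCTC.from_pretrained(
    "BSC-LT/hubert-base-ca-2k",
    vocab_size=len(vocab),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()
# ...then fine-tune with a Trainer on (audio, transcription) pairs, as in the blog
```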

## Audio Classification

For an explanation of how to fine-tune the model for Audio Classification, check out [this tutorial](https://huggingface.co/docs/transformers/main/en/tasks/audio_classification).

# Citation

If this model contributes to your research, please cite the work:

```bibtex
@misc{costa2025hubertbaseca2k,
      title={CaHuBERT: the first full Catalan pre-trained HuBERT},
      author={Costa, Federico and Messaoudi, Abir and Peiró-Lilja, Alex and Casals-Salvador, Marc and España-Bonet, Cristina},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/BSC-LT/hubert-base-ca-2k},
      year={2025}
}
```

# Additional Information

### Author

The pre-training process was performed during 2025 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/).

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center.

### License

[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

The training of the model was made possible by computing time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.
We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 at BSC, Spain.

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0. 

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) 
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, 
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center) 
be liable for any results arising from the use made by third parties.

</details>