Update README.md (#12)

- Update README.md (8f88c5e3032c736a67d6ac696f21f3f25d2bc390)
- Update README.md (00acca2c94f2c6c8150d1cd1b858e6447b16102c)
- Update README.md (19f404348f18bbccdf8b9da95722d65cf6e27ddf)

Co-authored-by: Maha Elbayad <[email protected]>

README.md CHANGED
---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
metrics:
- bleu
- wer
- chrf
inference: false
pipeline_tag: automatic-speech-recognition
tags:
- audio-to-audio
- text-to-speech
- speech-to-text
- text2text-generation
- seamless_communication
library_name: fairseq2
---

# SeamlessM4T Medium

SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.

SeamlessM4T covers:
- 📥 101 languages for speech input
- ⌨️ 96 languages for text input/output
- 🗣️ 35 languages for speech output

-------------------

**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**

**This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.**

**SeamlessM4T v2 is also supported by 🤗 Transformers, more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**

-------------------

This is the "medium" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)

## SeamlessM4T models

| Model Name | #params | checkpoint | metrics |
| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |

We provide extensive evaluation results of the SeamlessM4T models in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers, and as averages in the `metrics` files linked above.

## 🤗 Transformers Usage

First, load the processor and a checkpoint of the model:

```python
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
```
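
If a GPU is available, you can optionally move the model there before generating. A minimal sketch (the `device` variable is our own addition, not part of the original card); note that the processed inputs created below would then need to be moved to the same device:

```python
import torch

# Optional: run on GPU when available; inputs must later be moved with .to(device) as well.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```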

You can seamlessly use this model on text or audio to generate either translated text or translated audio.

Here is how to use the processor to process text and audio:

```python
# Read an audio file and resample it to 16 kHz:
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)  # must be a 16 kHz waveform
audio_inputs = processor(audios=audio, return_tensors="pt")

# Process some input text as well:
text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
```
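
To sanity-check what the processor produced, you can inspect the returned tensors; a small sketch, assuming the usual 🤗 key names (`input_features` for audio, `input_ids` for text):

```python
# Inspect the processed inputs; the key names are assumptions based on
# standard 🤗 feature-extractor / tokenizer conventions.
print(audio_inputs["input_features"].shape)  # log-mel features of the resampled audio
print(text_inputs["input_ids"].shape)        # token ids of the input sentence
```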

### Speech

Generate speech in Russian from either text (T2ST) or speech input (S2ST):

```python
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
```
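
To listen to the results, you can write the arrays to WAV files; a minimal sketch, assuming the output sampling rate is exposed as `model.config.sampling_rate` (16 kHz for this model) and that `scipy` is installed:

```python
from scipy.io import wavfile

# Write the generated waveforms to disk; the file names are our own choice.
sample_rate = model.config.sampling_rate
wavfile.write("out_from_text.wav", rate=sample_rate, data=audio_array_from_text)
wavfile.write("out_from_audio.wav", rate=sample_rate, data=audio_array_from_audio)
```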

### Text

Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, conventionally MT) with the same model.
You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).

```python
# from audio
output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)

# from text
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```
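
The same pattern also covers ASR, since ASR is simply S2TT with the target language set to the language spoken in the audio. A minimal sketch (`"eng"` assumes the input audio is English, as in the sample above):

```python
# ASR: S2TT with tgt_lang equal to the source language of the audio.
output_tokens = model.generate(**audio_inputs, tgt_lang="eng", generate_speech=False)
transcribed_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
```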

## `seamless_communication` usage

You can also use the SeamlessM4T models through the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md), either with the CLI:

```bash
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_medium
```
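
Other tasks should follow the same flag pattern; for instance, a hypothetical T2TT invocation mirroring the flags above (check `m4t_predict --help` for the authoritative interface):

```bash
# Assumed T2TT invocation; the flags mirror the S2ST example and are not verified.
m4t_predict <input_text> --task t2tt --tgt_lang <tgt_lang> --src_lang <src_lang> --model_name seamlessM4T_medium
```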
or with the `Translator` API:

```py
import torch
from seamless_communication.inference import Translator

# Initialize a Translator object with the multitask model and vocoder on the GPU.
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)

# text_generation_opts and unit_generation_opts are generation-option objects;
# see the seamless_communication docs for how to construct them.
text_output, speech_output = translator.predict(
    input=<path_to_input_audio>,
    task_str="S2ST",
    tgt_lang=<tgt_lang>,
    text_generation_opts=text_generation_opts,
    unit_generation_opts=unit_generation_opts,
)
```
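
To save the generated speech, something like the following should work; a sketch assuming `speech_output` exposes `audio_wavs` and `sample_rate` as in the library's README:

```python
import torchaudio

# Save the first generated waveform; the attribute names follow the
# seamless_communication README and are assumptions, not guarantees.
torchaudio.save(
    <path_to_save_audio>,
    speech_output.audio_wavs[0][0].to(torch.float32).cpu(),
    sample_rate=speech_output.sample_rate,
)
```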

## Citation

If you plan to use SeamlessM4T in your work, or any models, datasets, or artifacts published with SeamlessM4T, please cite: