SeamlessM4T-v2
This model was released on 2023-12-08 and added to Hugging Face Transformers on 2023-11-30.
Overview
The SeamlessM4T-v2 model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.
SeamlessM4T-v2 is a collection of models designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version. For more details on the differences between v1 and v2, refer to the section Difference with SeamlessM4T-v1.
SeamlessM4T-v2 enables multiple tasks without relying on separate models:
- Speech-to-speech translation (S2ST)
- Speech-to-text translation (S2TT)
- Text-to-speech translation (T2ST)
- Text-to-text translation (T2TT)
- Automatic speech recognition (ASR)
SeamlessM4Tv2Model can perform all the above tasks, but each task also has its own dedicated sub-model.
The abstract from the paper is the following:
Recent advancements in automatic speech translation have dramatically expanded language coverage, improved multimodal capabilities, and enabled a wide range of tasks and functionalities. That said, large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model—SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. The expanded version of SeamlessAlign adds 114,800 hours of automatically aligned data for a total of 76 languages. SeamlessM4T v2 provides the foundation on which our two newest models, SeamlessExpressive and SeamlessStreaming, are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one’s voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention (EMMA) mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To understand the performance of these models, we combined novel and modified versions of existing automatic metrics to evaluate prosody, latency, and robustness. For human evaluations, we adapted existing protocols tailored for measuring the most relevant attributes in the preservation of meaning, naturalness, and expressivity. 
To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. In sum, Seamless gives us a pivotal look at the technical foundation needed to turn the Universal Speech Translator from a science fiction concept into a real-world technology. Finally, contributions in this work—including models, code, and a watermark detector—are publicly released and accessible at the link below.
Usage
In the following example, we’ll load an Arabic audio sample and an English text sample and convert them into Russian speech and French text.
First, load the processor and a checkpoint of the model:
>>> from transformers import AutoProcessor, SeamlessM4Tv2Model
>>> processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
>>> model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")
You can use this model on text or on audio to generate either translated text or translated audio.
Here is how to use the processor to process text and audio:
>>> # let's load an audio sample from an Arabic speech corpus
>>> from datasets import load_dataset
>>> dataset = load_dataset("halabi2016/arabic_speech_corpus", split="test", streaming=True)
>>> audio_sample = next(iter(dataset))["audio"]
>>> # now, process it
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
>>> # now, process some English text as well
>>> text_inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
Speech
SeamlessM4Tv2Model can seamlessly generate text or speech with few or no changes. Let’s target Russian voice translation:
>>> audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
>>> audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
With basically the same code, we have translated English text and Arabic speech into Russian speech samples.
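To listen to a generated waveform, it can be written to a WAV file. The sketch below uses only the Python standard library. It assumes a mono waveform with float values in [-1, 1] and a 16 kHz sampling rate; in practice the rate should be read from the model configuration rather than hard-coded. `save_wav` is a hypothetical helper for illustration, not part of Transformers.

```python
import struct
import wave


def save_wav(path, samples, sampling_rate=16000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip first so the integer conversion cannot overflow 16 bits.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
```

With the arrays from the snippet above, a call such as `save_wav("out_from_text.wav", audio_array_from_text)` would store the Russian speech generated from the English text.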
Text
Similarly, you can generate translated text from audio files or from text with the same model. You only have to pass generate_speech=False to SeamlessM4Tv2Model.generate().
This time, let’s translate to French.
>>> # from audio
>>> output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
>>> # from text
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
Tips
1. Use dedicated models
SeamlessM4Tv2Model is the top-level Transformers model for generating speech and text, but you can also use dedicated models that perform the task without additional components, thus reducing the memory footprint. For example, you can replace the audio-to-audio generation snippet with the model dedicated to the S2ST task; the rest of the code is exactly the same:
>>> from transformers import SeamlessM4Tv2ForSpeechToSpeech
>>> model = SeamlessM4Tv2ForSpeechToSpeech.from_pretrained("facebook/seamless-m4t-v2-large")
Or you can replace the text-to-text generation snippet with the model dedicated to the T2TT task; you only have to remove generate_speech=False.
>>> from transformers import SeamlessM4Tv2ForTextToText
>>> model = SeamlessM4Tv2ForTextToText.from_pretrained("facebook/seamless-m4t-v2-large")
Feel free to try out SeamlessM4Tv2ForSpeechToText and SeamlessM4Tv2ForTextToSpeech as well.
2. Change the speaker identity
You can change the speaker used for speech synthesis with the speaker_id argument. Some speaker_id values work better than others for some languages!
3. Change the generation strategy
You can use different generation strategies for text generation, e.g. .generate(input_ids=input_ids, text_num_beams=4, text_do_sample=True), which will perform multinomial beam-search decoding on the text model. Note that speech generation only supports greedy decoding (the default) or multinomial sampling, which can be used with e.g. .generate(..., speech_do_sample=True, speech_temperature=0.6).
4. Generate speech and text at the same time
Use return_intermediate_token_ids=True with SeamlessM4Tv2Model to return both speech and text!
Model architecture
SeamlessM4T-v2 features a versatile architecture that smoothly handles the sequential generation of text and speech. This setup comprises two sequence-to-sequence (seq2seq) models. The first model translates the input modality into translated text, while the second model generates speech tokens, known as “unit tokens,” from the translated text.
Each modality has its own dedicated encoder with a unique architecture. Additionally, for speech output, a vocoder inspired by the HiFi-GAN architecture is placed on top of the second seq2seq model.
Difference with SeamlessM4T-v1
The architecture of this new version differs from the first in a few aspects:
Improvements on the second-pass model
The second seq2seq model, named the text-to-unit model, is now non-autoregressive, meaning that it computes units in a single forward pass. This is made possible by:
- the use of character-level embeddings, meaning that each character of the predicted translated text has its own embeddings, which are then used to predict the unit tokens.
- the use of an intermediate duration predictor that predicts speech duration at the character level on the predicted translated text.
- the use of a new text-to-unit decoder mixing convolutions and self-attention to handle longer context.
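To see why a duration predictor enables single-pass decoding, consider the toy sketch below: once every character has a predicted duration, the output length is known in advance, so each character embedding can be repeated for its number of frames and all positions can be decoded in parallel. This is an illustration on plain Python lists under that assumption; the function name and the values are hypothetical, and the model's actual upsampling operates on tensors and may differ in detail.

```python
def upsample_by_duration(char_embeddings, durations):
    """Repeat each character embedding `durations[i]` times."""
    if len(char_embeddings) != len(durations):
        raise ValueError("one duration per character is required")
    frames = []
    for embedding, duration in zip(char_embeddings, durations):
        frames.extend([embedding] * duration)
    return frames


# Three characters with 2-dimensional toy embeddings.
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
durations = [2, 3, 0]  # a duration of 0 drops the character entirely
frames = upsample_by_duration(embeddings, durations)
print(len(frames))  # 5
```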
Difference in the speech encoder
The speech encoder, which is used during the first-pass generation process to predict the translated text, differs mainly from the previous speech encoder through these mechanisms:
- the use of a chunked attention mask to prevent attention across chunks, ensuring that each position attends only to positions within its own chunk and a fixed number of previous chunks.
- the use of relative position embeddings, which only consider the distance between sequence elements rather than absolute positions. Please refer to Self-Attention with Relative Position Representations (Shaw et al.) for more details.
- the use of a causal depth-wise convolution instead of a non-causal one.
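The chunked attention mask described above can be sketched in a few lines. The example builds a boolean matrix in which entry `[i][j]` says whether position `i` may attend to position `j`. The parameter names `chunk_size` and `left_chunks` are chosen for this illustration and are not taken from the library's configuration.

```python
def chunked_attention_mask(seq_len, chunk_size, left_chunks):
    """mask[i][j] is True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        query_chunk = i // chunk_size
        row = []
        for j in range(seq_len):
            key_chunk = j // chunk_size
            # Allowed: the query's own chunk and up to `left_chunks` earlier chunks.
            row.append(query_chunk - left_chunks <= key_chunk <= query_chunk)
        mask.append(row)
    return mask


mask = chunked_attention_mask(seq_len=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# ..xxxx
# ..xxxx
```

No position ever attends to a later chunk, so such a mask keeps the encoder compatible with processing audio chunk by chunk.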
Generation process
Here’s how the generation process works:
- Input text or speech is processed through its specific encoder.
- A decoder creates text tokens in the desired language.
- If speech generation is required, the second seq2seq model generates unit tokens in a non-autoregressive way.
- These unit tokens are then passed through the final vocoder to produce the actual speech.
This model was contributed by ylacombe. The original code can be found here.
SeamlessM4Tv2Model
class transformers.SeamlessM4Tv2Model
< source >( config current_modality = 'text' )
Parameters
- config (SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- current_modality (str, optional, defaults to "text") — Default modality. Used to initialize the model.
The original SeamlessM4Tv2 Model transformer, which can be used for all available tasks (S2ST, S2TT, T2TT, T2ST).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 generate_speech: typing.Optional[bool] = True **kwargs  ) → Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor], ModelOutput]
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks), optional) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
- return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio. Note that if generate_speech=False, this parameter will be ignored and the text tokens are returned.
- tgt_lang (str, optional) — The language to use as target language for translation.
- speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
- generate_speech (bool, optional, defaults to True) — If False, only returns the text tokens and won’t generate speech.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:
  - Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which will only be passed through the text components.
  - With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.
  This means you can, for example, specify a generation strategy for one generation but not for the other.
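The prefix rule for keyword arguments can be illustrated with a small stand-alone function. This is a sketch of the documented behavior only (unprefixed arguments go to both sub-models, prefixed ones go to a single sub-model and take priority, and decoder_input_ids is reserved for the text components). The function name is hypothetical and this is not the library's actual implementation.

```python
def split_generation_kwargs(kwargs):
    """Split kwargs into (text_kwargs, speech_kwargs) following the prefix rule."""
    text_kwargs, speech_kwargs = {}, {}
    # First pass: unprefixed keys are shared by both sub-models.
    for key, value in kwargs.items():
        if key.startswith("text_") or key.startswith("speech_"):
            continue
        text_kwargs[key] = value
        if key != "decoder_input_ids":  # only passed to the text components
            speech_kwargs[key] = value
    # Second pass: prefixed keys override any unprefixed value.
    for key, value in kwargs.items():
        if key.startswith("text_"):
            text_kwargs[key[len("text_"):]] = value
        elif key.startswith("speech_"):
            speech_kwargs[key[len("speech_"):]] = value
    return text_kwargs, speech_kwargs


text_kwargs, speech_kwargs = split_generation_kwargs(
    {"num_beams": 4, "do_sample": False, "speech_do_sample": True, "text_num_beams": 2}
)
print(text_kwargs)    # {'num_beams': 2, 'do_sample': False}
print(speech_kwargs)  # {'num_beams': 4, 'do_sample': True}
```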
Returns
Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor], ModelOutput]
- If generate_speech and return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If generate_speech and not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
- If generate_speech=False, returns ModelOutput.
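When speech is generated for a batch, the waveforms share one sequence_length, so waveform_lengths is what tells the samples apart. The sketch below assumes that shorter samples are padded at the end and that each length counts the valid samples. It works on plain Python lists, and `trim_waveforms` is a hypothetical helper rather than a library function.

```python
def trim_waveforms(waveforms, waveform_lengths):
    """Drop trailing padding from each waveform in a batch."""
    return [
        list(waveform[:length])
        for waveform, length in zip(waveforms, waveform_lengths)
    ]


# Toy batch of two padded waveforms with 3 and 2 valid samples.
batch = [[0.1, 0.2, 0.3, 0.0], [0.4, 0.5, 0.0, 0.0]]
lengths = [3, 2]
print(trim_waveforms(batch, lengths))  # [[0.1, 0.2, 0.3], [0.4, 0.5]]
```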
Generates translated token ids and/or translated audio waveforms.
This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.
For example, calling .generate(input_ids=input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
SeamlessM4Tv2ForTextToSpeech
class transformers.SeamlessM4Tv2ForTextToSpeech
< source >( config: SeamlessM4Tv2Config )
Parameters
- config (SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The text-to-speech SeamlessM4Tv2 Model transformer which can be used for T2ST.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_ids: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 **kwargs  ) → Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor]]
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
- tgt_lang (str, optional) — The language to use as target language for translation.
- speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:
  - Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which will only be passed through the text components.
  - With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.
  This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor]]
- If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.
For example, calling .generate(input_ids, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
SeamlessM4Tv2ForSpeechToSpeech
class transformers.SeamlessM4Tv2ForSpeechToSpeech
< source >( config )
Parameters
- config (SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The speech-to-speech SeamlessM4Tv2 Model transformer which can be used for S2ST.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
generate
< source >( input_features: typing.Optional[torch.Tensor] = None return_intermediate_token_ids: typing.Optional[bool] = None tgt_lang: typing.Optional[str] = None speaker_id: typing.Optional[int] = 0 **kwargs  ) → Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor]]
Parameters
- input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
- return_intermediate_token_ids (bool, optional) — If True, also returns the intermediate generated text and unit tokens. Set to True if you also want to get translated text alongside the audio.
- tgt_lang (str, optional) — The language to use as target language for translation.
- speaker_id (int, optional, defaults to 0) — The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.
- kwargs (optional) — Remaining dictionary of keyword arguments that will be passed to GenerationMixin.generate(). Keyword arguments are of two types:
  - Without a prefix, they will be entered as **kwargs for the generate method of each sub-model, except for decoder_input_ids, which will only be passed through the text components.
  - With a text_ or speech_ prefix, they will be input for the generate method of the text model and speech model respectively. It has the priority over the keywords without a prefix.
  This means you can, for example, specify a generation strategy for one generation but not for the other.
Returns
Union[SeamlessM4Tv2GenerationOutput, tuple[Tensor]]
- If return_intermediate_token_ids, returns SeamlessM4Tv2GenerationOutput.
- If not return_intermediate_token_ids, returns a tuple composed of waveforms of shape (batch_size, sequence_length) and waveform_lengths, which gives the length of each sample.
Generates translated audio waveforms.
This method successively calls the .generate function of two different sub-models. You can specify keyword arguments at two different levels: general arguments that will be passed to both models, or prefixed arguments that will be passed to one of them.
For example, calling .generate(input_features, num_beams=4, speech_do_sample=True) will successively perform beam-search decoding on the text model, and multinomial beam-search sampling on the speech model.
For an overview of generation strategies and code examples, check out the following guide.
SeamlessM4Tv2ForTextToText
class transformers.SeamlessM4Tv2ForTextToText
< source >( config: SeamlessM4Tv2Config )
Parameters
- config (SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The text-to-text SeamlessM4Tv2 Model transformer which can be used for T2TT.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs  ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
- decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
- encoder_outputs (tuple[tuple[torch.FloatTensor]], optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
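The labels convention above (positions set to -100 are ignored by the loss) is commonly used to exclude padding from training. The sketch below shows the idea on plain Python lists. The helper name and the padding id of 0 are assumptions for illustration; with real data the padding id should come from the tokenizer.

```python
def mask_padding_in_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding token ids with the index ignored by the loss."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


# Two target sequences padded to the same length with a hypothetical pad id of 0.
labels = [[5, 17, 3, 0, 0], [8, 3, 0, 0, 0]]
print(mask_padding_in_labels(labels, pad_token_id=0))
# [[5, 17, 3, -100, -100], [8, 3, -100, -100, -100]]
```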
Returns
transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (SeamlessM4Tv2Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The SeamlessM4Tv2ForTextToText forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
generate
< source >( input_ids = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs  ) → ModelOutput or torch.LongTensor
Parameters
- input_ids (torch.Tensor of varying shape depending on the modality, optional) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using SeamlessM4TTokenizer or SeamlessM4TProcessor. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- tgt_lang (str, optional) — The language to use as target language for translation.
- generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
- logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
- stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
- prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
- synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed to avoid deadlocking with FullyShardedDataParallel and DeepSpeed ZeRO Stage 3).
- kwargs (dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True) or a torch.LongTensor. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the following guide.
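As an illustration of the prefix_allowed_tokens_fn parameter described above, here is a minimal sketch of such a callable. The token ids and the ALLOWED_AFTER table are hypothetical and chosen only for the example; a real constraint would be derived from the tokenizer's vocabulary.

```python
from typing import List, Sequence

# Hypothetical token ids, used only for illustration.
EOS_ID = 3
ALLOWED_AFTER = {
    10: [11, 12],  # after token 10, only 11 or 12 may follow
    11: [EOS_ID],
    12: [EOS_ID],
}
DEFAULT_ALLOWED = [10]  # fallback when the last token has no rule


def prefix_allowed_tokens_fn(batch_id: int, input_ids: Sequence[int]) -> List[int]:
    """Return the token ids allowed at the next step for one hypothesis."""
    # During generation a 1-D tensor is passed in; plain lists also work here.
    ids = input_ids.tolist() if hasattr(input_ids, "tolist") else list(input_ids)
    last = ids[-1] if ids else None
    return ALLOWED_AFTER.get(last, DEFAULT_ALLOWED)
```

A function like this would be handed to generate() through the prefix_allowed_tokens_fn keyword argument. batch_id is unused in this sketch, but it lets the constraint differ per input in a batch.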
SeamlessM4Tv2ForSpeechToText
class transformers.SeamlessM4Tv2ForSpeechToText
< source >( config: SeamlessM4Tv2Config )
Parameters
- config (SeamlessM4Tv2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The speech-to-text SeamlessM4Tv2 Model transformer which can be used for S2TT.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_features: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.LongTensor] = None encoder_outputs: typing.Optional[tuple[tuple[torch.FloatTensor]]] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs  ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
-  input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
-  attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
   - 1 for tokens that are not masked,
   - 0 for tokens that are masked.
-  decoder_input_ids (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details. Bart uses the eos_token_id as the starting token for decoder_input_ids generation. If past_key_values is used, optionally only the last decoder_input_ids have to be input (see past_key_values). For translation and summarization training, decoder_input_ids should be provided. If no decoder_input_ids is provided, the model will create this tensor by shifting the input_ids to the right for denoising pre-training following the paper.
-  decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default. If you want to change padding behavior, you should read modeling_bart._prepare_decoder_attention_mask and modify to your needs. See diagram 1 in the paper for more information on the default strategy.
-  encoder_outputs (tuple[tuple[torch.FloatTensor]], optional) — Tuple consists of (last_hidden_state, optional: hidden_states, optional: attentions). last_hidden_state, of shape (batch_size, sequence_length, hidden_size), is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input; see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
-  decoder_inputs_embeds (torch.FloatTensor of shape (batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passing decoder_input_ids you can choose to directly pass an embedded representation. If past_key_values is used, optionally only the last decoder_inputs_embeds have to be input (see past_key_values). This is useful if you want more control over how to convert decoder_input_ids indices into associated vectors than the model’s internal embedding lookup matrix. If decoder_input_ids and decoder_inputs_embeds are both unset, decoder_inputs_embeds takes the value of inputs_embeds.
-  labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
-  use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
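The attention_mask and labels conventions above (1 for real positions, 0 for padding, and -100 for label positions the loss should skip) can be sketched in plain Python. The helper names pad_with_mask and to_labels are hypothetical; in practice the processor returns padded tensors, and for this model the attention mask covers audio frames while the labels are text token ids.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value ignored by the loss, per the `labels` description


def pad_with_mask(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad sequences to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = not masked, 0 = masked
    return padded, mask


def to_labels(padded: List[List[int]], mask: List[List[int]]) -> List[List[int]]:
    """Replace padded positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [tok if keep else IGNORE_INDEX for tok, keep in zip(row, row_mask)]
        for row, row_mask in zip(padded, mask)
    ]


padded, mask = pad_with_mask([[5, 6, 7], [8]], pad_id=0)
labels = to_labels(padded, mask)
```

Using the mask rather than the pad id to place -100 avoids accidentally ignoring a real token that happens to share the pad id.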
Returns
transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (SeamlessM4Tv2Config) and inputs.
- 
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss.
- 
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
past_key_values (EncoderDecoderCache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is an EncoderDecoderCache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- 
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- 
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
- 
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
- 
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- 
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
The SeamlessM4Tv2ForSpeechToText forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText
>>> from datasets import load_dataset
>>> import torch
>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate
>>> checkpoint = ""  # set this to the SeamlessM4T-v2 checkpoint to load
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> model = SeamlessM4Tv2ForSpeechToText.from_pretrained(checkpoint)
>>> # audio file is decoded on the fly
>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_ids = torch.argmax(logits, dim=-1)
>>> # transcribe speech
>>> transcription = processor.batch_decode(predicted_ids)
>>> transcription[0]
...
>>> inputs["labels"] = processor(text=dataset[0]["text"], return_tensors="pt").input_ids
>>> # compute loss
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
...
generate
< source >( input_features = None tgt_lang = None generation_config = None logits_processor = None stopping_criteria = None prefix_allowed_tokens_fn = None synced_gpus = False **kwargs  ) → ModelOutput or torch.LongTensor
Parameters
-  input_features (torch.FloatTensor of shape (batch_size, sequence_length, num_banks)) — Input audio features. This should be returned by the SeamlessM4TFeatureExtractor class or the SeamlessM4TProcessor class. See SeamlessM4TFeatureExtractor.__call__() for details.
-  tgt_lang (str, optional) — The language to use as target language for translation.
-  generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
-  logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
-  stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
-  prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
-  synced_gpus (bool, optional, defaults to False) — Whether to continue running the while loop until max_length (needed to avoid deadlocking with FullyShardedDataParallel and DeepSpeed ZeRO Stage 3).
-  kwargs (dict[str, Any], optional) — Ad hoc parametrization of generation_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True) or a torch.LongTensor. The possible
ModelOutput types are:
Generates sequences of token ids.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model’s default generation configuration. You can override any generation_config by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the following guide.
SeamlessM4Tv2Config
class transformers.SeamlessM4Tv2Config
< source >( vocab_size = 256102 t2u_vocab_size = 10082 char_vocab_size = 10943 hidden_size = 1024 initializer_range = 0.02 layer_norm_eps = 1e-05 use_cache = True max_position_embeddings = 4096 is_encoder_decoder = True encoder_layerdrop = 0.05 decoder_layerdrop = 0.05 activation_function = 'relu' dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.0 scale_embedding = True encoder_layers = 24 encoder_ffn_dim = 8192 encoder_attention_heads = 16 decoder_layers = 24 decoder_ffn_dim = 8192 decoder_attention_heads = 16 decoder_start_token_id = 3 max_new_tokens = 256 pad_token_id = 0 bos_token_id = 2 eos_token_id = 3 speech_encoder_layers = 24 speech_encoder_attention_heads = 16 speech_encoder_intermediate_size = 4096 speech_encoder_hidden_act = 'swish' speech_encoder_dropout = 0.0 add_adapter = True speech_encoder_layerdrop = 0.1 feature_projection_input_dim = 160 adaptor_kernel_size = 8 adaptor_stride = 8 adaptor_dropout = 0.1 num_adapter_layers = 1 position_embeddings_type = 'relative_key' conv_depthwise_kernel_size = 31 left_max_position_embeddings = 64 right_max_position_embeddings = 8 speech_encoder_chunk_size = 20000 speech_encoder_left_chunk_num = 128 t2u_bos_token_id = 0 t2u_pad_token_id = 1 t2u_eos_token_id = 2 t2u_encoder_layers = 6 t2u_encoder_ffn_dim = 8192 t2u_encoder_attention_heads = 16 t2u_decoder_layers = 6 t2u_decoder_ffn_dim = 8192 t2u_decoder_attention_heads = 16 t2u_max_position_embeddings = 4096 t2u_variance_predictor_embed_dim = 1024 t2u_variance_predictor_hidden_dim = 256 t2u_variance_predictor_kernel_size = 3 t2u_variance_pred_dropout = 0.5 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [5, 4, 4, 2, 2] upsample_kernel_sizes = [11, 8, 8, 4, 4] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] leaky_relu_slope = 0.1 unit_hifi_gan_vocab_size = 10000 unit_embed_dim = 1280 lang_embed_dim = 256 spkr_embed_dim = 256 vocoder_num_langs = 36 vocoder_num_spkrs = 200 
variance_predictor_kernel_size = 3 var_pred_dropout = 0.5 vocoder_offset = 4 **kwargs )
Parameters
-  vocab_size (int, optional, defaults to 256102) — Vocabulary size of the text modality of the SeamlessM4Tv2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling ~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForTextToSpeech or ~SeamlessM4Tv2ForTextToText.
-  t2u_vocab_size (int, optional, defaults to 10082) — Unit vocabulary size of the SeamlessM4Tv2 model. Defines the number of different “unit tokens” that can be represented by the inputs_ids passed when calling the Text-To-Units sub-model of ~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech.
-  char_vocab_size (int, optional, defaults to 10943) — Character vocabulary size of the SeamlessM4Tv2 model. Defines the number of different character tokens that can be represented by the char_inputs_ids passed when calling the Text-To-Units sub-model of ~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech.
Parameters shared across sub-models
-  hidden_size (int, optional, defaults to 1024) — Dimensionality of the “intermediate” layers in the architecture.
-  initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
-  layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
-  use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models).
-  max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model’s text encoder and decoder might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-  is_encoder_decoder (bool, optional, defaults to True) — Whether the model is used as an encoder/decoder or not.
-  encoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the encoders. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
-  decoder_layerdrop (float, optional, defaults to 0.05) — The LayerDrop probability for the decoders. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
-  activation_function (str or function, optional, defaults to "relu") — The non-linear activation function (function or string) in the decoder and feed-forward layers. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
-  dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, decoder, and pooler.
-  attention_dropout (float, optional, defaults to 0.1) — The dropout probability for all attention layers.
-  activation_dropout (float, optional, defaults to 0.0) — The dropout probability for all activation layers in the model.
-  scale_embedding (bool, optional, defaults to True) — Whether to scale embeddings by multiplying them by sqrt(hidden_size).
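To make the encoder_layerdrop and decoder_layerdrop parameters above more concrete, here is a minimal sketch of LayerDrop, assuming the common formulation in which each layer is skipped entirely with the given probability during training and always applied at inference. The run_layers helper is hypothetical and uses plain functions in place of Transformer layers; it is not the library's implementation.

```python
import random
from typing import Callable, List, Optional


def run_layers(
    x: float,
    layers: List[Callable[[float], float]],
    layerdrop: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> float:
    """Apply layers in order, skipping each with probability `layerdrop` when training."""
    rng = rng or random.Random()
    for layer in layers:
        # Draw one random number per layer; drop the whole layer if it falls below the threshold.
        if training and rng.random() < layerdrop:
            continue
        x = layer(x)
    return x


# Three toy "layers" that each add 1 to their input.
toy_layers = [lambda v: v + 1.0] * 3
```

With layerdrop=0.05, each layer is skipped in roughly one training step out of twenty, which acts as a regularizer on deep stacks.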
Text encoder and text decoder specific parameters
-  encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text encoder.
-  encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text encoder.
-  encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text encoder.
-  decoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer text decoder.
-  decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text decoder.
-  decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text decoder.
-  decoder_start_token_id (int, optional, defaults to 3) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token. Only applied in the text decoder.
-  max_new_tokens (int, optional, defaults to 256) — The maximum numbers of text tokens to generate, ignoring the number of tokens in the prompt.
-  pad_token_id (int, optional, defaults to 0) — The id of the padding text token. Only applied to the text-decoder model.
-  bos_token_id (int, optional, defaults to 2) — The id of the beginning-of-stream text token. Only applied to the text-decoder model.
-  eos_token_id (int, optional, defaults to 3) — The id of the end-of-stream text token. Only applied to the text-decoder model.
Speech encoder specific parameters
-  speech_encoder_layers (int, optional, defaults to 24) — Number of hidden layers in the Transformer speech encoder.
-  speech_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer speech encoder.
-  speech_encoder_intermediate_size (int, optional, defaults to 4096) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer speech encoder.
-  speech_encoder_hidden_act (str or function, optional, defaults to "swish") — The non-linear activation function (function or string) in the speech encoder. If string, "gelu", "relu", "selu", "swish" and "gelu_new" are supported.
-  speech_encoder_dropout (float, optional, defaults to 0.0) — The dropout probability for all layers in the speech encoder.
-  add_adapter (bool, optional, defaults to True) — Add an adapter layer on top of the speech encoder.
-  speech_encoder_layerdrop (float, optional, defaults to 0.1) — The LayerDrop probability for the speech encoder. See the LayerDrop paper (https://huggingface.co/papers/1909.11556) for more details.
-  feature_projection_input_dim (int, optional, defaults to 160) — Input dimension of the input feature projection of the speech encoder, i.e. the dimension after processing input audio with SeamlessM4TFeatureExtractor.
-  adaptor_kernel_size (int, optional, defaults to 8) — Kernel size of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
-  adaptor_stride (int, optional, defaults to 8) — Stride of the convolutional layers in the adapter network. Only relevant if add_adapter is True.
-  adaptor_dropout (float, optional, defaults to 0.1) — The dropout probability for all layers in the speech adapter.
-  num_adapter_layers (int, optional, defaults to 1) — Number of convolutional layers that should be used in the adapter network. Only relevant if add_adapter is True.
-  position_embeddings_type (str, optional, defaults to "relative_key") — Can be set to relative_key. If left to None, no relative position embedding is applied. Only applied to the speech encoder. For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.).
-  conv_depthwise_kernel_size (int, optional, defaults to 31) — Kernel size of convolutional depthwise 1D layer in Conformer blocks. Only applied to the speech encoder.
-  left_max_position_embeddings (int, optional, defaults to 64) — The left clipping value for relative positions.
-  right_max_position_embeddings (int, optional, defaults to 8) — The right clipping value for relative positions.
-  speech_encoder_chunk_size (int, optional, defaults to 20000) — The size of each attention chunk.
-  speech_encoder_left_chunk_num (int, optional, defaults to 128) — Number of chunks on the left up to which lookahead is allowed.
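The left_max_position_embeddings and right_max_position_embeddings parameters above describe clipping of relative positions. The following is a minimal sketch of that idea, assuming Shaw-style clipping where the distance is the key index minus the query index, clipped to [-left_max, right_max] and shifted to a non-negative embedding index. The function name and the index shift are assumptions for illustration, not a description of the library's code.

```python
from typing import List


def clipped_relative_positions(
    query_len: int,
    key_len: int,
    left_max: int = 64,
    right_max: int = 8,
) -> List[List[int]]:
    """Return a table of embedding indices for each (query, key) pair."""
    table = []
    for i in range(query_len):
        row = []
        for j in range(key_len):
            distance = j - i  # negative: key is to the left of the query
            distance = max(-left_max, min(right_max, distance))
            row.append(distance + left_max)  # shift into [0, left_max + right_max]
        table.append(row)
    return table


small = clipped_relative_positions(3, 3, left_max=1, right_max=1)
large = clipped_relative_positions(100, 100)
```

Under these assumptions the defaults give 64 + 8 + 1 = 73 distinct relative positions, so keys far to the left or right of a query share the embedding of the nearest clipping boundary.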
Text-To-Unit (t2u) model specific parameters
-  t2u_bos_token_id (int, optional, defaults to 0) — The id of the beginning-of-stream unit token. Only applied to the text-to-unit seq2seq model.
-  t2u_pad_token_id (int, optional, defaults to 1) — The id of the padding unit token. Only applied to the text-to-unit seq2seq model.
-  t2u_eos_token_id (int, optional, defaults to 2) — The id of the end-of-stream unit token. Only applied to the text-to-unit seq2seq model.
-  t2u_encoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit encoder.
-  t2u_encoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit encoder.
-  t2u_encoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit encoder.
-  t2u_decoder_layers (int, optional, defaults to 6) — Number of hidden layers in the Transformer text-to-unit decoder.
-  t2u_decoder_ffn_dim (int, optional, defaults to 8192) — Dimension of the “intermediate” (i.e., feed-forward) layer in the Transformer text-to-unit decoder.
-  t2u_decoder_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the Transformer text-to-unit decoder.
-  t2u_max_position_embeddings (int, optional, defaults to 4096) — The maximum sequence length that this model text-to-unit component might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
-  t2u_variance_predictor_embed_dim (int, optional, defaults to 1024) — The projection dimension of the text-to-unit’s duration predictor.
-  t2u_variance_predictor_hidden_dim (int, optional, defaults to 256) — Internal dimension of the text-to-unit’s duration predictor.
-  t2u_variance_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the convolutional layers of the text-to-unit’s duration predictor.
-  t2u_variance_pred_dropout (float, optional, defaults to 0.5) — The dropout probability of the text-to-unit’s duration predictor.
Hifi-Gan Vocoder specific parameters
-  sampling_rate (int, optional, defaults to 16000) — The sampling rate at which the output audio will be generated, expressed in hertz (Hz).
-  upsample_initial_channel (int, optional, defaults to 512) — The number of input channels into the hifi-gan upsampling network. Applies to the vocoder only.
-  upsample_rates (tuple[int] or list[int], optional, defaults to [5, 4, 4, 2, 2]) — A tuple of integers defining the stride of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_rates defines the number of convolutional layers and has to match the length of upsample_kernel_sizes. Applies to the vocoder only.
-  upsample_kernel_sizes (tuple[int] or list[int], optional, defaults to [11, 8, 8, 4, 4]) — A tuple of integers defining the kernel size of each 1D convolutional layer in the vocoder upsampling network. The length of upsample_kernel_sizes defines the number of convolutional layers and has to match the length of upsample_rates. Applies to the vocoder only.
-  resblock_kernel_sizes (tuple[int] or list[int], optional, defaults to [3, 7, 11]) — A tuple of integers defining the kernel sizes of the vocoder 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
-  resblock_dilation_sizes (tuple[tuple[int]] or list[list[int]], optional, defaults to [[1, 3, 5], [1, 3, 5], [1, 3, 5]]) — A nested tuple of integers defining the dilation rates of the vocoder dilated 1D convolutional layers in the multi-receptive field fusion (MRF) module. Applies to the vocoder only.
-  leaky_relu_slope (float, optional, defaults to 0.1) — The angle of the negative slope used by the leaky ReLU activation in the vocoder. Applies to the vocoder only.
-  unit_hifi_gan_vocab_size (int, optional, defaults to 10000) — Vocabulary size of the SeamlessM4Tv2 vocoder. Defines the number of different unit tokens that can be represented by the inputs_ids passed when calling the vocoder of ~SeamlessM4Tv2Model, ~SeamlessM4Tv2ForSpeechToSpeech or ~SeamlessM4Tv2ForTextToSpeech.
-  unit_embed_dim (int, optional, defaults to 1280) — The projection dimension of the input ids given to the hifi-gan vocoder. Applies to the vocoder only.
-  lang_embed_dim (int, optional, defaults to 256) — The projection dimension of the target language given to the hifi-gan vocoder. Applies to the vocoder only.
-  spkr_embed_dim (int, optional, defaults to 256) — The projection dimension of the speaker id given to the hifi-gan vocoder. Applies to the vocoder only.
-  vocoder_num_langs (int, optional, defaults to 36) — Number of langs supported by the vocoder. Might be different from t2u_num_langs.
-  vocoder_num_spkrs (int, optional, defaults to 200) — Number of speakers supported by the vocoder.
-  variance_predictor_kernel_size (int, optional, defaults to 3) — Kernel size of the duration predictor. Applies to the vocoder only.
-  var_pred_dropout (float, optional, defaults to 0.5) — The dropout probability of the duration predictor. Applies to the vocoder only.
-  vocoder_offset (int, optional, defaults to 4) — Offset the unit token ids by this number to account for symbol tokens. Applies to the vocoder only.
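As a worked example of how upsample_rates relates to sampling_rate, the sketch below computes the overall upsampling factor of the vocoder for the default values. It assumes each upsampling layer multiplies the sequence length by its stride and ignores edge effects; the vocoder_hop helper is hypothetical.

```python
import math
from typing import Sequence, Tuple


def vocoder_hop(
    upsample_rates: Sequence[int],
    upsample_kernel_sizes: Sequence[int],
    sampling_rate: int,
) -> Tuple[int, float]:
    """Return (output samples per input frame, frame duration in milliseconds)."""
    # The two lists describe the same layers, so their lengths have to match.
    if len(upsample_rates) != len(upsample_kernel_sizes):
        raise ValueError("upsample_rates and upsample_kernel_sizes must have the same length")
    samples_per_frame = math.prod(upsample_rates)
    frame_ms = 1000 * samples_per_frame / sampling_rate
    return samples_per_frame, frame_ms


hop = vocoder_hop([5, 4, 4, 2, 2], [11, 8, 8, 4, 4], 16000)
```

With the defaults, 5 × 4 × 4 × 2 × 2 = 320 output samples per input frame, which is 20 ms of audio at 16000 Hz.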
This is the configuration class to store the configuration of a ~SeamlessM4Tv2Model. It is used to instantiate a SeamlessM4Tv2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the SeamlessM4Tv2 "" architecture.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
>>> from transformers import SeamlessM4Tv2Model, SeamlessM4Tv2Config
>>> # Initializing a SeamlessM4Tv2 "" style configuration
>>> configuration = SeamlessM4Tv2Config()
>>> # Initializing a model from the "" style configuration
>>> model = SeamlessM4Tv2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config