Transformers documentation
Encoder Decoder Models
This model was released on 2017-06-12 and added to Hugging Face Transformers on 2020-11-16.
Encoder Decoder Models
EncoderDecoderModel initializes a sequence-to-sequence model with any pretrained autoencoder and pretrained autoregressive model. It is effective for sequence generation tasks as demonstrated in Text Summarization with Pretrained Encoders which uses BertModel as the encoder and decoder.
This model was contributed by thomwolf.
Click on the Encoder Decoder models in the right sidebar for more examples of how to apply Encoder Decoder to different language tasks.
The example below demonstrates how to generate text with Pipeline, AutoModel, and from the command line.
from transformers import pipeline
summarizer = pipeline(
    "summarization",
    model="patrickvonplaten/bert2bert-cnn_dailymail-fp16",
    device=0
)
text = "Plants create energy through a process known as photosynthesis. This involves capturing sunlight and converting carbon dioxide and water into glucose and oxygen."
print(summarizer(text))Notes
- EncoderDecoderModel can be initialized using any pretrained encoder and decoder. But depending on the decoder architecture, the cross-attention layers may be randomly initialized.
These models require downstream fine-tuning, as discussed in this blog post. Use from_encoder_decoder_pretrained() to combine encoder and decoder checkpoints.
from transformers import EncoderDecoderModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "google-bert/bert-base-uncased",
    "google-bert/bert-base-uncased"
)- Encoder Decoder models can be fine-tuned like BART, T5 or any other encoder-decoder model. Only 2 inputs are required to compute a loss, input_idsandlabels. Refer to this notebook for a more detailed training example.
>>> from transformers import BertTokenizer, EncoderDecoderModel
>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "google-bert/bert-base-uncased")
>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id
>>> input_ids = tokenizer(
...     "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side.During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was  finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft).Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.",
...     return_tensors="pt",
... ).input_ids
>>> labels = tokenizer(
...     "the eiffel tower surpassed the washington monument to become the tallest structure in the world. it was the first structure to reach a height of 300 metres in paris in 1930. it is now taller than the chrysler building by 5. 2 metres ( 17 ft ) and is the second tallest free - standing structure in paris.",
...     return_tensors="pt",
... ).input_ids
>>> # the forward function automatically creates the correct decoder_input_ids
>>> loss = model(input_ids=input_ids, labels=labels).loss- EncoderDecoderModel can be randomly initialized from an encoder and a decoder config as shown below.
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
>>> config_encoder = BertConfig()
>>> config_decoder = BertConfig()
>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> model = EncoderDecoderModel(config=config)- The Encoder Decoder Model can also be used for translation as shown below.
from transformers import AutoTokenizer, EncoderDecoderModel
# Load a pre-trained translation model
model_name = "google/bert2bert_L-24_wmt_en_de"
tokenizer = AutoTokenizer.from_pretrained(model_name, pad_token="<pad>", eos_token="</s>", bos_token="<s>")
model = EncoderDecoderModel.from_pretrained(model_name)
# Input sentence to translate
input_text = "Plants create energy through a process known as"
# Encode the input text
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).input_ids
# Generate the translated output
outputs = model.generate(inputs)[0]
# Decode the output tokens to get the translated sentence
translated_text = tokenizer.decode(outputs, skip_special_tokens=True)
print("Translated text:", translated_text)EncoderDecoderConfig
class transformers.EncoderDecoderConfig
< source >( **kwargs )
Parameters
-  kwargs (optional) —
Dictionary of keyword arguments. Notably:
- encoder (PretrainedConfig, optional) — An instance of a configuration object that defines the encoder config.
- decoder (PretrainedConfig, optional) — An instance of a configuration object that defines the decoder config.
 
EncoderDecoderConfig is the configuration class to store the configuration of a EncoderDecoderModel. It is used to instantiate an Encoder Decoder model according to the specified arguments, defining the encoder and decoder configs.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Examples:
>>> from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel
>>> # Initializing a BERT google-bert/bert-base-uncased style configuration
>>> config_encoder = BertConfig()
>>> config_decoder = BertConfig()
>>> config = EncoderDecoderConfig.from_encoder_decoder_configs(config_encoder, config_decoder)
>>> # Initializing a Bert2Bert model (with random weights) from the google-bert/bert-base-uncased style configurations
>>> model = EncoderDecoderModel(config=config)
>>> # Accessing the model configuration
>>> config_encoder = model.config.encoder
>>> config_decoder = model.config.decoder
>>> # set decoder config to causal lm
>>> config_decoder.is_decoder = True
>>> config_decoder.add_cross_attention = True
>>> # Saving the model, including its configuration
>>> model.save_pretrained("my-model")
>>> # loading model and config from pretrained folder
>>> encoder_decoder_config = EncoderDecoderConfig.from_pretrained("my-model")
>>> model = EncoderDecoderModel.from_pretrained("my-model", config=encoder_decoder_config)from_encoder_decoder_configs
< source >( encoder_config: PretrainedConfig decoder_config: PretrainedConfig **kwargs ) → EncoderDecoderConfig
Instantiate a EncoderDecoderConfig (or a derived class) from a pre-trained encoder model configuration and decoder model configuration.
EncoderDecoderModel
class transformers.EncoderDecoderModel
< source >( config: typing.Optional[transformers.configuration_utils.PretrainedConfig] = None encoder: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None decoder: typing.Optional[transformers.modeling_utils.PreTrainedModel] = None )
Parameters
- config (PretrainedConfig, optional) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
-  encoder (PreTrainedModel, optional) — The encoder model to use.
-  decoder (PreTrainedModel, optional) — The decoder model to use.
The bare Encoder Decoder Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.FloatTensor] = None decoder_input_ids: typing.Optional[torch.LongTensor] = None decoder_attention_mask: typing.Optional[torch.BoolTensor] = None encoder_outputs: typing.Optional[tuple[torch.FloatTensor]] = None past_key_values: typing.Optional[transformers.cache_utils.Cache] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None decoder_inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None **kwargs  ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  attention_mask (torch.FloatTensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 
-  decoder_input_ids (torch.LongTensorof shape(batch_size, target_sequence_length), optional) — Indices of decoder input sequence tokens in the vocabulary.Indices can be obtained using PreTrainedTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastdecoder_input_idshave to be input (seepast_key_values).For training, decoder_input_idsare automatically created by the model by shifting thelabelsto the right, replacing -100 by thepad_token_idand prepending them with thedecoder_start_token_id.
-  decoder_attention_mask (torch.BoolTensorof shape(batch_size, target_sequence_length), optional) — Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids. Causal mask will also be used by default.
-  encoder_outputs (tuple[torch.FloatTensor], optional) — Tuple consists of (last_hidden_state, optional:hidden_states, optional:attentions)last_hidden_stateof shape(batch_size, sequence_length, hidden_size), optional) is a sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.
-  past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Only Cache instance is allowed as input, see our kv cache guide. If no past_key_valuesare passed, DynamicCache will be initialized by default.The model will output the same cache format that is fed as input. If past_key_valuesare used, the user is expected to input only unprocessedinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, unprocessed_length)instead of allinput_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  decoder_inputs_embeds (torch.FloatTensorof shape(batch_size, target_sequence_length, hidden_size), optional) — Optionally, instead of passingdecoder_input_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertdecoder_input_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  labels (torch.LongTensorof shape(batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss for the decoder. Indices should be in[-100, 0, ..., config.vocab_size](seeinput_idsdocstring) Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size]
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (EncoderDecoderConfig) and inputs.
- 
loss ( torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) — Language modeling loss.
- 
logits ( torch.FloatTensorof shape(batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
past_key_values ( EncoderDecoderCache, optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) — It is a EncoderDecoderCache instance. For more details, see our kv cache guide.Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.
- 
decoder_hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the decoder at the output of each layer plus the initial embedding outputs. 
- 
decoder_attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads. 
- 
cross_attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads. 
- 
encoder_last_hidden_state ( torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder of the model.
- 
encoder_hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) — Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the encoder at the output of each layer plus the initial embedding outputs. 
- 
encoder_attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) — Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads. 
The EncoderDecoderModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the
Moduleinstance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import EncoderDecoderModel, BertTokenizer
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained(
...     "google-bert/bert-base-uncased", "google-bert/bert-base-uncased"
... )  # initialize Bert2Bert from pre-trained checkpoints
>>> # training
>>> model.config.decoder_start_token_id = tokenizer.cls_token_id
>>> model.config.pad_token_id = tokenizer.pad_token_id
>>> model.config.vocab_size = model.config.decoder.vocab_size
>>> input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
>>> labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> loss, logits = outputs.loss, outputs.logits
>>> # save and load from pretrained
>>> model.save_pretrained("bert2bert")
>>> model = EncoderDecoderModel.from_pretrained("bert2bert")
>>> # generation
>>> generated = model.generate(input_ids)from_encoder_decoder_pretrained
< source >( encoder_pretrained_model_name_or_path: typing.Optional[str] = None decoder_pretrained_model_name_or_path: typing.Optional[str] = None *model_args **kwargs )
Parameters
-  encoder_pretrained_model_name_or_path (str, optional) — Information necessary to initiate the encoder. Can be either:- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
- A path to a directory containing model weights saved using
save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g, ./tf_model/model.ckpt.index). In this case,from_tfshould be set toTrueand a configuration object should be provided asconfigargument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
 
-  decoder_pretrained_model_name_or_path (str, optional, defaults toNone) — Information necessary to initiate the decoder. Can be either:- A string, the model id of a pretrained model hosted inside a model repo on huggingface.co.
- A path to a directory containing model weights saved using
save_pretrained(), e.g., ./my_model_directory/.
- A path or url to a tensorflow index checkpoint file (e.g, ./tf_model/model.ckpt.index). In this case,from_tfshould be set toTrueand a configuration object should be provided asconfigargument. This loading path is slower than converting the TensorFlow checkpoint in a PyTorch model using the provided conversion scripts and loading the PyTorch model afterwards.
 
-  model_args (remaining positional arguments, optional) —
All remaining positional arguments will be passed to the underlying model’s __init__method.
-  kwargs (remaining dictionary of keyword arguments, optional) —
Can be used to update the configuration object (after it being loaded) and initiate the model (e.g.,
output_attentions=True).- To update the encoder configuration, use the prefix encoder_ for each configuration parameter.
- To update the decoder configuration, use the prefix decoder_ for each configuration parameter.
- To update the parent model configuration, do not use a prefix for each configuration parameter.
 Behaves differently depending on whether a configis provided or automatically loaded.
Instantiate an encoder and a decoder from one or two base classes of the library from pretrained model checkpoints.
The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train
the model, you need to first set it back in training mode with model.train().
Example:
>>> from transformers import EncoderDecoderModel
>>> # initialize a bert2bert from two pretrained BERT models. Note that the cross-attention layers will be randomly initialized
>>> model = EncoderDecoderModel.from_encoder_decoder_pretrained("google-bert/bert-base-uncased", "google-bert/bert-base-uncased")
>>> # saving model after fine-tuning
>>> model.save_pretrained("./bert2bert")
>>> # load fine-tuned model
>>> model = EncoderDecoderModel.from_pretrained("./bert2bert")