Fix code example
README.md CHANGED
@@ -1,5 +1,4 @@
 ---
-language: protein
 tags:
 - protein language model
 datasets:

@@ -38,29 +37,29 @@ We have also noticed that for feature extraction, its better to use the feature
 Here is how to use this model to extract the features of a given protein sequence in PyTorch:
 
 ```python
-from transformers import T5Tokenizer, T5Model
-import re
-import torch
+sequence_examples = ["PRTEINO", "SEQWENCE"]
+# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
+sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]
 
-tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
+# tokenize sequences and pad up to the longest sequence in the batch
+ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
+input_ids = torch.tensor(ids['input_ids']).to(device)
+attention_mask = torch.tensor(ids['attention_mask']).to(device)
 
-model = T5Model.from_pretrained("Rostlab/prot_t5_xl_uniref50")
-
-sequences_Example = ["A E T C Z A O","S K T Z P"]
-
-sequences_Example = [re.sub(r"[UZOB]", "X", sequence) for sequence in sequences_Example]
-
-ids = tokenizer.batch_encode_plus(sequences_Example, add_special_tokens=True, padding=True)
+# generate embeddings
+with torch.no_grad():
+    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)
 
-input_ids = torch.tensor(ids['input_ids'])
-attention_mask = torch.tensor(ids['attention_mask'])
+# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
+emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
+print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
+# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
+emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)
 
-with torch.no_grad():
-    embedding = model(input_ids=input_ids, decoder_input_ids=input_ids)
+# if you want to derive a single representation (per-protein embedding) for the whole protein
+emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
 
-# For feature extraction we recommend to use the encoder embedding
-encoder_embedding = embedding[2].cpu().numpy()
-decoder_embedding = embedding[0].cpu().numpy()
+print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
 ```
 
 ## Training data
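
As committed, the fixed example still relies on `re`, `torch`, `tokenizer`, `model`, and `device` being defined earlier in the README, outside this hunk. Below is a minimal self-contained sketch of the same example; the imports, the `device` setup, and the `Rostlab/prot_t5_xl_uniref50` checkpoint name are assumptions not shown in the hunk, and `T5EncoderModel` is inferred from the forward pass (no decoder inputs, embeddings read from `last_hidden_state`).

```python
# Self-contained sketch of the fixed example. The setup below is an assumption:
# this hunk shows neither the imports nor the tokenizer/model loading.
from transformers import T5Tokenizer, T5EncoderModel
import re
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# assumed checkpoint name; substitute the ProtT5 model this README describes
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_uniref50', do_lower_case=False)
model = T5EncoderModel.from_pretrained('Rostlab/prot_t5_xl_uniref50').to(device)
model.eval()

sequence_examples = ["PRTEINO", "SEQWENCE"]
# replace rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# per-residue embeddings: 7 residues of "PRTEINO", 8 of "SEQWENCE";
# the slices drop the padded positions and the trailing </s> special token
emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7, 1024)
emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8, 1024)

# per-protein embedding: mean over the residue dimension
emb_0_per_protein = emb_0.mean(dim=0)  # shape (1024,)
```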
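
The `[0,:7]` and `[1,:8]` slices hard-code the lengths of the two demo sequences (7 and 8 residues; the T5 tokenizer appends a single `</s>` token, which the slice drops along with padding). For arbitrary batches, the true lengths can instead be read off the attention mask; a sketch reusing the names from the example above:

```python
# Per-residue and per-protein embeddings for every sequence in the batch,
# without hard-coded lengths. attention_mask[i].sum() counts real tokens
# (residues + 1 trailing </s>), so subtracting 1 leaves just the residues.
per_protein_embeddings = []
for i in range(input_ids.shape[0]):
    seq_len = int(attention_mask[i].sum()) - 1
    per_residue = embedding_repr.last_hidden_state[i, :seq_len]  # (seq_len, 1024)
    per_protein_embeddings.append(per_residue.mean(dim=0))       # (1024,)
```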
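
The removed example converted its outputs with `.cpu().numpy()`; the new one leaves them as torch tensors. If NumPy arrays are needed downstream, the same conversion still applies (no `.detach()` needed, since the forward pass ran under `torch.no_grad()`):

```python
# move to CPU and convert to NumPy, e.g. for scikit-learn or on-disk storage
emb_0_np = emb_0.cpu().numpy()                          # (7, 1024) array
emb_0_per_protein_np = emb_0_per_protein.cpu().numpy()  # (1024,) array
```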
|