tomatotomato-gLM2-150M-v0.1
TomatoTomato (pronounced "to-may-to, to-mah-to") is a finetune of gLM2_150M exploring a new technique for tokenizing pangenomes. Tokenizing two genomes at once means that if one aligned sequence is AAAA and the other is ACGT, we output four tokens capturing, position by position, where the genomes agree and where they vary. The base model is TattaBio's gLM2, whose vocabulary this finetune repurposes and extends.
The training data is one sequence representing the differences between two tomato genomes:
- Heinz 1706, NCBI's sequence GCF_000188115.5
- LA2093, a wild relative (Solanum pimpinellifolium) from the Sol Genomics Network
Using minimap2, I found ~20,000 aligned sequences between the two genomes.
After additional processing, matching nucleotides are passed along as lowercase letters (a, c, g, t, n), as in the original gLM2. Conflicting nucleotides, or bases present in one genome and absent in the other, are represented by new variance tokens, which v0.1 maps onto uppercase tokens already in gLM2's vocabulary. To convert a MAF file to a single sequence, see this Colab notebook.
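As a rough illustration of this encoding (not the actual conversion script), merging two aligned sequences into one token stream could look like the sketch below; the choice of uppercase letters for each mismatch (IUPAC-style ambiguity codes plus "D" for a gap) is a hypothetical assignment, not necessarily the mapping used in v0.1.

```python
# Illustrative sketch only: matches become lowercase bases, mismatches and gaps
# become uppercase "variance" characters. The uppercase assignments below are
# hypothetical, not the exact mapping used in v0.1.
PAIR_TO_TOKEN = {
    frozenset("ac"): "M", frozenset("ag"): "R", frozenset("at"): "W",
    frozenset("cg"): "S", frozenset("ct"): "Y", frozenset("gt"): "K",
}

def merge_alignment(seq_a: str, seq_b: str) -> str:
    """Collapse two gapped, equal-length aligned sequences into a single string."""
    out = []
    for a, b in zip(seq_a.lower(), seq_b.lower()):
        if a == b:
            out.append(a)                                  # genomes agree: keep the lowercase base
        elif "-" in (a, b):
            out.append("D")                                # present in one genome, absent in the other
        else:
            out.append(PAIR_TO_TOKEN[frozenset((a, b))])   # substitution between the two genomes
    return "".join(out)

print(merge_alignment("AAAA", "ACGT"))  # -> "aMRW"
```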
Masked token example
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/tomatotomato-gLM2-150M-v0.1", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("monsoon-nlp/tomatotomato-gLM2-150M-v0.1", trust_remote_code=True)

def replace_mask(text):
    inputs = tokenizer(text, return_tensors="pt")
    # drop token_type_ids, which this model does not use
    del inputs["token_type_ids"]
    logits = model(**inputs).logits
    # locate the <mask> position and take the three highest-scoring replacement tokens
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    mask_token_logits = logits[0, mask_token_index, :]
    top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()
    for token in top_3_tokens:
        print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

# lowercase letters are matching nucleotides; uppercase letters are variance tokens
replace_mask("gagIE<mask>IR")
```
Caveats in v0.1
Throwable tomatoes here: 🍅🍅🍅
- Training covered about 1/8 of an epoch and took about 1.5 hours... I wanted to test out the concept first
- I should retrain using added tokens instead of repurposing gLM2's existing uppercase amino-acid tokens (a sketch of the added-token approach follows this list)
- I should provide a helper script for working with the variance tokens
- If Evo2's approach continues to be useful, it's not clear that we need to avoid duplication in training data
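A minimal sketch of the added-token approach mentioned above, assuming you start again from the base gLM2 checkpoint and that its custom tokenizer supports the standard `add_tokens()` call; the token strings are placeholders, not a finalized scheme.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Start from the base model and register dedicated variance tokens instead of
# reusing the uppercase amino-acid letters. The token strings here are hypothetical.
tokenizer = AutoTokenizer.from_pretrained("tattabio/gLM2_150M", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("tattabio/gLM2_150M", trust_remote_code=True)

new_tokens = ["<ac>", "<ag>", "<at>", "<cg>", "<ct>", "<gt>", "<ins>", "<del>"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to cover the new tokens
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```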
Why pangenomes and LLMs?
Pangenome research is entering new fields and applications as the cost of sequencing multiple samples goes down. Genome or genomic language models (gLMs) are still controversial in bioinformatics, but they parse DNA in interesting ways that could have future applications. To read more, see the Evo2 announcement and the article What Are We Learning from Plant Pangenomes?.
gLMs already are a form of pangenomics because they are trained on multiple related genomes
There's an assumption that gLMs learn variance by training on multiple genomes, in the same way that LLMs learn from similar sentences in English ("I got my shoes from a store", "I got my shoes at the mall"). But the four nucleotides make for a very small vocabulary. Picture an important passage of 1,000 word-tokens, followed by millions of words, and then a second giant book containing a 1,000-word passage that is identical except for one difference; connecting the two is a lot to ask of the model. It's also not clear what is lost through repetition (some gLMs de-duplicate their training data) or in how that single variant is reflected in the loss.
Pangenomics could help with imbalanced genomic data; for example, if there are 1,000 rice genomes and only a few quinoa genomes in circulation, you can convert them into two pangenomes (one per species) instead of training on 300x as much rice text.
I can picture situations where it's useful to output variance tokens, especially in a case like this where one genome is from a domesticated tomato and the other is from a wild relative: the model can express "there are two ways to handle this position" without complicating decoding.
Future ideas
I'm interested in speculative decoding as in Qwen3-Next, and also in Apple's unexpected interest in protein folding as seen in SimpleFold (a transformer without extra structural processing, which they argue is similar to a text-to-image process).
Adapting LitGPT / Metagene would be a good choice for pretraining from scratch.
I would like to set up an easy Evo2 environment, and to try a population-based loss (a different loss for a variant at a 50–50 population probability than for one at 90–10).
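A sketch of what such a population-based loss could look like, assuming per-position allele frequencies are available; this is an idea for experimentation, not something implemented in v0.1.

```python
import torch
import torch.nn.functional as F

def population_loss(logits: torch.Tensor, allele_freqs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft target of population allele frequencies at one position.

    logits: (vocab_size,) model scores; allele_freqs: (vocab_size,) frequencies summing to 1.
    """
    return -(allele_freqs * F.log_softmax(logits, dim=-1)).sum()

vocab_size = 8
logits = torch.randn(vocab_size)

even_site = torch.zeros(vocab_size)
even_site[2], even_site[3] = 0.5, 0.5      # a 50-50 variant in the population
skewed_site = torch.zeros(vocab_size)
skewed_site[2], skewed_site[3] = 0.9, 0.1  # a 90-10 variant

print(population_loss(logits, even_site), population_loss(logits, skewed_site))
```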
For larger pangenomes, you would probably want a graph from vg giraffe ... I installed this, but was not prepared either to run it or to represent its output as a single sequence for a tokenizer. For more on vg: https://github.com/vgteam/vg/wiki/Basic-Operations