A toy DNA BERT based on ModernBERT

This is a small (5.3 million parameter) DNA language model trained on coding sequences (the parts of DNA that are transcribed to RNA and then translated into proteins) from 13 vertebrate species. The tokenizer operates at the single-base level and includes 20 tokens: the standard bases G, C, T, and A, as well as tokens for missing or uncertain bases. The tokenizer follows the full FASTA file format specification.
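
To make the setup concrete, here is a minimal usage sketch. It assumes the checkpoint is published as a standard Hugging Face masked-LM under MichelNivard/DNABert-CDS-13Species-v0.1 and that the tokenizer exposes a mask token; those details are assumptions, not guarantees from this card.

```python
# Minimal sketch: masked-base prediction with the transformers library.
# The repo id and mask-token handling are assumptions; adjust them to
# the actual checkpoint if they differ.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "MichelNivard/DNABert-CDS-13Species-v0.1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
model.eval()

# A short coding-sequence fragment with one base masked out.
sequence = "ATGGCC" + tokenizer.mask_token + "CTGAA"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Inspect the model's top guesses for the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(4, dim=-1)
for token_id, p in zip(top.indices[0].tolist(), top.values[0].tolist()):
    print(tokenizer.decode([token_id]), f"{p:.3f}")
```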

In initial training on a MacBook over approximately 50 million tokens, the model reached a loss of 1.12, which corresponds to an average per-token probability of exp(-1.12) ≈ 32.6%. The training DNA is overwhelmingly composed of G, C, T, and A, with only a very small proportion of unknown bases (N) and gaps (-). Given near-complete G, C, T, and A content, uniform random guessing would assign each base a probability of about 25%, so an average of 32.6% reflects meaningful learning.
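
For reference, the conversion from the reported cross-entropy loss to the quoted probability, and the comparison with a uniform four-base baseline, is simply the following (assuming, as above, that non-ACGT tokens are negligible):

```python
# Convert the reported cross-entropy loss into an average per-token
# probability and compare it with uniform guessing over G, C, T, A.
import math

loss = 1.12
avg_token_prob = math.exp(-loss)   # ~0.326
uniform_baseline = 1 / 4           # 0.25 when only G, C, T, A occur
print(f"average per-token probability: {avg_token_prob:.3f}")
print(f"uniform-guess baseline:        {uniform_baseline:.3f}")
```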

However, it's important to note that simple biological rules constrain vertebrate coding sequences. For example, the first three bases of any coding sequence in this dataset must form the start codon (ATG), and no in-frame codon after that should be a stop codon until the final codon of the coding region. Furthermore, each species has its own preferences among synonymous codons. In other words, even a simple rule-based script could predict bases in a coding sequence with better than 25% accuracy, as the sketch below illustrates.
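
To show why the floor sits above 25%, here is a hypothetical rule-based guesser that uses only two of the constraints mentioned above: the fixed ATG start codon and the absence of premature in-frame stop codons. It is purely illustrative, not part of the model or its evaluation; the function names and the example sequence are made up.

```python
# A toy rule-based baseline: guess each base of a coding sequence using
# only hard constraints (ATG start, no premature in-frame stop codons).
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def rule_based_guess(cds: str, i: int, rng: random.Random) -> str:
    """Guess base i of a coding sequence, seeing only positions 0..i-1."""
    if i < 3:
        return "ATG"[i]                      # the start codon is fixed
    frame_pos = i % 3
    prefix = cds[i - frame_pos:i]            # bases of the current codon so far
    is_last_codon = i >= len(cds) - 3
    # Rule out any base that would complete a premature in-frame stop codon.
    allowed = [
        b for b in "GCTA"
        if not (len(prefix) == 2 and prefix + b in STOP_CODONS and not is_last_codon)
    ]
    return rng.choice(allowed)

cds = "ATGGCTGCAGATTAA"                      # toy CDS: ATG ... TAA
rng = random.Random(0)
trials = 2000
hits = sum(
    rule_based_guess(cds, i, rng) == cds[i]
    for _ in range(trials)
    for i in range(len(cds))
)
accuracy = hits / (trials * len(cds))
print(f"expected accuracy of the rule-based guesser: {accuracy:.3f}")  # well above 0.25
```

The guaranteed first three bases alone lift the baseline, and ruling out stop-codon completions adds a little more, so the 32.6% figure should be read against this rule-aware floor rather than a flat 25%.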
