A toy DNA BERT based on ModernBERT

This is a small (5.3 million parameter) DNA language model trained on coding sequences (the parts of DNA that are transcribed to RNA and then translated into proteins) from 13 vertebrate species. The tokenizer operates at the single-base level and includes 20 tokens: the standard bases G, C, T, and A, as well as tokens for missing or uncertain bases. The tokenizer follows the full FASTA file format specification.
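
To make the setup concrete, here is a minimal usage sketch. It assumes the checkpoint is published as a standard Hugging Face masked-LM under MichelNivard/DNABert-CDS-13Species-v0.1 and that the tokenizer exposes a mask token; those details are assumptions, not guarantees from this card.

```python
# Minimal sketch: masked-base prediction with the transformers library.
# The repo id and mask-token handling are assumptions; adjust them to
# the actual checkpoint if they differ.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

repo_id = "MichelNivard/DNABert-CDS-13Species-v0.1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
model.eval()

# A short coding-sequence fragment with one base masked out.
sequence = "ATGGCC" + tokenizer.mask_token + "CTGAA"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Inspect the model's top guesses for the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(4, dim=-1)
for token_id, p in zip(top.indices[0].tolist(), top.values[0].tolist()):
    print(tokenizer.decode([token_id]), f"{p:.3f}")
```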

In initial training on a MacBook over approximately 50 million tokens, the model reached a loss of 1.12, which corresponds to an average per-token probability of exp(-1.12) ≈ 32.6%. The training DNA is overwhelmingly composed of G, C, T, and A, with only a very small proportion of unknown bases (N) and gaps (-). Given near-complete G, C, T, and A content, uniform random guessing would assign each base a probability of about 25%, so an average of 32.6% reflects meaningful learning.
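
For reference, the conversion from the reported cross-entropy loss to the quoted probability, and the comparison with a uniform four-base baseline, is simply the following (assuming, as above, that non-ACGT tokens are negligible):

```python
# Convert the reported cross-entropy loss into an average per-token
# probability and compare it with uniform guessing over G, C, T, A.
import math

loss = 1.12
avg_token_prob = math.exp(-loss)   # ~0.326
uniform_baseline = 1 / 4           # 0.25 when only G, C, T, A occur
print(f"average per-token probability: {avg_token_prob:.3f}")
print(f"uniform-guess baseline:        {uniform_baseline:.3f}")
```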

However, it's important to note that simple biological rules constrain vertebrate coding sequences. For example, the first three bases of any coding sequence in this dataset must form the start codon (ATG), and no in-frame codon after that should be a stop codon until the final codon of the coding region. Furthermore, each species has its own preferences among synonymous codons. In other words, even a simple rule-based script could predict bases in a coding sequence with better than 25% accuracy, as the sketch below illustrates.
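
To show why the floor sits above 25%, here is a hypothetical rule-based guesser that uses only two of the constraints mentioned above: the fixed ATG start codon and the absence of premature in-frame stop codons. It is purely illustrative, not part of the model or its evaluation; the function names and the example sequence are made up.

```python
# A toy rule-based baseline: guess each base of a coding sequence using
# only hard constraints (ATG start, no premature in-frame stop codons).
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def rule_based_guess(cds: str, i: int, rng: random.Random) -> str:
    """Guess base i of a coding sequence, seeing only positions 0..i-1."""
    if i < 3:
        return "ATG"[i]                      # the start codon is fixed
    frame_pos = i % 3
    prefix = cds[i - frame_pos:i]            # bases of the current codon so far
    is_last_codon = i >= len(cds) - 3
    # Rule out any base that would complete a premature in-frame stop codon.
    allowed = [
        b for b in "GCTA"
        if not (len(prefix) == 2 and prefix + b in STOP_CODONS and not is_last_codon)
    ]
    return rng.choice(allowed)

cds = "ATGGCTGCAGATTAA"                      # toy CDS: ATG ... TAA
rng = random.Random(0)
trials = 2000
hits = sum(
    rule_based_guess(cds, i, rng) == cds[i]
    for _ in range(trials)
    for i in range(len(cds))
)
accuracy = hits / (trials * len(cds))
print(f"expected accuracy of the rule-based guesser: {accuracy:.3f}")  # well above 0.25
```

The guaranteed first three bases alone lift the baseline, and ruling out stop-codon completions adds a little more, so the 32.6% figure should be read against this rule-aware floor rather than a flat 25%.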
