McClain/naive-dna-llama-6mer
Randomly initialized LLaMA model for DNA sequence generation using a custom 6-mer tokenizer (A,C,G,T) with stride 2.
This repository contains:
- A custom tokenizer (`tokenization_naive_dna_kmer.py`) implementing 6-mer tokenization with stride 2
- Tokenizer files with `auto_map` so `AutoTokenizer.from_pretrained(..., trust_remote_code=True)` works
- A randomly initialized `LlamaForCausalLM` config and weights sized per the provided hyperparameters
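The k-mer windowing described above (6-mers over A/C/G/T, advancing 2 bases per step) can be sketched as plain Python. This is an illustrative sketch of the scheme, not the repository's actual `tokenization_naive_dna_kmer.py`; the function name is hypothetical.

```python
from itertools import product

def kmer_tokenize(seq: str, k: int = 6, stride: int = 2) -> list[str]:
    # Slide a window of length k over the sequence, advancing `stride`
    # bases each step, so consecutive 6-mers overlap by 4 bases.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# All 6-mers over the 4-letter DNA alphabet: 4**6 = 4096 base tokens,
# matching the vocabulary size noted below (before the 4 specials).
vocab = ["".join(p) for p in product("ACGT", repeat=6)]
assert len(vocab) == 4096

print(kmer_tokenize("ACGTACGTACGT"))
# windows start at positions 0, 2, 4, 6 -> four tokens
```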
Intended usage:
- Research baseline for reinforcement learning from a completely untrained policy
- Load with `trust_remote_code=True` so the custom tokenizer can be imported
Example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "McClain/naive-dna-llama-6mer"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```
Notes:
- Tokenizer vocabulary size is 4096 6-mers (4^6 over A/C/G/T) plus 4 special tokens (BOS/EOS/PAD/UNK)
- Decoding reconstructs the DNA string by emitting the first 6-mer in full, then appending the last 2 bases (the stride) of each successive 6-mer
- Model is untrained; it is intended to be optimized purely via RL or other post-training methods
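The decoding rule in the notes above can be sketched as follows. This is a minimal illustration of the overlap reconstruction, assuming stride-2 6-mers as described; it is not the repository's actual decoder, and the function name is hypothetical.

```python
def kmer_detokenize(tokens: list[str], stride: int = 2) -> str:
    # The first 6-mer supplies the full prefix; each later 6-mer starts
    # `stride` bases after its predecessor, so only its last `stride`
    # bases are new and need to be appended.
    if not tokens:
        return ""
    out = tokens[0]
    for t in tokens[1:]:
        out += t[-stride:]
    return out

print(kmer_detokenize(["ACGTAC", "GTACGT", "ACGTAC", "GTACGT"]))
# round-trips the stride-2 tokenization of "ACGTACGTACGT"
```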