McClain/naive-dna-llama-6mer

A randomly initialized LLaMA model for DNA sequence generation, paired with a custom 6-mer tokenizer over the DNA alphabet (A, C, G, T) with stride 2.

This repository contains:

  • A custom tokenizer (tokenization_naive_dna_kmer.py) implementing 6-mer tokenization with stride 2
  • Tokenizer files with an auto_map entry so AutoTokenizer.from_pretrained(..., trust_remote_code=True) works
  • A randomly initialized LlamaForCausalLM config and weights sized per the provided hyperparameters
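For context, 6-mer tokenization with stride 2 slides a 6-base window along the sequence, advancing 2 bases per token, so consecutive tokens overlap by 4 bases. A minimal standalone sketch of this idea (illustrative only, not the repository's actual tokenization_naive_dna_kmer.py implementation):

```python
# Sketch of stride-2 6-mer tokenization; illustrative, not the repo's code.
K, STRIDE = 6, 2

def kmerize(seq: str, k: int = K, stride: int = STRIDE) -> list[str]:
    """Split a DNA string into overlapping k-mers with the given stride."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# 4 bases ** 6 positions = 4096 possible 6-mers, matching the stated vocab size.
assert 4 ** K == 4096

print(kmerize("ACGTACGTAC"))  # → ['ACGTAC', 'GTACGT', 'ACGTAC']
```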

Intended usage:

  • Research baseline for reinforcement learning from a completely untrained policy
  • Load with trust_remote_code=True so the custom tokenizer can be imported

Example:

from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "McClain/naive-dna-llama-6mer"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

Notes:

  • Tokenizer vocabulary: 4^6 = 4096 6-mers plus 4 special tokens (BOS/EOS/PAD/UNK)
  • Decoding reconstructs the DNA string by keeping the first 6-mer whole and appending the last 2 bases (the stride) of each subsequent 6-mer
  • The model is untrained; it is intended to be optimized purely via RL or other post-training methods
Model size: ~0.2B parameters (F32, safetensors)
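The decoding rule described in the notes can be sketched as follows (standalone Python, illustrative only, not the repository's actual implementation): the first 6-mer is emitted whole, and each subsequent 6-mer contributes only its final 2 bases, i.e. the stride.

```python
# Sketch of stride-2 6-mer decoding; illustrative, not the repo's code.
STRIDE = 2

def detokenize(kmers: list[str], stride: int = STRIDE) -> str:
    """Rebuild a DNA string: first k-mer whole, then the last `stride` bases of each."""
    if not kmers:
        return ""
    return kmers[0] + "".join(kmer[-stride:] for kmer in kmers[1:])

print(detokenize(["ACGTAC", "GTACGT", "ACGTAC"]))  # → ACGTACGTAC
```

This is the inverse of the stride-2 encoding: each token advances the sequence by exactly 2 bases, so only those 2 new bases need to be appended per token.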