McClain/naive-dna-llama-6mer
Randomly initialized LLaMA model for DNA sequence generation using a custom 6-mer tokenizer (A,C,G,T) with stride 2.
This repository contains:
- A custom tokenizer (`tokenization_naive_dna_kmer.py`) implementing 6-mer tokenization with stride 2
- Tokenizer files with `auto_map` so `AutoTokenizer.from_pretrained(..., trust_remote_code=True)` works
- A randomly initialized `LlamaForCausalLM` config and weights sized per the provided hyperparameters
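The k-mer windowing described above (6-mers over A/C/G/T, advancing 2 bases per step) can be sketched as plain Python. This is an illustrative sketch of the scheme, not the repository's actual `tokenization_naive_dna_kmer.py`; the function name is hypothetical.

```python
from itertools import product

def kmer_tokenize(seq: str, k: int = 6, stride: int = 2) -> list[str]:
    # Slide a window of length k over the sequence, advancing `stride`
    # bases each step, so consecutive 6-mers overlap by 4 bases.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

# All 6-mers over the 4-letter DNA alphabet: 4**6 = 4096 base tokens,
# matching the vocabulary size noted below (before the 4 specials).
vocab = ["".join(p) for p in product("ACGT", repeat=6)]
assert len(vocab) == 4096

print(kmer_tokenize("ACGTACGTACGT"))
# windows start at positions 0, 2, 4, 6 -> four tokens
```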
Intended usage:
- Research baseline for reinforcement learning from a completely untrained policy
- Load with `trust_remote_code=True` so the custom tokenizer can be imported
Example:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "McClain/naive-dna-llama-6mer"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```
Notes:
- Tokenizer vocabulary size is 4096 6-mers (4^6 over A/C/G/T) plus 4 special tokens (BOS/EOS/PAD/UNK)
- Decoding reconstructs the DNA string by emitting the first 6-mer in full, then appending the last 2 bases (the stride) of each successive 6-mer
- Model is untrained; it is intended to be optimized purely via RL or other post-training methods
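The decoding rule in the notes above can be sketched as follows. This is a minimal illustration of the overlap reconstruction, assuming stride-2 6-mers as described; it is not the repository's actual decoder, and the function name is hypothetical.

```python
def kmer_detokenize(tokens: list[str], stride: int = 2) -> str:
    # The first 6-mer supplies the full prefix; each later 6-mer starts
    # `stride` bases after its predecessor, so only its last `stride`
    # bases are new and need to be appended.
    if not tokens:
        return ""
    out = tokens[0]
    for t in tokens[1:]:
        out += t[-stride:]
    return out

print(kmer_detokenize(["ACGTAC", "GTACGT", "ACGTAC", "GTACGT"]))
# round-trips the stride-2 tokenization of "ACGTACGTACGT"
```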