Model Card for eDNABERT-S_16S
Model Details
Model Description
This model is our first step towards ecosystem-level modeling.
We finetuned DNABERT-S using all of our eDNA ASVs from the Australian Marine Parks project. We used 36,346 Berry 16S ASVs collected from more than 6,000 samples for finetuning.
A partner model for Miya 12S data is also available.
- Developed by: OceanOmics team, Philipp Bayer
- Funded by Minderoo Foundation
- Model type: BERT
- Language(s) (NLP): DNA
- License: Apache 2.0
- Finetuned from model: DNABERT-S
Uses
Installation
A conda environment specification is included in this repository as DNABERT_S.yml:

```shell
conda env create -f DNABERT_S.yml
```
Direct Use
I've been using this model to visualise ecosystem embeddings.
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from Bio import SeqIO
from sklearn.manifold import TSNE
import numpy as np
from tqdm import tqdm  # for progress tracking

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

config = AutoConfig.from_pretrained('zhihan1996/DNABERT-S', trust_remote_code=True)

# Load model and tokenizer
tokenizer_16S = AutoTokenizer.from_pretrained('OceanOmics/eDNABERT-S_16S', trust_remote_code=True)
model_16S = AutoModel.from_pretrained('OceanOmics/eDNABERT-S_16S', trust_remote_code=True, config=config)
model_16S.to(device)
model_16S.eval()

names_16, seqs_16 = [], []
for seq in SeqIO.parse('16S_all_ASVs.fasta', 'fasta'):
    if 165 <= len(str(seq.seq)) <= 180:  # drop likely primer dimers and over-long ASVs
        seqs_16.append(str(seq.seq))
        names_16.append(str(seq.id))
print(f"Processing {len(seqs_16)} sequences")

batch_size = 32  # tested on an A100
num_sequences = len(seqs_16)
all_e_16 = np.zeros((num_sequences, 768))

with torch.no_grad():  # no gradient calculation for inference
    for i in tqdm(range(0, num_sequences, batch_size)):
        batch_seqs = seqs_16[i:i+batch_size]
        inputs = tokenizer_16S(batch_seqs, return_tensors='pt', padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # move inputs to device
        hidden_states = model_16S(**inputs)[0]
        for j, hidden_state in enumerate(hidden_states):
            embedding_mean = torch.mean(hidden_state, dim=0)
            all_e_16[i+j] = embedding_mean.cpu().numpy()  # store in pre-allocated array

print("Running TSNE...")
X_embedded = TSNE(
    n_components=2,
    learning_rate='auto',
    init='random',
    perplexity=50,
    n_jobs=-1  # use all available cores
).fit_transform(all_e_16)

print("Saving results...")
with open('odr_all_tsne_16S.optimized.tsv', 'w') as out:
    for a, name in zip(X_embedded, names_16):
        out.write('\t'.join(map(str, list(a) + name.split("XXX"))) + '\n')
```
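Note that mean-pooling over a padded batch also averages in the padding positions. A masked mean avoids that; this is a sketch (the `masked_mean` helper is my own, and it assumes the tokenizer returns an `attention_mask`, which the standard `transformers` tokenizers do):

```python
import torch

def masked_mean(hidden_states, attention_mask):
    """Mean-pool token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (batch, 1), avoid div-by-zero
    return summed / counts
```

In the loop above you would then use `masked_mean(hidden_states, inputs['attention_mask'])` on the whole batch instead of per-sequence `torch.mean`.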
You can see results visualised in action here: https://marine-parks.minderoo.org/#!/unknown
Downstream Use
I'm hoping you'll come up with these! It would be great if we could plug this or similar models into ecosystem-level models such as BioAnalyst.
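As one toy example of working with the embeddings downstream, you can compare ASVs, or whole samples, directly in embedding space. This is a minimal sketch under my own assumptions (the helper names are illustrative, and mean-of-ASV-embeddings is a deliberately naive sample-level representation):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_embedding(asv_embeddings):
    """Naive sample-level representation: mean of its ASV embeddings."""
    return np.mean(asv_embeddings, axis=0)

# Two toy "samples", each a set of 768-dim ASV embeddings
rng = np.random.default_rng(0)
sample_a = rng.normal(size=(5, 768))
sample_b = rng.normal(size=(7, 768))
sim = cosine_similarity(sample_embedding(sample_a), sample_embedding(sample_b))
print(f"sample similarity: {sim:.3f}")
```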
Risks and Out-of-Scope Use
This model is trained using Berry et al. 16S metabarcoding results based on Australian marine samples. The 16S assay is fairly fish-specific, with some other vertebrate hits such as dolphins, so you might not have the best time applying this model to other organisms or ecosystems.
How to Get Started with the Model
Use the code below to get started with the model.
[More Information Needed]
Training Details
Training Data
[More Information Needed]
Training Procedure
Preprocessing
We had many primer dimers in our ASV data, so I removed too-short and too-long ASVs. Check what your ASVs look like before embedding them.
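As a sketch, that length filter might look like the following (the `filter_asvs` helper is hypothetical; the 165-180 bp window matches the inference example above, but you should adjust it to your own assay's expected amplicon length):

```python
def filter_asvs(asvs, min_len=165, max_len=180):
    """Keep only (name, sequence) pairs whose sequence length falls within
    [min_len, max_len], discarding likely primer dimers (too short)
    and other artefacts (too long)."""
    return [(name, seq) for name, seq in asvs if min_len <= len(seq) <= max_len]

# Example: a 60 bp primer dimer and a 300 bp artefact are dropped
asvs = [("asv1", "A" * 170), ("dimer", "A" * 60), ("junk", "A" * 300)]
print(filter_asvs(asvs))  # only asv1 survives
```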
Training Hyperparameters
Using the original DNABERT-S training script:

```shell
python main.py [.. skipping data flags..] \
    --seed 1 \
    --max_length 2000 \
    --train_batch_size 8 \
    --val_batch_size 8 \
    --lr 3e-06 \
    --lr_scale 100 \
    --epochs 3 \
    --feat_dim 128 \
    --temperature 0.05 \
    --con_method same_species \
    --mix \
    --mix_alpha 1.0 \
    --mix_layer_num -1 \
    --curriculum
```
Speeds, Sizes, Times [optional]
[More Information Needed]
Evaluation
Testing Data, Factors & Metrics
Testing Data
[More Information Needed]
Factors
[More Information Needed]
Metrics
[More Information Needed]
Results
[More Information Needed]
Summary
Model Examination [optional]
[More Information Needed]
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: [More Information Needed]
- Hours used: [More Information Needed]
- Cloud Provider: [More Information Needed]
- Compute Region: [More Information Needed]
- Carbon Emitted: [More Information Needed]
Technical Specifications [optional]
Model Architecture and Objective
[More Information Needed]
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed]
Citation [optional]
BibTeX:
[More Information Needed]
APA:
[More Information Needed]
Glossary [optional]
[More Information Needed]
More Information [optional]
[More Information Needed]
Model Card Authors [optional]
[More Information Needed]
Model Card Contact
[More Information Needed]