
Model Card for eDNABERT-S_16S

Model Details

Model Description

This model is our first step towards ecosystem-level modeling.

We finetuned DNABERT-S on all of our eDNA ASVs from the Australian Marine Parks project: 36,346 Berry 16S ASVs collected from more than 6,000 samples.

A partner model for Miya 12S data is also available.

  • Developed by: OceanOmics team, Philipp Bayer
  • Funded by: Minderoo Foundation
  • Model type: BERT
  • Language(s) (NLP): DNA
  • License: Apache 2.0
  • Finetuned from model: DNABERT-S

Uses

Installation

There's a conda environment specification in this repository, DNABERT_S.yml:

conda env create -f DNABERT_S.yml
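
Once created, activate the environment before running the examples below. This assumes the environment defined in DNABERT_S.yml is named DNABERT_S; check the name: field in the yml if that's not the case.

conda activate DNABERT_S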

Direct Use

I've been using this model to visualise ecosystem embeddings.

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from Bio import SeqIO
from sklearn.manifold import TSNE
import numpy as np
from tqdm import tqdm  # For progress tracking

# Device configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

config = AutoConfig.from_pretrained('zhihan1996/DNABERT-S', trust_remote_code=True)

# Load model and tokenizer
tokenizer_16S = AutoTokenizer.from_pretrained('OceanOmics/eDNABERT-S_16S', trust_remote_code=True)
model_16S = AutoModel.from_pretrained('OceanOmics/eDNABERT-S_16S', trust_remote_code=True, config=config)
model_16S.to(device)
model_16S.eval()
names_16, seqs_16 = [], []
for seq in SeqIO.parse('16S_all_ASVs.fasta', 'fasta'):
    if 165 <= len(str(seq.seq)) <= 180:  # keep only ASVs in the expected 16S amplicon length range, dropping primer dimers
        seqs_16.append(str(seq.seq))
        names_16.append(str(seq.id))

print(f"Processing {len(seqs_16)} sequences")

batch_size = 32 # tested on an A100
num_sequences = len(seqs_16)
all_e_16 = np.zeros((num_sequences, 768))

with torch.no_grad():  # no gradient calculation for inference
    for i in tqdm(range(0, num_sequences, batch_size)):
        batch_seqs = seqs_16[i:i+batch_size]

        inputs = tokenizer_16S(batch_seqs, return_tensors='pt', padding=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # Move inputs to device

        hidden_states = model_16S(**inputs)[0]

        for j, hidden_state in enumerate(hidden_states):
            embedding_mean = torch.mean(hidden_state, dim=0)  # mean-pool over tokens; note this simple mean also includes any padding positions
            all_e_16[i+j] = embedding_mean.cpu().numpy()  # Store directly in pre-allocated array

print("Running TSNE...")
X_embedded = TSNE(
    n_components=2,
    learning_rate='auto',
    init='random',
    perplexity=50,  # Reasonable value
    n_jobs=-1  # Use all available cores
).fit_transform(all_e_16)

print("Saving results...")
with open('odr_all_tsne_16S.optimized.tsv', 'w') as out:
    for a, name in zip(X_embedded, names_16):
        out.write('\t'.join(map(str, list(a) + name.split("XXX"))) + '\n')
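
For a quick look at the embedding before building anything interactive, you can scatter-plot the TSV. This is a minimal sketch, not part of the original pipeline; it only assumes matplotlib is installed and reads the first two columns (the t-SNE coordinates) written above.

import matplotlib.pyplot as plt

xs, ys = [], []
with open('odr_all_tsne_16S.optimized.tsv') as fh:
    for line in fh:
        parts = line.rstrip('\n').split('\t')
        xs.append(float(parts[0]))  # t-SNE dimension 1
        ys.append(float(parts[1]))  # t-SNE dimension 2

plt.figure(figsize=(8, 8))
plt.scatter(xs, ys, s=2, alpha=0.5)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.title('eDNABERT-S_16S ASV embeddings')
plt.savefig('tsne_16S.png', dpi=200)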

You can see results visualised in action here: https://marine-parks.minderoo.org/#!/unknown

Downstream Use

I'm hoping you'll come up with these! It would be great if we could plug this or similar models into ecosystem-level models such as BioAnalyst.
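
As one purely illustrative starting point, the sketch below aggregates ASV embeddings into a single vector per sample by taking a read-count-weighted mean. The read-count table asv_counts.tsv (rows = ASV names, columns = samples) is a hypothetical file, and all_e_16 / names_16 are the arrays produced in the Direct Use example above; this is not an established pipeline.

import numpy as np
import pandas as pd

# Hypothetical read-count table: rows are ASV names, columns are sample IDs
counts = pd.read_csv('asv_counts.tsv', sep='\t', index_col=0)
emb = pd.DataFrame(all_e_16, index=names_16)  # per-ASV embeddings from the Direct Use example

sample_embeddings = {}
for sample in counts.columns:
    # Align read counts to the embedding rows; ASVs missing from the table get weight 0
    weights = counts[sample].reindex(emb.index).fillna(0).to_numpy()
    if weights.sum() == 0:
        continue
    # Read-count-weighted mean of ASV embeddings gives one 768-dim vector per sample
    sample_embeddings[sample] = np.average(emb.to_numpy(), axis=0, weights=weights)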

Risks and Out-of-Scope Use

This model was trained on Berry et al. 16S metabarcoding results from Australian marine samples. The 16S assay is fairly fish-specific, with some other vertebrate hits such as dolphins, so you might not have the best time applying this model to other organisms or ecosystems.

How to Get Started with the Model

See the code in the Direct Use section above to get started with the model.

Training Details

Training Data

[More Information Needed]

Training Procedure

Preprocessing

We had many primer dimers in our ASV data, so I removed too-short and too-long ASVs before finetuning. Check what your ASVs look like before doing the same.
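
A quick way to do that check is to tally ASV lengths; primer dimers show up as a cluster of very short sequences. This is a minimal sketch, assuming Biopython and a 16S_all_ASVs.fasta input as in the Direct Use example; the 165-180 bp window matches the filter used there, but pick a window that fits your own length distribution.

from collections import Counter
from Bio import SeqIO

records = list(SeqIO.parse('16S_all_ASVs.fasta', 'fasta'))

# Length histogram: primer dimers appear as a spike of very short ASVs
lengths = Counter(len(rec.seq) for rec in records)
for length in sorted(lengths):
    print(f'{length}\t{lengths[length]}')

# Keep only ASVs in the expected amplicon length range
kept = [rec for rec in records if 165 <= len(rec.seq) <= 180]
SeqIO.write(kept, '16S_filtered_ASVs.fasta', 'fasta')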

Training Hyperparameters

Using the original DNABERT-S training script:

python main.py [.. skipping data flags..] \
  --seed 1 \
  --max_length 2000 \
  --train_batch_size 8 \
  --val_batch_size 8 \
  --lr 3e-06 \
  --lr_scale 100 \
  --epochs 3 \
  --feat_dim 128 \
  --temperature 0.05 \
  --con_method same_species \
  --mix \
  --mix_alpha 1.0 \
  --mix_layer_num -1 \
  --curriculum

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

BERT-style encoder inherited from DNABERT-S, roughly 0.1B parameters stored as F32 Safetensors, finetuned with the contrastive objective of the original DNABERT-S training script (see Training Hyperparameters above).

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]
