---
license: mit
pipeline_tag: tabular-regression
tags:
- chemistry
- microbiology
- antibiotics
library_name: duvida
datasets:
- scbirlab/thomas-2018-spark-wt
---

# Predictor of _Yersinia enterocolitica_ MICs

_Updated:_ Tue 1 Apr 08:02:53 BST 2025

Trained on the _Yersinia enterocolitica_, WT accumulator phenotype subset of the [human-curated SPARK dataset](https://doi.org/10.1021/acsinfecdis.8b00193) (1405 rows in total for _Yersinia enterocolitica_).

## Model details

This model was trained using [our Duvida framework](https://github.com/scbirlab/duvida).
It is the best regression model from a hyperparameter search, selected by Pearson's $$r$$
on a held-out test set (from a scaffold split) that was not used in training or early stopping.

Duvida also saves the training data in this checkpoint to allow the calculation of uncertainty metrics
based on that training data.

### Model architecture

- **Regression**

```json
{
    "dropout": 0.2,
    "ensemble_size": 3,
    "extra_featurizers": null,
    "learning_rate": 0.0001,
    "model_class": "ChempropModelBox",
    "n_hidden": 5,
    "n_units": 8,
    "use_2d": false,
    "use_fp": true
}
```

### Model usage

You can use this model with:

```python
from duvida.autoclasses import AutoModelBox
modelbox = AutoModelBox.from_pretrained("hf://scbirlab/spark-dv-2503-yent")
modelbox.predict(filename=..., inputs=[...], columns=[...])  # make predictions on your own data
```
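
If your compounds live in a Python list rather than a file, one option is to write them to a CSV first. The following is a minimal standard-library sketch, not part of Duvida: `write_smiles_csv` is a hypothetical helper, the file name is made up, and the `smiles` column name is assumed to match the input column this model was trained on. The commented-out `predict` call at the end is an assumption about how the arguments shown above might be filled in, so check the Duvida documentation before relying on it.

```python
import csv
import tempfile
from pathlib import Path


def write_smiles_csv(path, smiles, column="smiles"):
    """Hypothetical helper: write SMILES strings to a one-column CSV file."""
    path = Path(path)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column])  # header row
        writer.writerows([s] for s in smiles)  # one SMILES string per row
    return path


# Illustrative molecules only: ethanol, benzene, and acetic acid.
example_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = write_smiles_csv(Path(tmp_dir) / "my-compounds.csv", example_smiles)
    print(csv_path.read_text())
    # Assumed usage, adapting the call shown above (not verified against Duvida):
    # modelbox.predict(filename=str(csv_path), columns=["smiles"])
```
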

## Training details

- **Dataset:** [SPARK, WT accumulator, _Yersinia enterocolitica_ subset](https://huggingface.co/datasets/scbirlab/thomas-2018-spark-wt) (1405 rows in total for _Yersinia enterocolitica_)
- **Input column:** smiles
- **Output column:** pmic
- **Split type:** Murcko scaffold
- **Split proportions:**
  - 70% training (984 rows)
  - 15% validation (for early stopping) (210 rows)
  - 15% test (for selecting hyperparameters) (211 rows)

Here is the training log:

<img src="training-log.png" width=450>

And these are the evaluation scores.

Train (984 rows):

```json
{
    "Pearson r": 0.9610046864345855,
    "RMSE": 0.24681584537029266,
    "Spearman rho": 0.9570068923333792
}
```

<img src="predictions_train.png" width=450>

Validation (210 rows):

```json
{
    "Pearson r": 0.6104635893879058,
    "RMSE": 0.618925929069519,
    "Spearman rho": 0.6049210503171903
}
```

<img src="predictions_validation.png" width=450>

Test (211 rows):

```json
{
    "Pearson r": 0.6665680561527187,
    "RMSE": 0.607518196105957,
    "Spearman rho": 0.6959985257844904
}
```

<img src="predictions_test.png" width=450>
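
For readers who want to score their own predictions with the same three metrics, here is a minimal pure-Python sketch. It is not Duvida's evaluation code: the function names are hypothetical, the toy numbers are made up for illustration, and Spearman's rho is computed as Pearson's r on average ranks (the usual definition when ties are present).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def evaluate(observed, predicted):
    """Return the metrics under the same keys as the score blocks above."""
    return {
        "Pearson r": pearson_r(observed, predicted),
        "RMSE": rmse(observed, predicted),
        "Spearman rho": spearman_rho(observed, predicted),
    }


# Made-up pMIC values, for illustration only.
observed = [4.2, 5.1, 3.8, 6.0, 4.9]
predicted = [4.0, 5.3, 4.1, 5.6, 4.7]
print(evaluate(observed, predicted))
```
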

## Training data details

The training data were collated by the authors of:

> Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell
> Shared Platform for Antibiotic Research and Knowledge: A Collaborative Tool to SPARK Antibiotic Discovery
> ACS Infectious Diseases 2018 4 (11), 1536-1539
> DOI: 10.1021/acsinfecdis.8b00193

We cleaned the original SPARK dataset to subset the most relevant columns, remove empty values,
give succinct column titles, and split by species.

This particular dataset retains only measurements on bacteria with wild-type accumulation phenotypes.

### Dataset Sources

- **Repository:** https://www.collaborativedrug.com/spark-data-downloads
- **Paper:** https://doi.org/10.1021/acsinfecdis.8b00193

### Data Collection and Processing

Data were processed using [schemist](https://github.com/scbirlab/schemist), a tool for processing chemical datasets.

The SMILES strings have been canonicalized, and split into training (70%), validation (15%), and test (15%) sets
by Murcko scaffold for each species with more than 1000 entries. Additional features like molecular weight and
topological polar surface area have also been calculated.
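
To illustrate what splitting by scaffold means, here is a minimal sketch of one common approach: group rows by scaffold, then assign whole groups greedily (largest first) so that no scaffold appears in more than one set. This is not necessarily the algorithm schemist uses. The function name `scaffold_split` is hypothetical, and the scaffold strings are assumed to be precomputed (for example with RDKit's `MurckoScaffoldSmiles`), so the toy labels below are placeholders rather than real scaffolds.

```python
from collections import defaultdict


def scaffold_split(scaffolds, fractions=(0.70, 0.15, 0.15)):
    """Greedy grouped split: each scaffold lands in exactly one set.

    `scaffolds` holds one scaffold string per row; returns three lists of
    row indices (train, validation, test). Only the first two fractions set
    capacities; anything that does not fit goes to the test set.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by scaffold string for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    n_rows = len(scaffolds)
    train_capacity = fractions[0] * n_rows
    validation_capacity = fractions[1] * n_rows

    train, validation, test = [], [], []
    for _, members in ordered:
        if len(train) + len(members) <= train_capacity:
            train.extend(members)
        elif len(validation) + len(members) <= validation_capacity:
            validation.extend(members)
        else:
            test.extend(members)
    return train, validation, test


# Placeholder scaffold labels, for illustration only.
toy_scaffolds = ["A"] * 6 + ["B"] * 3 + ["C"] * 2 + ["D"]
train_idx, validation_idx, test_idx = scaffold_split(toy_scaffolds)
print(len(train_idx), len(validation_idx), len(test_idx))
```

Because whole groups are assigned at once, the realized proportions only approximate the targets, which is why a grouped split rarely lands on exact percentages.
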

### Who are the source data producers?

Joe Thomas, Marc Navre, Aileen Rubio, and Allan Coukell