🧬 ChemMiniQ3-SAbRL (Synthetic Accessibility with Bioaware RL)

ChemMiniQ3-SAbRL is a lightweight experimental generative model for chemistry, built on a mini Qwen2-like backbone with a multi-horizon predictive loss for SELFIES molecular representations.
It introduces a new reinforcement learning approach, as the next iteration of ChemMiniQ3-HoriFIE, that combines:

  • 🧩 Synthetic Accessibility (SA) Rewards – guiding generation with a classifier (gbyuvd/synthaccess-chemselfies) to favor molecules that are easier to synthesize.
  • 🔄 Cyclical Gradual Generation – a curriculum learning strategy that gradually increases molecule length up to 30 tokens, then resets and repeats, ensuring balanced exploration of short and long structures.

Disclaimer: For Academic Purposes Only

The information and model provided are for academic purposes only. They are intended for educational and research use and should not be used for any commercial or legal purposes. The author does not guarantee the accuracy, completeness, or reliability of the information. Prototype research code – not production-ready. Learning by building.


โš™๏ธ Core Features

  • ✅ Qwen2-like Mini Backbone – Efficient causal LM architecture
  • ✅ Multi-Token Prediction (MTP Head) – Parallel prediction of 1–3 future tokens
  • ✅ Horizon Loss – Weighted multi-horizon objectives for long-term coherence
  • ✅ SELFIES-native Tokenizer – Robust encoding with FastChemTokenizer
  • ✅ Ranger21 Optimizer – Warmup/warmdown scheduling for stable training
  • ✅ Gradient Checkpointing & Streaming Dataset Loader – Lightweight and hardware-friendly
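The weighted multi-horizon objective can be sketched framework-agnostically. The following is a minimal plain-Python illustration; the horizon weights (1.0, 0.5, 0.25) are illustrative assumptions, not the model's actual values:

```python
import math

def softmax(logits):
    # Numerically stable softmax over one logit vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def horizon_loss(logits_per_horizon, targets, weights=(1.0, 0.5, 0.25)):
    """Weighted multi-horizon cross-entropy for one sequence.

    logits_per_horizon: list over horizons h = 1..H; each item is a list of
        per-position logit vectors with shape [seq][vocab]
    targets: list of token ids, length seq
    weights: illustrative decay favoring nearer horizons (assumed here)
    """
    total, wsum = 0.0, 0.0
    for h, (logits, w) in enumerate(zip(logits_per_horizon, weights), start=1):
        # At horizon h, position t predicts the token at t + h.
        losses = []
        for t in range(len(targets) - h):
            probs = softmax(logits[t])
            losses.append(-math.log(probs[targets[t + h]]))
        total += w * (sum(losses) / len(losses))
        wsum += w
    # Normalize so the loss scale is comparable across weight choices.
    return total / wsum
```

With uniform logits over a vocabulary of size V, each horizon's cross-entropy reduces to log V, so the weighted average does too; that makes the sketch easy to sanity-check.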

🧪 Reinforcement Learning Enhancements

1๏ธโƒฃ SA-Guided PPO-KL Fine-Tuning

  • Uses gbyuvd/synthaccess-chemselfies as a reward model
  • Rewards “Easy” synthetic accessibility predictions
  • Penalizes “Hard” molecules
  • Can operate in SA-only mode or be mixed with ChemQ3 multi-property rewards
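A minimal sketch of how this reward shaping might look. The ±score mapping, the mixing weight `alpha`, and the KL coefficient `beta` are illustrative assumptions, not the repository's actual settings:

```python
def sa_reward(label, score):
    """Map the SA classifier's prediction to a scalar reward.

    label: "Easy" or "Hard", as predicted by gbyuvd/synthaccess-chemselfies
    score: classifier confidence in [0, 1]
    """
    return score if label == "Easy" else -score

def mixed_reward(sa, chemq3, alpha=0.5):
    """Blend the SA-only reward with a ChemQ3 multi-property reward.

    alpha is an assumed mixing weight; SA-only mode corresponds to alpha=1.
    """
    return alpha * sa + (1.0 - alpha) * chemq3

def kl_shaped_reward(reward, logp_policy, logp_ref, beta=0.1):
    """PPO-KL style shaping: penalize drift from the reference model.

    The per-token KL estimate (logp_policy - logp_ref) is subtracted from
    the task reward, scaled by an assumed coefficient beta.
    """
    return reward - beta * (logp_policy - logp_ref)
```

In SA-only mode the policy would be updated with `kl_shaped_reward(sa_reward(label, score), ...)`; in mixed mode the ChemQ3 signal is blended in first.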

2๏ธโƒฃ Cyclical Gradual Curriculum

  • Gradually increases max generation length: 10 → 15 → 20 → 25 → 30 tokens
  • After reaching 30, it resets back to 10 and repeats the cycle
  • Ensures diversity of molecule sizes while still converging toward more complex structures
  • Why 30? Because the FastChemTokenizerSelfies core averages 33.41 ± 1.80 tokens per sequence on the ~3M dataset
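The cyclical schedule can be sketched as a simple generator. How many training steps each length persists for is not stated above, so this only shows the length sequence itself:

```python
from itertools import cycle

def cyclical_lengths(start=10, stop=30, step=5):
    """Yield max generation lengths 10 -> 15 -> 20 -> 25 -> 30, then reset to 10."""
    return cycle(range(start, stop + 1, step))

lengths = cyclical_lengths()
schedule = [next(lengths) for _ in range(7)]
# schedule == [10, 15, 20, 25, 30, 10, 15]
```

A training loop would pull the next length from this generator whenever the curriculum advances, keeping short and long molecules in rotation.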

🚀 Why ChemMiniQ3-SAbRL?

  • Prior approaches optimized only validity or physicochemical rules (Lipinski, etc.)
  • Our method explicitly biases generation toward molecules that are not just valid, but also easier to synthesize
  • The cyclical gradual generation keeps training dynamic, avoiding overfitting to a single molecule size or complexity

💡 Target domain: molecular generation (SELFIES).
🔬 Goal: molecules that are valid, bioaware, and synthetically accessible.
🚀 New approach: combining SA-guided rewards + cyclical gradual curriculum for reinforcement learning.


🔮 Planned Experiments & Next Steps

We are actively working on scaling up ChemMiniQ3-SAbRL with more ambitious experiments:

  • 📚 Pretraining on a larger dataset – up to 2.9M SELFIES molecules
  • ⏱ Fine-tuning with extended training steps – to push stability and reward alignment further
  • 🔬 Comparative evaluation – SA-only vs ChemQ3 vs mixed reward modes
  • 🧪 Benchmarking – evaluating validity, novelty, drug-likeness, and synthetic accessibility metrics

โค๏ธ Support the Project

Training and scaling require significant computational resources.
If you'd like to support this research (e.g., helping us rent compute servers for pretraining and finetuning), you can contribute here:

ko-fi

Every bit of support helps us push ChemMiniQ3-SAbRL further! 🚀🧬


To-Do

  • [ongoing] Review, clean, and test training with the existing code
  • Warm up training on 14K dataset for MTP
  • Warm up PPO-RL with only the bioaware reward enabled for 7000 steps
  • [ongoing] Warm up PPO-RL with only the SA reward enabled for 7000 steps
  • [ongoing] Test and observe the stability of mixed rewards for 7000 steps
  • Upload both warm-up MTP and PPO-RL models to HF repo
  • Write demo blocks and a demo Jupyter notebook covering training from scratch and generating with the pretrained model(s)
  • Ablation studies
  • Implement HF Automodel compatible modules if performance benefit(s) confirmed
  • Complete pretraining on all ~3M dataset (when possible)
    • Chunk I
    • Chunk II
    • Chunk III
    • Chunk IV
  • Publish complete pretraining on GitHub and HF (if compatible)
  • Complete RL fine-tuning on the verified reward system.

๐Ÿ“ Notes & Observations

  • During cyclical gradual training, the model appears to stop generating separated salt molecules and ions at seq_len < 30 around step ~4000.
    • This corresponds to ~57 steps for each sequence length in the gradual schedule.
    • Suggests that the curriculum is helping the model bias toward more coherent, non-fragmented molecular outputs.
  • Current state before SA fine-tuning, evaluated on 100 generated samples:


  • After 1000 steps of SA-only RL fine-tuning of the bioaware fine-tuned model using ChemFIE-SA:



References

BibTeX

Qwen2

@misc{yang2024qwen2technicalreport,
      title={Qwen2 Technical Report}, 
      author={An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jianxin Yang and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Xuejing Liu and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhifang Guo and Zhihao Fan},
      year={2024},
      eprint={2407.10671},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.10671}, 
}

COCONUTDB

@article{sorokina2021coconut,
  title={COCONUT online: Collection of Open Natural Products database},
  author={Sorokina, Maria and Merseburger, Peter and Rajan, Kohulan and Yirik, Mehmet Aziz and Steinbeck, Christoph},
  journal={Journal of Cheminformatics},
  volume={13},
  number={1},
  pages={2},
  year={2021},
  doi={10.1186/s13321-020-00478-9}
}

ChEMBL34

@article{zdrazil2023chembl,
  title={The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods},
  author={Zdrazil, Barbara and Felix, Eloy and Hunter, Fiona and Manners, Emma J and Blackshaw, James and Corbett, Sybilla and de Veij, Marleen and Ioannidis, Harris and Lopez, David Mendez and Mosquera, Juan F and Magarinos, Maria Paula and Bosc, Nicolas and Arcila, Ricardo and Kizil{\"o}ren, Tevfik and Gaulton, Anna and Bento, A Patr{\'i}cia and Adasme, Melissa F and Monecke, Peter and Landrum, Gregory A and Leach, Andrew R},
  journal={Nucleic Acids Research},
  year={2023},
  volume={gkad1004},
  doi={10.1093/nar/gkad1004}
}

@misc{chembl34,
  title={ChEMBL34},
  year={2023},
  doi={10.6019/CHEMBL.database.34}
}

SuperNatural3

@article{Gallo2023,
  author = {Gallo, K and Kemmler, E and Goede, A and Becker, F and Dunkel, M and Preissner, R and Banerjee, P},
  title = {{SuperNatural 3.0-a database of natural products and natural product-based derivatives}},
  journal = {Nucleic Acids Research},
  year = {2023},
  month = jan,
  day = {6},
  volume = {51},
  number = {D1},
  pages = {D654-D659},
  doi = {10.1093/nar/gkac1008}
}

Ranger21 Optimizer

@article{wright2021ranger21,
      title={Ranger21: a synergistic deep learning optimizer}, 
      author={Wright, Less and Demeure, Nestor},
      year={2021},
      journal={arXiv preprint arXiv:2106.13731},
}