Overview
This is the C2S-Pythia-410m-cell-type-prediction model. It is based on the Pythia-410m architecture developed by EleutherAI and was fine-tuned using Cell2Sentence (C2S) on a diverse set of single-cell RNA sequencing (scRNA-seq) datasets from CellxGene and the Human Cell Atlas. Cell2Sentence adapts large language models (LLMs) to single-cell biology by transforming scRNA-seq data into "cell sentences": sequences of gene names ordered from highest to lowest expression. Because a cell sentence is plain text, an LLM can be applied to single-cell tasks with its usual text-generation machinery. This model focuses on cell type prediction.
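To make the cell sentence idea concrete, here is a minimal sketch in plain Python. It is not the Cell2Sentence codebase's implementation: the function name `cell_sentence`, the `top_k` parameter, the toy expression values, the alphabetical tie-breaking, and the choice to drop genes with zero expression are all assumptions made for illustration.

```python
def cell_sentence(expression, top_k=None):
    """Return gene names ordered from highest to lowest expression.

    expression: dict mapping gene name -> expression value for one cell.
    top_k: optional cap on the number of genes kept (hypothetical parameter).
    """
    # Assumption of this sketch: genes with zero expression are left out.
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending expression; break ties alphabetically so the
    # output is deterministic.
    expressed.sort(key=lambda item: (-item[1], item[0]))
    genes = [gene for gene, _ in expressed]
    if top_k is not None:
        genes = genes[:top_k]
    return " ".join(genes)


# Toy cell with made-up expression values.
toy_cell = {"CD3E": 12.0, "MS4A1": 0.0, "CD8A": 7.5, "GZMK": 3.0, "ACTB": 40.0}
print(cell_sentence(toy_cell, top_k=3))  # prints "ACTB CD3E CD8A"
```

In practice the expression values would come from a preprocessed scRNA-seq matrix, and the official Cell2Sentence code should be preferred for producing inputs that match what the model saw during training.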
Training Data
This model was trained on over 57 million human and mouse cells drawn from more than 800 scRNA-seq datasets from CellxGene and the Human Cell Atlas. The data cover a broad range of cell types and conditions across multiple tissues in both species.
Each cell sentence used in training contained the cell's top 200 genes, ranked by expression.
Tasks
This model is designed for:
- Cell type prediction: Predicting the cell type based on the "cell sentence" generated from scRNA-seq data.
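A rough sketch of running cell type prediction with the Hugging Face `transformers` library follows. Several things here are assumptions and not confirmed by this card: the repository id in `MODEL_ID`, the prompt wording in `build_prompt`, and the helper names themselves. The prompt format used during fine-tuning is defined in the Cell2Sentence codebase, so treat the wording below as a placeholder. The 200-gene cap mirrors the training setup described above.

```python
MODEL_ID = "vandijklab/C2S-Pythia-410m-cell-type-prediction"  # assumed repo id
MAX_GENES = 200  # the model was trained with the top 200 genes per cell sentence


def build_prompt(sentence, max_genes=MAX_GENES):
    """Truncate a cell sentence to max_genes genes and wrap it in a prompt."""
    genes = sentence.split()[:max_genes]
    # Hypothetical prompt wording; check the Cell2Sentence repository for
    # the template the model was actually fine-tuned with.
    return (
        "Predict the cell type of this cell sentence: "
        + " ".join(genes)
        + "\nCell type:"
    )


def predict_cell_type(sentence, max_new_tokens=20):
    """Generate a cell type label for one cell sentence (greedy decoding)."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(build_prompt(sentence), return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `predict_cell_type("ACTB CD3E CD8A GZMK")` downloads the weights on first use and returns the generated text. How closely that text matches a clean cell type label depends on using the same prompt template as in fine-tuning.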
Cell2Sentence Links
- GitHub: https://github.com/vandijklab/cell2sentence (Note: the codebase is licensed under CC BY-NC-ND 4.0. Only the weights shared on Hugging Face are CC0 1.0.)
- Paper: https://www.biorxiv.org/content/10.1101/2023.09.11.557287v3
Pythia Links
- Paper: https://arxiv.org/pdf/2304.01373
- Hugging Face: https://huggingface.co/EleutherAI/pythia-410m
