FAMA: A Scalable Foundational Astronomical Masked Autoencoder
FAMA (Foundational Astronomical Masked Autoencoder) is a self-supervised, foundational image model based on the Masked Autoencoder (MAE) architecture, optimized for the unique properties of astronomical data. It is designed to overcome the challenge of heterogeneous, unlabelled image datasets accumulating from wide-field surveys like the DESI Legacy Imaging Surveys and the upcoming Chinese Space Station Telescope (CSST).
The model achieves robust, generalized feature extraction by pre-training the Vision Transformer (ViT) encoder using a high-ratio masking strategy.
💡 Key Results and Highlights
- Superior Performance: FAMA yields significant performance gains over supervised baselines in downstream tasks like galaxy classification, object detection, and redshift estimation.
- Robust Transferability: It demonstrates effective transferability from DESI to SDSS data, successfully mitigating the domain shift problem between different observational instruments.
- Optimal MAE Configuration: FAMA uses an MAE configuration tuned to astronomical data: a 75% masking ratio and a lightweight decoder that is 1 layer deep and 512 dimensions wide (see the masking sketch after this list).
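The high-ratio masking at the heart of pre-training is easy to illustrate. Below is a minimal PyTorch sketch of per-sample random masking at a 75% ratio, written for this README rather than taken from the FAMA codebase; the shapes assume the ViT-B setting (a 256 × 256 image split into 256 patches of embedding dimension 768) and the function name is illustrative.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly drop a fraction of patch tokens per sample (MAE-style).

    tokens: (batch, num_patches, embed_dim) patch embeddings.
    Returns the kept tokens, a binary mask (1 = removed), and the
    indices needed to restore the original patch order.
    """
    batch, num_patches, embed_dim = tokens.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))

    # A random score per patch; sorting it gives a random permutation.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)
    ids_restore = torch.argsort(ids_shuffle, dim=1)

    # Keep only the first `num_keep` patches of the shuffled order.
    ids_keep = ids_shuffle[:, :num_keep]
    tokens_kept = torch.gather(
        tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, embed_dim)
    )

    # Binary mask in the original patch order: 0 = kept, 1 = masked.
    mask = torch.ones(batch, num_patches, device=tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return tokens_kept, mask, ids_restore

# Example: a 256x256 image with 16-pixel patches gives a 16x16 grid = 256 patches.
x = torch.randn(4, 256, 768)                      # (batch, patches, embed_dim), ViT-B
kept, mask, ids_restore = random_masking(x, mask_ratio=0.75)
print(kept.shape)                                 # torch.Size([4, 64, 768])
```

Only the visible 25% of patches pass through the encoder; the lightweight decoder reconstructs the masked patches during pre-training and is discarded afterwards.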
🏗️ FAMA Architecture Specifications
FAMA adopts an asymmetric encoder-decoder architecture, utilizing standard ViT models (ViT-B, ViT-L, ViT-H) for the encoder backbone. The lightweight decoder is discarded after pre-training.
| Architecture | Layers | Patch Size | Embed Dim | MLP Size | Heads | Parameters |
|---|---|---|---|---|---|---|
| ViT-Base (FAMA-B) | 12 | 16 | 768 | 3,072 | 12 | 86M |
| ViT-Large (FAMA-L) | 24 | 16 | 1,024 | 4,096 | 16 | 303M |
| ViT-Huge (FAMA-H) | 24 | 14 | 1,536 | 6,144 | 16 | 680M |
📦 Model Weights (Pre-trained Encoder Only)
The weights provided below are the pre-trained encoders (ViT-B, ViT-L, ViT-H) from the self-supervised MAE phase, ready for transfer learning via fine-tuning or linear probing (a loading sketch follows the table).
| Model Size | Weights File | Pre-train Data |
|---|---|---|
| FAMA-B | base_patch16.pth | DESI-1M |
| FAMA-L | large_patch16.pth | DESI-1M |
| FAMA-H | huge_patch14.pth | DESI-1M |
Note: The pre-training pool was a random sample of 2 million galaxies from the DESI Legacy Imaging Surveys DR9, augmented with background cutouts (the DESI-2M set referenced below). DESI-1M is the 1-million-sample subset actually used for the pre-training experiments.
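To begin transfer learning, load a checkpoint into a matching ViT encoder. The sketch below is a minimal example that assumes a timm ViT-B/16 backbone and an MAE-style checkpoint whose encoder weights may be nested under a 'model' key; both the timm model name and the checkpoint layout are assumptions to verify against the actual file.

```python
import torch
import timm

# Build a ViT-B/16 backbone matching FAMA-B (assumed timm model name).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)

# Load the pre-trained encoder weights (checkpoint layout is assumed).
checkpoint = torch.load("base_patch16.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)   # unwrap if nested under "model"

# strict=False tolerates keys (e.g. decoder weights) that the encoder does not use.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```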
📊 Performance Benchmarks
FAMA models were rigorously validated across three distinct transfer learning tasks: Classification (Galaxy Morphology), Regression (Photometric Redshift), and Detection (Gravitational Lensing).
1. Galaxy Classification (Full Fine-tuning)
| Method | Backbone | Pre-train Data | Accuracy (%) on galaxy-desi | Accuracy (%) on galaxy-sdss |
|---|---|---|---|---|
| FAMA (ours) | ViT-H | DESI-1M | 89.10 | 96.02 |
2. Gravitational Lensing Detection
FAMA achieves the highest Average Precision (AP) scores for strong gravitational lensing detection using the ViTDet adaptation.
| Method | Backbone | AP | AP75 |
|---|---|---|---|
| FAMA (ours) | ViT-H | 42.62 | 49.43 |
3. Redshift Prediction (Cross-Domain)
The model pre-trained on DESI data is fine-tuned on the SDSS Redshift dataset; a sketch of the reported metrics follows the table.
| Backbone | Δz (Bias, Lower is Better) | σ_MAD (Dispersion, Lower is Better) |
|---|---|---|
| FAMA ViT-H | 0.51 × 10⁻⁴ | 0.56 × 10⁻² |
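For reference, the sketch below computes these two quantities under the common photometric-redshift conventions (residuals normalized by 1 + z_true, and σ_MAD as the normalized median absolute deviation); confirm against the paper that FAMA reports exactly these definitions.

```python
import numpy as np

def photoz_metrics(z_pred: np.ndarray, z_true: np.ndarray):
    """Bias and NMAD dispersion of photometric-redshift residuals.

    Assumes the common conventions:
        dz        = (z_pred - z_true) / (1 + z_true)
        bias      = mean(dz)
        sigma_MAD = 1.4826 * median(|dz - median(dz)|)
    """
    dz = (z_pred - z_true) / (1.0 + z_true)
    bias = float(np.mean(dz))
    sigma_mad = float(1.4826 * np.median(np.abs(dz - np.median(dz))))
    return bias, sigma_mad

# Toy example with synthetic redshifts
rng = np.random.default_rng(0)
z_true = rng.uniform(0.0, 0.8, size=10_000)
z_pred = z_true + rng.normal(0.0, 0.01, size=z_true.size) * (1 + z_true)
print(photoz_metrics(z_pred, z_true))
```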
🛠️ How to Use for Transfer Learning
The following steps outline the use of the FAMA encoder weights for fine-tuning on a downstream task (e.g., classification).
1. Preprocessing
The model was pre-trained using the following data processing steps:
- Input image size: 3 × 256 × 256 pixels, extracted at 0.262 arcsec/pixel in the g, r, and z bands.
- Normalization: The training utilized channel-wise mean and standard deviation calculated from the DESI-2M dataset.
- Final input to the ViT is 224 × 224; the 256 × 256 cutouts are brought to this size via a resizing step (see the preprocessing sketch below).
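Below is a minimal torchvision preprocessing pipeline consistent with the steps above. The channel-wise mean and standard deviation are placeholders (the actual DESI-2M statistics are not listed in this README), and resizing from 256 × 256 to 224 × 224 is one reasonable way to match the ViT input size.

```python
import numpy as np
from torchvision import transforms

# Placeholder statistics: substitute the per-band (g, r, z) mean and std
# computed from the DESI-2M dataset.
DESI_MEAN = (0.0, 0.0, 0.0)   # placeholder values
DESI_STD = (1.0, 1.0, 1.0)    # placeholder values

preprocess = transforms.Compose([
    transforms.ToTensor(),                           # HWC array -> CHW float tensor
    transforms.Resize((224, 224), antialias=True),   # 256x256 cutout -> ViT input size
    transforms.Normalize(mean=DESI_MEAN, std=DESI_STD),
])

# Example: a fake 3-band (g, r, z) 256x256 cutout
cutout = np.random.rand(256, 256, 3).astype("float32")
x = preprocess(cutout)
print(x.shape)                                       # torch.Size([3, 224, 224])
```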
2. Fine-Tuning Setup
Load the weights into a standard ViT encoder and attach a task-specific head.
- Classification: Attach a Linear Layer to the ViT's final [CLS] token output and train with Cross-Entropy loss (see the sketch after this list).
- Redshift Regression: Attach a Linear Regression Head to the ViT's final [CLS] token output. Use Mean Squared Error (MSE) loss.
- Object Detection: Adapt the ViT to the ViTDet framework, which builds a multi-scale feature pyramid from the ViT blocks.
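As a concrete illustration of the classification case, the sketch below wraps a ViT backbone with a linear head on the [CLS] token. The `FAMAClassifier` wrapper and the use of timm's `forward_features` (assumed here to return the full token sequence with the [CLS] token first, as in recent timm versions) are illustrative choices, not the authors' code.

```python
import torch
import torch.nn as nn
import timm

class FAMAClassifier(nn.Module):
    """Pre-trained ViT encoder + linear head on the [CLS] token (illustrative)."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone.forward_features(x)   # (B, 1 + num_patches, D) in recent timm
        cls_token = tokens[:, 0]                     # the [CLS] token
        return self.head(cls_token)

# Example: ViT-B encoder (embed dim 768) with a 10-class head and Cross-Entropy loss.
backbone = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
model = FAMAClassifier(backbone, embed_dim=768, num_classes=10)
criterion = nn.CrossEntropyLoss()

logits = model(torch.randn(2, 3, 224, 224))
loss = criterion(logits, torch.tensor([1, 7]))
```

For regression, swapping the head for `nn.Linear(embed_dim, 1)` and the loss for `nn.MSELoss()` follows the same pattern; the ViTDet adaptation for detection is more involved and is not sketched here.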
3. Hyperparameters (Example for Classification)
The following fine-tuning configurations were used for the galaxy-desi classification task (a code sketch follows the table):
| Config | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| Optimizer | AdamW | AdamW | AdamW |
| Learning Rate | 1.5 × 10⁻³ | 2 × 10⁻³ | 1 × 10⁻³ |
| Batch Size | 64 | 64 | 32 |
| Training Epochs | 50 | 50 | 50 |
| LR Schedule | Cosine Decay | Cosine Decay | Cosine Decay |
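Putting the table into code, the snippet below sets up AdamW with a cosine-decay schedule for the ViT-Base column. The weight-decay value, the per-epoch scheduler step, and the stand-in model and data are assumptions; substitute the FAMA encoder with its classification head and the real galaxy-desi loader.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in model and dataset so the loop runs end to end.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))
criterion = nn.CrossEntropyLoss()
train_loader = DataLoader(
    TensorDataset(torch.randn(128, 3, 224, 224), torch.randint(0, 10, (128,))),
    batch_size=64, shuffle=True,                     # batch size 64 (ViT-Base column)
)

optimizer = AdamW(model.parameters(), lr=1.5e-3, weight_decay=0.05)  # weight decay assumed
scheduler = CosineAnnealingLR(optimizer, T_max=50)                   # cosine decay over 50 epochs

for epoch in range(50):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                 # step the schedule once per epoch
```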
📝 Citation
If you use FAMA in your research, please cite the associated work:
@article{FAMA_2025,
title={FAMA -- a Scalable Foundational Astronomical Masked Autoencoder for Astronomical Image Analysis},
author={Lv, Jiameng and Li, Xu and Cao, Liang and Gao, Xi and Li, Nan and Fu, Mingxiang and Li, Yushan and Duan, Manni and Jia, Peng},
journal={Preprint submitted to Elsevier},
year={2025}
}