Human Proteome ESMC Embeddings
Dataset Summary
This dataset provides pre-computed protein sequence embeddings for the complete human proteome (Homo sapiens GRCh38, Ensembl) using EvolutionaryScale's ESMC protein language models. These embeddings capture evolutionary and structural information useful for protein function prediction, similarity search, and transfer learning, and are ready to use without running expensive inference.
Created by BioLM.ai to support computational biology research and ML competitions.
Key Features:
- 236,252 human proteins from the Ensembl GRCh38 reference genome
- Two model sizes: ESMC 300M (30 layers, 960 dims) and ESMC 600M (36 layers, 1152 dims)
- Layer-wise embeddings: mean-pooled representations from all transformer layers
- High quality: invalid sequences filtered, data integrity verified
- Ready to use: no inference needed, load directly for downstream tasks
- Efficient format: sharded Parquet files with Snappy compression (~26 GB total)
- Optimized loading: files sharded to ~3.5 GB each for fast streaming and parallel loading
Use Cases
- Protein function prediction: Train classifiers for GO terms, localization, interactions
- Similarity search: Find proteins with similar structure/function
- Transfer learning: Use as pre-computed features for any protein task
- Competition features: Drop-in features for computational biology competitions
- Visualization: Explore protein space with dimensionality reduction (see the sketch after this list)
- Benchmark datasets: Evaluate protein representation methods
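For the visualization use case, a minimal sketch (assuming a DataFrame df of last-layer rows, loaded as in the Quick Start below, and matplotlib installed; UMAP or t-SNE can be swapped in the same way):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# Stack embeddings into an (n_proteins, dim) matrix
X = np.array(df['mean_embedding'].tolist())
# Project to 2D for plotting
coords = PCA(n_components=2).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], s=1, alpha=0.3)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Human proteome in ESMC embedding space (PCA)')
plt.show()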
Dataset Structure
Files
ESMC 300M Embeddings (3 shards, 3.43 GB each):
- esmc_300m_embeddings-train-0000-of-0003.parquet
- esmc_300m_embeddings-train-0001-of-0003.parquet
- esmc_300m_embeddings-train-0002-of-0003.parquet
ESMC 600M Embeddings (4 shards, 3.71 GB each):
- esmc_600m_embeddings-train-0000-of-0004.parquet
- esmc_600m_embeddings-train-0001-of-0004.parquet
- esmc_600m_embeddings-train-0002-of-0004.parquet
- esmc_600m_embeddings-train-0003-of-0004.parquet
Supporting Files:
- sequences.parquet (32 MB) - source protein sequences & metadata
- skipped_sequences.txt (2.7 MB) - log of filtered sequences
| Dataset | Shards | Size per Shard | Total Size | Total Rows |
|---|---|---|---|---|
| ESMC 300M | 3 | ~3.43 GB | ~10.3 GB | 7,087,560 |
| ESMC 600M | 4 | ~3.71 GB | ~14.8 GB | 8,505,072 |
| Sequences | 1 | 32 MB | 32 MB | 236,252 |
| Total | 8 | - | ~25.7 GB | - |
Why Sharded?
Files are split into ~3.5 GB shards for optimal performance:
- Faster downloads: parallel shard downloads
- Memory efficient: stream one shard at a time (see the streaming sketch below)
- HuggingFace optimized: automatic shard handling with the datasets library
- Resumable transfers: failed downloads can resume individual shards
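As a sketch of the streaming path (using the repository and file names listed above), the datasets library can iterate rows without materializing whole shards in memory:
from datasets import load_dataset
# Stream rows instead of loading entire shards into memory
ds = load_dataset(
    'biolm/human-proteome-esmc-embeddings',
    data_files='esmc_600m_embeddings-train-*.parquet',
    split='train',
    streaming=True,
)
# Peek at a few last-layer rows while streaming
for row in ds.filter(lambda r: r['layer_idx'] == 35).take(5):
    print(row['sequence_id'], len(row['mean_embedding']))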
Schema
Embeddings files (long format: one row per sequence-layer):
{
'sequence_id': str, # e.g., "ENSP00000269305.4" (TP53)
'layer_idx': int, # 0-29 (300M) or 0-35 (600M)
'mean_embedding': List[float], # 960-dim (300M) or 1152-dim (600M)
'sequence_length': int # Amino acids count
}
Sequences file:
{
'sequence_id': str, # Ensembl protein ID
'sequence': str, # Amino acid sequence (20 standard AAs)
'sequence_length': int, # Length in amino acids
'description': str # Full FASTA header with gene metadata
}
Quick Start
Option 1: HuggingFace Datasets Library (Recommended)
The datasets library automatically handles sharded files:
from datasets import load_dataset
import numpy as np
# Load 600M embeddings (all shards loaded automatically)
ds = load_dataset('biolm/human-proteome-esmc-embeddings', data_files='esmc_600m_embeddings-train-*.parquet')
# Access as pandas DataFrame
df = ds['train'].to_pandas()
# Filter to last layer only
last_layer = df[df['layer_idx'] == 35]
print(f"Loaded {len(last_layer):,} proteins Γ 1152 dims")
Option 2: PyArrow (Memory Efficient)
Load specific shards or filter on-the-fly:
import pyarrow.parquet as pq
import pandas as pd
from glob import glob
# Load only last layer from all 600M shards
dfs = []
for shard_file in glob('esmc_600m_embeddings-train-*.parquet'):
table = pq.read_table(
shard_file,
filters=[('layer_idx', '==', 35)] # Last layer only
)
dfs.append(table.to_pandas())
df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df):,} protein embeddings") # 236,252 proteins
Option 3: Polars (Fastest)
import polars as pl
# Lazy load all 600M shards with glob pattern
df = pl.scan_parquet('esmc_600m_embeddings-train-*.parquet')
# Filter and collect efficiently
last_layer = df.filter(pl.col('layer_idx') == 35).collect()
print(f"Shape: {last_layer.shape}") # (236252, 4)
Load Specific Proteins
import numpy as np
import pandas as pd
# Load all shards and filter to specific proteins
df = pd.concat([
pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
for i in range(4)
], ignore_index=True)
# Get TP53 tumor suppressor embeddings (all 36 layers)
tp53_data = df[df['sequence_id'] == 'ENSP00000269305.4'].sort_values('layer_idx')
tp53_embeddings = np.array(tp53_data['mean_embedding'].tolist())
print(f"TP53 shape: {tp53_embeddings.shape}") # (36, 1152)
Train a Classifier (Last Layer Only)
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
# Load only last layer from all shards
dfs = []
for i in range(4): # 4 shards for 600M
df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
dfs.append(df[df['layer_idx'] == 35])
embeddings_df = pd.concat(dfs, ignore_index=True)
# Extract features
X = np.array(embeddings_df['mean_embedding'].tolist()) # (236252, 1152)
# y = your_labels  # e.g., GO terms, subcellular localization
clf = RandomForestClassifier()
clf.fit(X, y)  # requires y to be defined with labels for your task
Protein Similarity Search
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
# Load last layer from all shards
dfs = []
for i in range(4):
df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
dfs.append(df[df['layer_idx'] == 35])
df = pd.concat(dfs, ignore_index=True)
# Query: Find proteins similar to TP53
query_emb = df[df['sequence_id'] == 'ENSP00000269305.4']['mean_embedding'].iloc[0]
all_embs = np.array(df['mean_embedding'].tolist())
similarities = cosine_similarity([query_emb], all_embs)[0]
top_10_indices = similarities.argsort()[-11:-1][::-1]
print("Top 10 proteins similar to TP53:")
for idx in top_10_indices:
seq_id = df.iloc[idx]['sequence_id']
sim = similarities[idx]
print(f" {seq_id}: {sim:.4f}")
Join with Sequences
import pandas as pd
# Load embeddings (last layer only)
embeddings = pd.concat([
pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
for i in range(4)
], ignore_index=True)
embeddings = embeddings[embeddings['layer_idx'] == 35]
# Load sequences
sequences = pd.read_parquet('sequences.parquet')
# Merge
merged = embeddings.merge(sequences, on='sequence_id', how='left')
print(f"Merged shape: {merged.shape}")
print(f"Columns: {merged.columns.tolist()}")
Dataset Statistics
Coverage
- Source: Homo sapiens GRCh38 peptide sequences from Ensembl
- Total in source: 245,535 sequences
- Processed: 236,252 sequences (96.2%)
- Filtered: 9,283 sequences (3.8% - containing ambiguous/invalid amino acids)
Sequence Characteristics
- Length range: 1 - 35,991 amino acids
- Mean length: ~460 AA
- Median length: ~282 AA
- Valid amino acids: 20 standard (ACDEFGHIKLMNPQRSTVWY)
Model Comparison
| Model | Params | Layers | Embed Dim | Shards | Total Size | Total Rows |
|---|---|---|---|---|---|---|
| ESMC 300M | 300M | 30 | 960 | 3 | 10.3 GB | 7,087,560 |
| ESMC 600M | 600M | 36 | 1152 | 4 | 14.8 GB | 8,505,072 |
Generation Details
Models
- ESMC 300M: EvolutionaryScale/esmc-300m-2024-12 (revision a19d363)
- ESMC 600M: EvolutionaryScale/esmc-600m-2024-12 (revision d11cc14)
- Library: ESMC v3.1.3 from EvolutionaryScale
Processing Pipeline
- Tokenize sequences with BOS/EOS tokens
- Forward pass through all layers (model.eval(), torch.no_grad())
- Remove BOS/EOS tokens from outputs
- Mean pool across the sequence length dimension
- Extract to CPU as float32
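As an illustrative sketch (not the exact generation script), the mean-pooling step corresponds to the following, assuming per-layer hidden states of shape (batch, seq_len, dim) and a boolean mask that is True only at real residue positions (BOS/EOS/padding excluded):
import torch
def mean_pool(hidden_states: torch.Tensor, residue_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim); residue_mask: (batch, seq_len) bool
    mask = residue_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                 # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1)                      # (batch, 1)
    return (summed / counts).float().cpu()                     # float32 on CPU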
Configuration
- Batch size: Adaptive (8 for sequences ≤4096 AA, 1 for longer sequences)
- Max length: 50,000 amino acids
- Random seed: 42 (reproducible)
- Hardware: NVIDIA RTX A6000 (48GB VRAM)
- Quality checks: no missing values, correct layer counts, no duplicates
- Sharding: Split to ~3.5 GB per shard for optimal HuggingFace compatibility
FAQ
Q: Which layer should I use? A: The last layer (29 for 300M, 35 for 600M) typically works best for downstream tasks. Some applications benefit from intermediate layers or combining multiple layers.
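As a sketch of combining layers (assuming df holds all rows for the 600M model, loaded as in Option 1 above), one simple approach averages the last four layers per protein:
import numpy as np
# Keep layers 32-35 of the 600M model and average them per protein
subset = df[df['layer_idx'] >= 32]
combined = (
    subset.groupby('sequence_id')['mean_embedding']
    .apply(lambda embs: np.mean(np.stack([np.asarray(e) for e in embs]), axis=0))
)
print(combined.iloc[0].shape)  # (1152,)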
Q: How do I load all shards at once? A: Use glob patterns with pandas/polars:
from glob import glob
import pandas as pd
df = pd.concat([
pd.read_parquet(f) for f in glob('esmc_600m_embeddings-train-*.parquet')
], ignore_index=True)
Or use HuggingFace datasets library which handles shards automatically.
Q: Can I load just one shard? A: Yes! Each shard is independent and contains a subset of proteins. Useful for memory-constrained environments or parallel processing.
Q: 300M vs 600M - which to use? A: 600M is larger and may capture more nuanced patterns. 300M is faster to work with. We recommend trying both!
Q: Are embeddings normalized? A: No, these are raw mean-pooled embeddings. Apply L2 normalization if needed for cosine similarity.
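A minimal sketch of that normalization step, assuming X is an (n_proteins, embedding_dim) NumPy array:
from sklearn.preprocessing import normalize
X_normed = normalize(X, norm='l2')  # each row now has unit L2 norm for cosine/dot-product search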
Q: What sequences were filtered out? A: 9,283 sequences (3.8%) containing non-standard amino acids:
- X (ambiguous): 9,049 sequences
- * (stop codon): 152 sequences
- U (selenocysteine): 89 sequences
Q: Can I use this commercially? A: Yes! Under CC BY 4.0 license - free to use with attribution to BioLM.ai.
Q: How are proteins distributed across shards? A: Proteins are split sequentially (by row order) across shards. To get all layers for a protein, you may need to check all shards (though typically a protein's layers are in the same shard).
Q: Which shard contains a specific protein?
A: Load the sequences.parquet file to see all sequence IDs, then search each shard. Or use the HuggingFace datasets library which handles this automatically.
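A small sketch for locating a protein across locally downloaded 600M shards (the query ID is just the TP53 example used elsewhere on this card):
import pyarrow.parquet as pq
from glob import glob
target = 'ENSP00000269305.4'
for shard_file in sorted(glob('esmc_600m_embeddings-train-*.parquet')):
    # Read only the rows for the target protein from this shard
    hits = pq.read_table(shard_file, filters=[('sequence_id', '==', target)])
    if hits.num_rows > 0:
        print(f"{target}: {hits.num_rows} layer rows in {shard_file}")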
Citation
If you use this dataset in your work, please cite:
@dataset{biolm_human_proteome_esmc_2025,
title={Human Proteome ESMC Embeddings},
author={BioLM.ai},
year={2025},
month={October},
publisher={HuggingFace},
url={https://huggingface.co/datasets/biolm/human-proteome-esmc-embeddings}
}
And the ESMC model:
@article{esmc2024,
title={Evolutionary Scale Modeling: Protein Language Models},
author={EvolutionaryScale},
year={2024},
url={https://github.com/evolutionaryscale/esm}
}
License
CC BY 4.0 - Free to use with attribution to BioLM.ai
- Source data (Ensembl): Freely available
- ESMC models: Apache 2.0
- This dataset: CC BY 4.0
Acknowledgments
- EvolutionaryScale for developing and open-sourcing ESMC models
- Ensembl for curating and maintaining the human proteome reference
- HuggingFace for hosting and serving this dataset
Contact & Support
- Organization: BioLM.ai
- Python SDK: py-biolm - Run inference on ESMC and many other biosequence models via API
- HuggingFace Discussions: Use the Community tab for questions and feedback
- Issues: Report problems via HuggingFace Discussions
Version: 1.0.0 | Last updated: October 2025 | Dataset size: ~26 GB (8 sharded files)