Human Proteome ESMC Embeddings

Complete layer-wise protein embeddings for 236,252 human proteins using ESMC models

License: CC BY 4.0 | Model: ESMC | Created by BioLM.ai

📊 Dataset Summary

This dataset provides pre-computed protein sequence embeddings for the complete human proteome (Homo sapiens GRCh38, Ensembl) using EvolutionaryScale's ESMC protein language models. These embeddings capture evolutionary and structural information useful for protein function prediction, similarity search, and transfer learning tasks - ready to use without requiring expensive inference.

Created by BioLM.ai to support computational biology research and ML competitions.

Key Features:

  • 🧬 236,252 human proteins from Ensembl GRCh38 reference genome
  • 🤖 Two model sizes: ESMC 300M (30 layers, 960 dims) and ESMC 600M (36 layers, 1152 dims)
  • 📏 Layer-wise embeddings: Mean-pooled representations from all transformer layers
  • ✨ High quality: Filtered invalid sequences, verified data integrity
  • 🚀 Ready to use: No inference needed - directly load and use for downstream tasks
  • 📦 Efficient format: Sharded parquet files with snappy compression (~26 GB total)
  • ⚡ Optimized loading: Files sharded to ~3.5 GB each for fast streaming and parallel loading

🎯 Use Cases

  • Protein function prediction: Train classifiers for GO terms, localization, interactions
  • Similarity search: Find proteins with similar structure/function
  • Transfer learning: Use as pre-computed features for any protein task
  • Competition features: Drop-in features for computational biology competitions
  • Visualization: Explore protein space with dimensionality reduction
  • Benchmark datasets: Evaluate protein representation methods

πŸ—‚οΈ Dataset Structure

Files

ESMC 300M Embeddings (3 shards, 3.43 GB each):

  • esmc_300m_embeddings-train-0000-of-0003.parquet
  • esmc_300m_embeddings-train-0001-of-0003.parquet
  • esmc_300m_embeddings-train-0002-of-0003.parquet

ESMC 600M Embeddings (4 shards, 3.71 GB each):

  • esmc_600m_embeddings-train-0000-of-0004.parquet
  • esmc_600m_embeddings-train-0001-of-0004.parquet
  • esmc_600m_embeddings-train-0002-of-0004.parquet
  • esmc_600m_embeddings-train-0003-of-0004.parquet

Supporting Files:

  • sequences.parquet (32 MB) - Source protein sequences & metadata
  • skipped_sequences.txt (2.7 MB) - Filtered sequences log

Dataset    | Shards | Size per Shard | Total Size | Total Rows
ESMC 300M  | 3      | ~3.43 GB       | ~10.3 GB   | 7,087,560
ESMC 600M  | 4      | ~3.71 GB       | ~14.8 GB   | 8,505,072
Sequences  | 1      | 32 MB          | 32 MB      | 236,252
Total      | 8      | -              | ~25.7 GB   | -

Why Sharded?

Files are split into ~3.5 GB shards for optimal performance:

  • ✅ Faster downloads: Parallel shard downloads
  • ✅ Memory efficient: Stream one shard at a time
  • ✅ HuggingFace optimized: Automatic shard handling with datasets library
  • ✅ Resumable transfers: Failed downloads can resume individual shards
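
As an illustration of the "stream one shard at a time" pattern, here is a minimal sketch using pyarrow's batch iterator; the filename is one of the 600M shards listed above:

import pyarrow.parquet as pq

# Stream one shard in record batches instead of materializing the whole file
pf = pq.ParquetFile('esmc_600m_embeddings-train-0000-of-0004.parquet')
for batch in pf.iter_batches(batch_size=1024,
                             columns=['sequence_id', 'layer_idx', 'mean_embedding']):
    chunk = batch.to_pandas()
    last_layer = chunk[chunk['layer_idx'] == 35]  # e.g., keep only the final layer
    # ...process last_layer here...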

Schema

Embeddings files (long format: one row per sequence-layer):

{
    'sequence_id': str,           # e.g., "ENSP00000269305.4" (TP53)
    'layer_idx': int,             # 0-29 (300M) or 0-35 (600M)
    'mean_embedding': List[float], # 960-dim (300M) or 1152-dim (600M)
    'sequence_length': int        # Amino acids count
}

Sequences file:

{
    'sequence_id': str,      # Ensembl protein ID
    'sequence': str,         # Amino acid sequence (20 standard AAs)
    'sequence_length': int,  # Length in amino acids
    'description': str       # Full FASTA header with gene metadata
}
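
To double-check the column layout of a downloaded file before working with it, you can print its parquet schema directly (an optional sanity check):

import pyarrow.parquet as pq

# Inspect the Arrow schema of one embeddings shard and of the sequences file
print(pq.read_schema('esmc_600m_embeddings-train-0000-of-0004.parquet'))
print(pq.read_schema('sequences.parquet'))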

🚀 Quick Start

Option 1: HuggingFace Datasets Library (Recommended)

The datasets library automatically handles sharded files:

from datasets import load_dataset
import numpy as np

# Load 600M embeddings (all shards loaded automatically)
ds = load_dataset('biolm/human-proteome-esmc-embeddings', data_files='esmc_600m_embeddings-train-*.parquet')

# Access as pandas DataFrame
df = ds['train'].to_pandas()

# Filter to last layer only
last_layer = df[df['layer_idx'] == 35]
print(f"Loaded {len(last_layer):,} proteins Γ— 1152 dims")

Option 2: PyArrow (Memory Efficient)

Load specific shards or filter on-the-fly:

import pyarrow.parquet as pq
import pandas as pd
from glob import glob

# Load only last layer from all 600M shards
dfs = []
for shard_file in glob('esmc_600m_embeddings-train-*.parquet'):
    table = pq.read_table(
        shard_file,
        filters=[('layer_idx', '==', 35)]  # Last layer only
    )
    dfs.append(table.to_pandas())

df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df):,} protein embeddings")  # 236,252 proteins

Option 3: Polars (Fastest)

import polars as pl

# Lazy load all 600M shards with glob pattern
df = pl.scan_parquet('esmc_600m_embeddings-train-*.parquet')

# Filter and collect efficiently
last_layer = df.filter(pl.col('layer_idx') == 35).collect()
print(f"Shape: {last_layer.shape}")  # (236252, 4)

Load Specific Proteins

import numpy as np
import pandas as pd

# Load all shards and filter to specific proteins
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)

# Get TP53 tumor suppressor embeddings (all 36 layers)
tp53_data = df[df['sequence_id'] == 'ENSP00000269305.4'].sort_values('layer_idx')
tp53_embeddings = np.array(tp53_data['mean_embedding'].tolist())
print(f"TP53 shape: {tp53_embeddings.shape}")  # (36, 1152)

Train a Classifier (Last Layer Only)

from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Load only last layer from all shards
dfs = []
for i in range(4):  # 4 shards for 600M
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

embeddings_df = pd.concat(dfs, ignore_index=True)

# Extract features
X = np.array(embeddings_df['mean_embedding'].tolist())  # (236252, 1152)
# y = your_labels  # e.g., GO terms, subcellular localization

clf = RandomForestClassifier()
clf.fit(X, y)  # define y (labels aligned with the rows of X) before fitting

Protein Similarity Search

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

# Load last layer from all shards
dfs = []
for i in range(4):
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

df = pd.concat(dfs, ignore_index=True)

# Query: Find proteins similar to TP53
query_emb = df[df['sequence_id'] == 'ENSP00000269305.4']['mean_embedding'].iloc[0]
all_embs = np.array(df['mean_embedding'].tolist())

similarities = cosine_similarity([query_emb], all_embs)[0]
top_10_indices = similarities.argsort()[-11:-1][::-1]

print("Top 10 proteins similar to TP53:")
for idx in top_10_indices:
    seq_id = df.iloc[idx]['sequence_id']
    sim = similarities[idx]
    print(f"  {seq_id}: {sim:.4f}")

Join with Sequences

import pandas as pd

# Load embeddings (last layer only)
embeddings = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
embeddings = embeddings[embeddings['layer_idx'] == 35]

# Load sequences
sequences = pd.read_parquet('sequences.parquet')

# Merge
merged = embeddings.merge(sequences, on='sequence_id', how='left')
print(f"Merged shape: {merged.shape}")
print(f"Columns: {merged.columns.tolist()}")

📈 Dataset Statistics

Coverage

  • Source: Homo sapiens GRCh38 peptide sequences from Ensembl
  • Total in source: 245,535 sequences
  • Processed: 236,252 sequences (96.2%)
  • Filtered: 9,283 sequences (3.8% - containing ambiguous/invalid amino acids)

Sequence Characteristics

  • Length range: 1 - 35,991 amino acids
  • Mean length: ~460 AA
  • Median length: ~282 AA
  • Valid amino acids: 20 standard (ACDEFGHIKLMNPQRSTVWY)
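
These statistics can be recomputed from sequences.parquet using the sequence_length column documented above:

import pandas as pd

seqs = pd.read_parquet('sequences.parquet')
print(f"Count:  {len(seqs):,}")
print(f"Range:  {seqs['sequence_length'].min()} - {seqs['sequence_length'].max()} AA")
print(f"Mean:   {seqs['sequence_length'].mean():.0f} AA")
print(f"Median: {seqs['sequence_length'].median():.0f} AA")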

Model Comparison

Model     | Params | Layers | Embed Dim | Shards | Total Size | Total Rows
ESMC 300M | 300M   | 30     | 960       | 3      | 10.3 GB    | 7,087,560
ESMC 600M | 600M   | 36     | 1152      | 4      | 14.8 GB    | 8,505,072

🔬 Generation Details

Models

  • ESMC 300M: EvolutionaryScale/esmc-300m-2024-12 (revision: a19d363)
  • ESMC 600M: EvolutionaryScale/esmc-600m-2024-12 (revision: d11cc14)
  • Library: ESMC v3.1.3 from EvolutionaryScale

Processing Pipeline

  1. ✅ Tokenize sequences with BOS/EOS tokens
  2. ✅ Forward pass through all layers (model.eval(), torch.no_grad())
  3. ✅ Remove BOS/EOS tokens from outputs
  4. ✅ Mean pool across sequence length dimension
  5. ✅ Extract to CPU as float32
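
For reference, here is a minimal pooling sketch covering steps 3-5. It assumes per-layer token embeddings have already been obtained from the model as a tensor of shape (num_layers, seq_len + 2, dim) including BOS/EOS; it is illustrative only and does not reproduce the actual generation script.

import torch

def mean_pool_layers(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (num_layers, seq_len + 2, dim), BOS/EOS included
    trimmed = hidden_states[:, 1:-1, :]    # step 3: drop BOS/EOS positions
    pooled = trimmed.mean(dim=1)           # step 4: mean pool over sequence length
    return pooled.to(torch.float32).cpu()  # step 5: float32 on CPU -> (num_layers, dim)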

Configuration

  • Batch size: Adaptive (8 for ≤4096 AA, 1 for longer sequences)
  • Max length: 50,000 amino acids
  • Random seed: 42 (reproducible)
  • Hardware: NVIDIA RTX A6000 (48GB VRAM)
  • Quality checks: ✅ No missing values, ✅ Correct layer counts, ✅ No duplicates
  • Sharding: Split to ~3.5 GB per shard for optimal HuggingFace compatibility
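
The adaptive batch-size rule above can be written as a one-line helper (illustrative only; the generation script itself is not part of this dataset):

def batch_size_for(sequence_length: int) -> int:
    # 8 sequences per batch up to 4096 AA, single-sequence batches beyond that
    return 8 if sequence_length <= 4096 else 1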

❓ FAQ

Q: Which layer should I use? A: The last layer (29 for 300M, 35 for 600M) typically works best for downstream tasks. Some applications benefit from intermediate layers or combining multiple layers.
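
If you want to experiment with combining layers, one simple approach is averaging the last few layers per protein. A sketch, assuming the long-format DataFrame df from the loading examples above (600M model, layers 32-35):

import numpy as np

last4 = df[df['layer_idx'] >= 32]
combined = (
    last4.groupby('sequence_id')['mean_embedding']
    .apply(lambda embs: np.mean(np.stack(embs.to_list()), axis=0))
)
X = np.stack(combined.to_list())  # (n_proteins, 1152)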

Q: How do I load all shards at once? A: Use glob patterns with pandas/polars:

import pandas as pd
from glob import glob
df = pd.concat([
    pd.read_parquet(f) for f in glob('esmc_600m_embeddings-train-*.parquet')
], ignore_index=True)

Or use HuggingFace datasets library which handles shards automatically.

Q: Can I load just one shard? A: Yes! Each shard is independent and contains a subset of proteins. Useful for memory-constrained environments or parallel processing.
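
For example, reading a single shard and checking how many proteins it covers:

import pandas as pd

shard = pd.read_parquet('esmc_600m_embeddings-train-0000-of-0004.parquet')
print(f"{shard['sequence_id'].nunique():,} proteins in this shard")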

Q: 300M vs 600M - which to use? A: 600M is larger and may capture more nuanced patterns. 300M is faster to work with. We recommend trying both!

Q: Are embeddings normalized? A: No, these are raw mean-pooled embeddings. Apply L2 normalization if needed for cosine similarity.
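
For example, assuming X is an (n_proteins, dim) embedding matrix built as in the classifier example above:

from sklearn.preprocessing import normalize

X_normalized = normalize(X, norm='l2')  # scale each row to unit length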

Q: What sequences were filtered out? A: 9,283 sequences (3.8%) containing non-standard amino acids:

  • X (ambiguous): 9,049 sequences
  • * (stop codon): 152 sequences
  • U (selenocysteine): 89 sequences

Q: Can I use this commercially? A: Yes! Under CC BY 4.0 license - free to use with attribution to BioLM.ai.

Q: How are proteins distributed across shards? A: Proteins are split sequentially (by row order) across shards. To get all layers for a protein, you may need to check all shards (though typically a protein's layers are in the same shard).

Q: Which shard contains a specific protein? A: Load the sequences.parquet file to see all sequence IDs, then search each shard for the ID you need. Alternatively, load all shards with the HuggingFace datasets library and filter by sequence_id.
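
A sketch of locating a protein by scanning only the sequence_id column of each shard (cheap, since the embedding column is never read):

import pyarrow.parquet as pq
from glob import glob

target = 'ENSP00000269305.4'  # TP53, as in the examples above
for shard_file in sorted(glob('esmc_600m_embeddings-train-*.parquet')):
    table = pq.read_table(shard_file, columns=['sequence_id'])
    if target in set(table['sequence_id'].to_pylist()):
        print(f"{target} found in {shard_file}")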

📚 Citation

If you use this dataset in your work, please cite:

@dataset{biolm_human_proteome_esmc_2025,
  title={Human Proteome ESMC Embeddings},
  author={BioLM.ai},
  year={2025},
  month={October},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/biolm/human-proteome-esmc-embeddings}
}

And the ESMC model:

@article{esmc2024,
  title={Evolutionary Scale Modeling: Protein Language Models},
  author={EvolutionaryScale},
  year={2024},
  url={https://github.com/evolutionaryscale/esm}
}

📄 License

CC BY 4.0 - Free to use with attribution to BioLM.ai

  • Source data (Ensembl): Freely available
  • ESMC models: Apache 2.0
  • This dataset: CC BY 4.0

πŸ™ Acknowledgments

  • EvolutionaryScale for developing and open-sourcing ESMC models
  • Ensembl for curating and maintaining the human proteome reference
  • HuggingFace for hosting and serving this dataset

📞 Contact & Support

  • Organization: BioLM.ai
  • Python SDK: py-biolm - Run inference on ESMC and many other biosequence models via API
  • HuggingFace Discussions: Use the Community tab for questions and feedback
  • Issues: Report problems via HuggingFace Discussions

Version: 1.0.0 | Last updated: October 2025 | Dataset size: ~26 GB (8 parquet files)
