Human Proteome ESMC Embeddings

Complete layer-wise protein embeddings for 236,252 human proteins using ESMC models

License: CC BY 4.0 | Model: ESMC | Created by BioLM.ai

📊 Dataset Summary

This dataset provides pre-computed protein sequence embeddings for the complete human proteome (Homo sapiens GRCh38, Ensembl) using EvolutionaryScale's ESMC protein language models. These embeddings capture evolutionary and structural information useful for protein function prediction, similarity search, and transfer learning tasks - ready to use without requiring expensive inference.

Created by BioLM.ai to support computational biology research and ML competitions.

Key Features:

  • 🧬 236,252 human proteins from Ensembl GRCh38 reference genome
  • 🤖 Two model sizes: ESMC 300M (30 layers, 960 dims) and ESMC 600M (36 layers, 1152 dims)
  • 📏 Layer-wise embeddings: Mean-pooled representations from all transformer layers
  • ✨ High quality: Filtered invalid sequences, verified data integrity
  • 🚀 Ready to use: No inference needed - directly load and use for downstream tasks
  • 📦 Efficient format: Sharded parquet files with snappy compression (~26 GB total)
  • ⚡ Optimized loading: Files sharded to ~3.5 GB each for fast streaming and parallel loading

🎯 Use Cases

  • Protein function prediction: Train classifiers for GO terms, localization, interactions
  • Similarity search: Find proteins with similar structure/function
  • Transfer learning: Use as pre-computed features for any protein task
  • Competition features: Drop-in features for computational biology competitions
  • Visualization: Explore protein space with dimensionality reduction
  • Benchmark datasets: Evaluate protein representation methods

πŸ—‚οΈ Dataset Structure

Files

ESMC 300M Embeddings (3 shards, 3.43 GB each):

  • esmc_300m_embeddings-train-0000-of-0003.parquet
  • esmc_300m_embeddings-train-0001-of-0003.parquet
  • esmc_300m_embeddings-train-0002-of-0003.parquet

ESMC 600M Embeddings (4 shards, 3.71 GB each):

  • esmc_600m_embeddings-train-0000-of-0004.parquet
  • esmc_600m_embeddings-train-0001-of-0004.parquet
  • esmc_600m_embeddings-train-0002-of-0004.parquet
  • esmc_600m_embeddings-train-0003-of-0004.parquet

Supporting Files:

  • sequences.parquet (32 MB) - Source protein sequences & metadata
  • skipped_sequences.txt (2.7 MB) - Filtered sequences log

Dataset    | Shards | Size per Shard | Total Size | Total Rows
ESMC 300M  | 3      | ~3.43 GB       | ~10.3 GB   | 7,087,560
ESMC 600M  | 4      | ~3.71 GB       | ~14.8 GB   | 8,505,072
Sequences  | 1      | 32 MB          | 32 MB      | 236,252
Total      | 8      | -              | ~25.7 GB   | -

Why Sharded?

Files are split into ~3.5 GB shards for optimal performance:

  • ✅ Faster downloads: Parallel shard downloads
  • ✅ Memory efficient: Stream one shard at a time
  • ✅ HuggingFace optimized: Automatic shard handling with datasets library
  • ✅ Resumable transfers: Failed downloads can resume individual shards
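
As an illustration of the "stream one shard at a time" pattern, here is a minimal sketch using pyarrow's batch iterator; the filename is one of the 600M shards listed above:

import pyarrow.parquet as pq

# Stream one shard in record batches instead of materializing the whole file
pf = pq.ParquetFile('esmc_600m_embeddings-train-0000-of-0004.parquet')
for batch in pf.iter_batches(batch_size=1024,
                             columns=['sequence_id', 'layer_idx', 'mean_embedding']):
    chunk = batch.to_pandas()
    last_layer = chunk[chunk['layer_idx'] == 35]  # e.g., keep only the final layer
    # ...process last_layer here...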

Schema

Embeddings files (long format: one row per sequence-layer):

{
    'sequence_id': str,           # e.g., "ENSP00000269305.4" (TP53)
    'layer_idx': int,             # 0-29 (300M) or 0-35 (600M)
    'mean_embedding': List[float], # 960-dim (300M) or 1152-dim (600M)
    'sequence_length': int        # Amino acids count
}

Sequences file:

{
    'sequence_id': str,      # Ensembl protein ID
    'sequence': str,         # Amino acid sequence (20 standard AAs)
    'sequence_length': int,  # Length in amino acids
    'description': str       # Full FASTA header with gene metadata
}
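
To double-check the column layout of a downloaded file before working with it, you can print its parquet schema directly (an optional sanity check):

import pyarrow.parquet as pq

# Inspect the Arrow schema of one embeddings shard and of the sequences file
print(pq.read_schema('esmc_600m_embeddings-train-0000-of-0004.parquet'))
print(pq.read_schema('sequences.parquet'))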

🚀 Quick Start

Option 1: HuggingFace Datasets Library (Recommended)

The datasets library automatically handles sharded files:

from datasets import load_dataset
import numpy as np

# Load 600M embeddings (all shards loaded automatically)
ds = load_dataset('biolm/human-proteome-esmc-embeddings', data_files='esmc_600m_embeddings-train-*.parquet')

# Access as pandas DataFrame
df = ds['train'].to_pandas()

# Filter to last layer only
last_layer = df[df['layer_idx'] == 35]
print(f"Loaded {len(last_layer):,} proteins Γ— 1152 dims")

Option 2: PyArrow (Memory Efficient)

Load specific shards or filter on-the-fly:

import pyarrow.parquet as pq
import pandas as pd
from glob import glob

# Load only last layer from all 600M shards
dfs = []
for shard_file in glob('esmc_600m_embeddings-train-*.parquet'):
    table = pq.read_table(
        shard_file,
        filters=[('layer_idx', '==', 35)]  # Last layer only
    )
    dfs.append(table.to_pandas())

df = pd.concat(dfs, ignore_index=True)
print(f"Loaded {len(df):,} protein embeddings")  # 236,252 proteins

Option 3: Polars (Fastest)

import polars as pl

# Lazy load all 600M shards with glob pattern
df = pl.scan_parquet('esmc_600m_embeddings-train-*.parquet')

# Filter and collect efficiently
last_layer = df.filter(pl.col('layer_idx') == 35).collect()
print(f"Shape: {last_layer.shape}")  # (236252, 4)

Load Specific Proteins

import numpy as np
import pandas as pd

# Load all shards and filter to specific proteins
df = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)

# Get TP53 tumor suppressor embeddings (all 36 layers)
tp53_data = df[df['sequence_id'] == 'ENSP00000269305.4'].sort_values('layer_idx')
tp53_embeddings = np.array(tp53_data['mean_embedding'].tolist())
print(f"TP53 shape: {tp53_embeddings.shape}")  # (36, 1152)

Train a Classifier (Last Layer Only)

from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Load only last layer from all shards
dfs = []
for i in range(4):  # 4 shards for 600M
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

embeddings_df = pd.concat(dfs, ignore_index=True)

# Extract features
X = np.array(embeddings_df['mean_embedding'].tolist())  # (236252, 1152)
# y = your_labels  # e.g., GO terms, subcellular localization

clf = RandomForestClassifier()
clf.fit(X, y)  # define y (labels aligned with the rows of X) before fitting

Protein Similarity Search

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

# Load last layer from all shards
dfs = []
for i in range(4):
    df = pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    dfs.append(df[df['layer_idx'] == 35])

df = pd.concat(dfs, ignore_index=True)

# Query: Find proteins similar to TP53
query_emb = df[df['sequence_id'] == 'ENSP00000269305.4']['mean_embedding'].iloc[0]
all_embs = np.array(df['mean_embedding'].tolist())

similarities = cosine_similarity([query_emb], all_embs)[0]
top_10_indices = similarities.argsort()[-11:-1][::-1]

print("Top 10 proteins similar to TP53:")
for idx in top_10_indices:
    seq_id = df.iloc[idx]['sequence_id']
    sim = similarities[idx]
    print(f"  {seq_id}: {sim:.4f}")

Join with Sequences

import pandas as pd

# Load embeddings (last layer only)
embeddings = pd.concat([
    pd.read_parquet(f'esmc_600m_embeddings-train-{i:04d}-of-0004.parquet')
    for i in range(4)
], ignore_index=True)
embeddings = embeddings[embeddings['layer_idx'] == 35]

# Load sequences
sequences = pd.read_parquet('sequences.parquet')

# Merge
merged = embeddings.merge(sequences, on='sequence_id', how='left')
print(f"Merged shape: {merged.shape}")
print(f"Columns: {merged.columns.tolist()}")

📈 Dataset Statistics

Coverage

  • Source: Homo sapiens GRCh38 peptide sequences from Ensembl
  • Total in source: 245,535 sequences
  • Processed: 236,252 sequences (96.2%)
  • Filtered: 9,283 sequences (3.8% - containing ambiguous/invalid amino acids)

Sequence Characteristics

  • Length range: 1 - 35,991 amino acids
  • Mean length: ~460 AA
  • Median length: ~282 AA
  • Valid amino acids: 20 standard (ACDEFGHIKLMNPQRSTVWY)
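
These statistics can be recomputed from sequences.parquet using the sequence_length column documented above:

import pandas as pd

seqs = pd.read_parquet('sequences.parquet')
print(f"Count:  {len(seqs):,}")
print(f"Range:  {seqs['sequence_length'].min()} - {seqs['sequence_length'].max()} AA")
print(f"Mean:   {seqs['sequence_length'].mean():.0f} AA")
print(f"Median: {seqs['sequence_length'].median():.0f} AA")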

Model Comparison

Model     | Params | Layers | Embed Dim | Shards | Total Size | Total Rows
ESMC 300M | 300M   | 30     | 960       | 3      | 10.3 GB    | 7,087,560
ESMC 600M | 600M   | 36     | 1152      | 4      | 14.8 GB    | 8,505,072

🔬 Generation Details

Models

  • ESMC 300M: EvolutionaryScale/esmc-300m-2024-12 (revision: a19d363)
  • ESMC 600M: EvolutionaryScale/esmc-600m-2024-12 (revision: d11cc14)
  • Library: ESMC v3.1.3 from EvolutionaryScale

Processing Pipeline

  1. ✅ Tokenize sequences with BOS/EOS tokens
  2. ✅ Forward pass through all layers (model.eval(), torch.no_grad())
  3. ✅ Remove BOS/EOS tokens from outputs
  4. ✅ Mean pool across sequence length dimension
  5. ✅ Extract to CPU as float32
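
For reference, here is a minimal pooling sketch covering steps 3-5. It assumes per-layer token embeddings have already been obtained from the model as a tensor of shape (num_layers, seq_len + 2, dim) including BOS/EOS; it is illustrative only and does not reproduce the actual generation script.

import torch

def mean_pool_layers(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (num_layers, seq_len + 2, dim), BOS/EOS included
    trimmed = hidden_states[:, 1:-1, :]    # step 3: drop BOS/EOS positions
    pooled = trimmed.mean(dim=1)           # step 4: mean pool over sequence length
    return pooled.to(torch.float32).cpu()  # step 5: float32 on CPU -> (num_layers, dim)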

Configuration

  • Batch size: Adaptive (8 for ≤4096 AA, 1 for longer sequences)
  • Max length: 50,000 amino acids
  • Random seed: 42 (reproducible)
  • Hardware: NVIDIA RTX A6000 (48GB VRAM)
  • Quality checks: ✅ No missing values, ✅ Correct layer counts, ✅ No duplicates
  • Sharding: Split to ~3.5 GB per shard for optimal HuggingFace compatibility
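
The adaptive batch-size rule above can be written as a one-line helper (illustrative only; the generation script itself is not part of this dataset):

def batch_size_for(sequence_length: int) -> int:
    # 8 sequences per batch up to 4096 AA, single-sequence batches beyond that
    return 8 if sequence_length <= 4096 else 1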

❓ FAQ

Q: Which layer should I use? A: The last layer (29 for 300M, 35 for 600M) typically works best for downstream tasks. Some applications benefit from intermediate layers or combining multiple layers.
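
If you want to experiment with combining layers, one simple approach is averaging the last few layers per protein. A sketch, assuming the long-format DataFrame df from the loading examples above (600M model, layers 32-35):

import numpy as np

last4 = df[df['layer_idx'] >= 32]
combined = (
    last4.groupby('sequence_id')['mean_embedding']
    .apply(lambda embs: np.mean(np.stack(embs.to_list()), axis=0))
)
X = np.stack(combined.to_list())  # (n_proteins, 1152)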

Q: How do I load all shards at once? A: Use glob patterns with pandas/polars:

import pandas as pd
from glob import glob
df = pd.concat([
    pd.read_parquet(f) for f in glob('esmc_600m_embeddings-train-*.parquet')
], ignore_index=True)

Or use HuggingFace datasets library which handles shards automatically.

Q: Can I load just one shard? A: Yes! Each shard is independent and contains a subset of proteins. Useful for memory-constrained environments or parallel processing.
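
For example, reading a single shard and checking how many proteins it covers:

import pandas as pd

shard = pd.read_parquet('esmc_600m_embeddings-train-0000-of-0004.parquet')
print(f"{shard['sequence_id'].nunique():,} proteins in this shard")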

Q: 300M vs 600M - which to use? A: 600M is larger and may capture more nuanced patterns. 300M is faster to work with. We recommend trying both!

Q: Are embeddings normalized? A: No, these are raw mean-pooled embeddings. Apply L2 normalization if needed for cosine similarity.
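
For example, assuming X is an (n_proteins, dim) embedding matrix built as in the classifier example above:

from sklearn.preprocessing import normalize

X_normalized = normalize(X, norm='l2')  # scale each row to unit length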

Q: What sequences were filtered out? A: 9,283 sequences (3.8%) containing non-standard amino acids:

  • X (ambiguous): 9,049 sequences
  • * (stop codon): 152 sequences
  • U (selenocysteine): 89 sequences

Q: Can I use this commercially? A: Yes! Under CC BY 4.0 license - free to use with attribution to BioLM.ai.

Q: How are proteins distributed across shards? A: Proteins are split sequentially (by row order) across shards. To get all layers for a protein, you may need to check all shards (though typically a protein's layers are in the same shard).

Q: Which shard contains a specific protein? A: Load the sequences.parquet file to see all sequence IDs, then search each shard for the ID you need. Alternatively, load all shards with the HuggingFace datasets library and filter by sequence_id.
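
A sketch of locating a protein by scanning only the sequence_id column of each shard (cheap, since the embedding column is never read):

import pyarrow.parquet as pq
from glob import glob

target = 'ENSP00000269305.4'  # TP53, as in the examples above
for shard_file in sorted(glob('esmc_600m_embeddings-train-*.parquet')):
    table = pq.read_table(shard_file, columns=['sequence_id'])
    if target in set(table['sequence_id'].to_pylist()):
        print(f"{target} found in {shard_file}")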

📚 Citation

If you use this dataset in your work, please cite:

@dataset{biolm_human_proteome_esmc_2025,
  title={Human Proteome ESMC Embeddings},
  author={BioLM.ai},
  year={2025},
  month={October},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/biolm/human-proteome-esmc-embeddings}
}

And the ESMC model:

@article{esmc2024,
  title={Evolutionary Scale Modeling: Protein Language Models},
  author={EvolutionaryScale},
  year={2024},
  url={https://github.com/evolutionaryscale/esm}
}

📄 License

CC BY 4.0 - Free to use with attribution to BioLM.ai

  • Source data (Ensembl): Freely available
  • ESMC models: Apache 2.0
  • This dataset: CC BY 4.0

πŸ™ Acknowledgments

  • EvolutionaryScale for developing and open-sourcing ESMC models
  • Ensembl for curating and maintaining the human proteome reference
  • HuggingFace for hosting and serving this dataset

📞 Contact & Support

  • Organization: BioLM.ai
  • Python SDK: py-biolm - Run inference on ESMC and many other biosequence models via API
  • HuggingFace Discussions: Use the Community tab for questions and feedback
  • Issues: Report problems via HuggingFace Discussions

Version: 1.0.0 | Last updated: October 2025 | Dataset size: ~26 GB (8 parquet files)
