Common Crawl Foundation

Enterprise

non-profit

Verified

https://commoncrawl.org

commoncrawl

AI & ML interests

Crawled data and metadata

Recent Activity

malteos

updated a dataset 3 days ago

commoncrawl/statistics

Viewer • Updated 3 days ago • 600k • 375 • 25

greglindahl

updated a dataset 7 days ago

commoncrawl/web-graph-testing-v1

Updated 7 days ago • 6

greglindahl

published a dataset 7 days ago

commoncrawl/web-graph-testing-v1

Updated 7 days ago • 6

handecelikkanat

updated a dataset about 1 month ago

commoncrawl/citations

Viewer • Updated Oct 16 • 9.18k • 109 • 1

laurievb

updated a dataset 2 months ago

commoncrawl/statistics

Viewer • Updated 3 days ago • 600k • 375 • 25

tvaughan

updated a dataset 3 months ago

commoncrawl/statistics

Viewer • Updated 3 days ago • 600k • 375 • 25

laurievb

authored 3 papers 8 months ago

An Open Dataset and Model for Language Identification

Paper • 2305.13820 • Published May 23, 2023

The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)

Paper • 2210.11309 • Published Oct 20, 2022

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Paper • 2503.10267 • Published Mar 13 • 2

malteos

authored 10 papers 9 months ago

Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings

Paper • 2202.06671 • Published Feb 14, 2022 • 2

Specialized Document Embeddings for Aspect-based Similarity of Research Papers

Paper • 2203.14541 • Published Mar 28, 2022

Investigating Gender Bias in Turkish Language Models

Paper • 2404.11726 • Published Apr 17, 2024 • 1

Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning

Paper • 2301.09626 • Published Jan 23, 2023 • 2

Progress Report: Towards European LLMs

Paper • 2410.03730 • Published Sep 30, 2024 • 3

Data Processing for the OpenGPT-X Model Family

Paper • 2410.08800 • Published Oct 11, 2024 • 1

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19 • 42

greglindahl

authored a paper 10 months ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published Jan 14 • 63

Team members 13
