David Egea

David-Egea

GitHub: https://github.com/David-Egea

AI & ML interests

NLP

Recent Activity

updated a dataset about 1 month ago

David-Egea/CWE-20-CFA

Updated Oct 11 • 44

published a dataset about 1 month ago

David-Egea/CWE-20-CFA

Updated Oct 11 • 44

liked 2 models about 1 year ago

de-Rodrigo/donut-merit

Image-Text-to-Text • 0.2B • Updated 21 days ago • 52 • 1

de-Rodrigo/idefics2-merit

Image-Text-to-Text • Updated Sep 16 • 1

upvoted a paper about 1 year ago

The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts

Paper • 2409.00447 • Published Aug 31, 2024 • 3

upvoted a collection about 1 year ago

PhD

Synthetic Multimodal-Datasets Generation • 4 items • Updated Sep 4, 2024 • 1

reacted to de-Rodrigo's post with 🔥 about 1 year ago

A few weeks ago, we uploaded the MERIT Dataset 🎒📃🏆 to Hugging Face 🤗!

Now, we are excited to share the MERIT Dataset paper via arXiv! 📃💫
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts (2409.00447)

The MERIT Dataset is a fully synthetic, labeled dataset created for training and benchmarking LLMs on Visually Rich Document Understanding tasks. It is also designed to help detect biases and improve interpretability in LLMs, an area we are actively working on. 🔧🔨

MERIT contains synthetically rendered student transcripts of records from different schools in English and Spanish. We plan to expand the dataset to other contexts (synthetic medical/insurance documents, synthetic IDs, etc.). Want to collaborate? Do you have any feedback? 🧐

Resources:

- Dataset: de-Rodrigo/merit
- Code and generation pipeline: https://github.com/nachoDRT/MERIT-Dataset

P.S.: We are grateful to Hugging Face 🤗 for the fantastic tools and resources available on the platform and, more specifically, to @nielsr for sharing the fine-tuning/inference scripts we used in our benchmark.

liked a dataset over 1 year ago

de-Rodrigo/merit

Viewer • Updated Jul 15 • 478k • 7.94k • 6

updated a model over 1 year ago

David-Egea/bert-small-phishing

Text Classification • 28.8M • Updated Apr 9, 2024 • 3 • 1

liked 2 models over 1 year ago

David-Egea/bert-small-phishing

Text Classification • 28.8M • Updated Apr 9, 2024 • 3 • 1

CICLAB-Comillas/BARTSumpson

Updated Jul 6, 2023 • 1

liked a dataset over 1 year ago

David-Egea/phishing-texts

Viewer • Updated Mar 28, 2024 • 20.3k • 31 • 2

updated a dataset over 1 year ago

David-Egea/phishing-texts

Viewer • Updated Mar 28, 2024 • 20.3k • 31 • 2

updated a dataset almost 2 years ago

David-Egea/Creditcard-fraud-detection

Viewer • Updated Feb 12, 2024 • 285k • 90 • 1

updated a dataset over 2 years ago

CICLAB-Comillas/calls_10k_v1

Updated Jun 29, 2023 • 31