BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscienceW

bigscience-workshop

AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

RacBan authored a paper 10 days ago

The Art of Scaling Reinforcement Learning Compute for LLMs

vumichien authored a paper 12 days ago

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

vumichien authored a paper 12 days ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

pminervini

authored 9 papers 10 days ago

An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

Paper • 2503.23415 • Published Mar 30 • 1

MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction

Paper • 2204.04779 • Published Apr 10, 2022

PiCSAR: Probabilistic Confidence Selection And Ranking

Paper • 2508.21787 • Published Aug 29 • 4

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

Paper • 2509.21552 • Published about 1 month ago • 11

giadap

posted an update 17 days ago

Post

4353

🌎 AI ethics and sustainability are two sides of the same coin.

In our new blog post with Dr. Sasha Luccioni, we argue that separating them (as is too often the case) means missing the bigger picture of how AI systems impact both people and the planet.

Ethical and sustainable AI development can’t be pursued in isolation. The same choices that affect who benefits or is harmed by AI systems also determine how much energy and resources they consume.

We explore how two key concepts, evaluation and transparency, can serve as bridges between these domains:

📊 Evaluation, by moving beyond accuracy or performance metrics to include environmental and social costs, as we’ve done with tools like the AI Energy Score.

🔍 Transparency, by enabling reproducibility, accountability, and environmental reporting through open tools like the Environmental Transparency Space.

AI systems mirror our priorities. If we separate ethics from sustainability, we risk building technologies that are efficient but unjust, or fair but unsustainable.

Read our blog post here: https://huggingface.co/blog/sasha/ethics-sustainability

AIEnergyScore/Leaderboard
sasha/environmental-transparency

1 reply

giadap

posted an update 27 days ago

Post

10821

One of the hardest challenges in AI safety is finding the right balance: how do we protect people from harm without undermining their agency? This tension is especially visible in conversational systems, where safeguards can sometimes feel more paternalistic than supportive.

In my latest piece for Hugging Face, I argue that open source and community-driven approaches offer a promising (though not exclusive) way forward.

✨ Transparency can make safety mechanisms into learning opportunities.
✨ Collaboration with diverse communities makes safeguards more relevant across contexts.
✨ Iteration in the open lets protections evolve rather than freeze into rigid, one-size-fits-all rules.

Of course, this isn’t a silver bullet. Top-down safety measures will still be necessary in some cases. But if we only rely on corporate control, we risk building systems that are safe at the expense of trust and autonomy.

Read the blog post here: https://huggingface.co/blog/giadap/preserving-agency

7 replies

monsoon-nlp

posted an update 30 days ago

Post

433

Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬
monsoon-nlp/tomatotomato-gLM2-150M-v0.1

HugoLaurencon

authored a paper about 1 month ago

ARE: Scaling Up Agent Environments and Evaluations

Paper • 2509.17158 • Published Sep 21 • 34

meg

posted an update about 1 month ago

Post

2842

🤖 As AI-generated content is shared in movies, TV, and across the web, there's one piece of low-hanging fruit 🍇 to help people know what's real: visible watermarks. With the Gradio team, I've made sure it's trivially easy to add this disclosure to images, video, and chatbot text. See how: https://huggingface.co/blog/watermarking-with-gradio
Thanks to the code collab in particular from @abidlabs and Yuvraj Sharma.

jaketae

authored 4 papers about 2 months ago

What Language Model to Train if You Have One Million GPU Hours?

Paper • 2210.15424 • Published Oct 27, 2022 • 2

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper • 2211.05100 • Published Nov 9, 2022 • 34

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies

Paper • 2305.12586 • Published May 21, 2023

TESS 2: A Large-Scale Generalist Diffusion Language Model

Paper • 2502.13917 • Published Feb 19 • 6

davanstrien

posted an update about 2 months ago

Post

949

I fine-tuned a smol VLM to generate specialized art history metadata!

https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

giadap

posted an update about 2 months ago

Post

411

I've noticed something. While we're careful about what we post on social media, we're sharing our deepest and most intimate thoughts with AI chatbots -- health concerns, financial worries, relationship issues, business ideas...

With OpenAI hinting at ChatGPT advertising, this matters more than ever. Unlike banner ads, AI advertising happens within the conversation itself. Sponsors could subtly influence that relationship advice or financial guidance.

The good news? We have options.
🤝 Open source AI models let us keep conversations private, avoid surveillance-based business models, and build systems that actually serve users first.

Read more about it in our latest blog post, co-written with @frimelle: https://huggingface.co/blog/giadap/privacy-conversational-ai

Team members 328
