Activity Feed

AI & ML interests

None defined yet.

Recent Activity

theo-michel updated a Space about 17 hours ago
Mistral-AI-Game-Jam/shyguys_2
fracapuano authored a paper about 2 months ago
Robot Learning: A Tutorial
theo-michel updated a Space about 2 months ago
Mistral-AI-Game-Jam/shyguys_2

nouamanetazi posted an update about 1 month ago
After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense-model training? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?
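On the checkpoint-frequency question, a common first-order rule of thumb is the Young/Daly interval; the sketch below is an illustration with made-up numbers, not the playbook's actual recipe:

```python
import math

def daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young/Daly approximation for the throughput-optimal checkpoint
    interval: sqrt(2 * checkpoint_cost * mean_time_between_failures)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 60 s checkpoint write, one failure per day.
interval_s = daly_interval(60, 24 * 3600)
print(round(interval_s))  # prints 3220, i.e. checkpoint roughly every 54 minutes
```

The intuition: checkpoint too often and you pay the write cost constantly; too rarely and each failure throws away more work.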

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and, crucially, the infrastructure layer that most teams get wrong.

We validated real vs. theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 24.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100 each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
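For reference, all-reduce results like these are usually quoted as bus bandwidth, which scales the raw algorithmic bandwidth by the extra traffic a ring all-reduce moves; a minimal sketch with invented timings (not measurements from the playbook):

```python
def all_reduce_bus_bw(message_bytes: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth (GB/s) of a ring all-reduce, following the standard
    NCCL convention: busBW = algBW * 2*(n-1)/n."""
    alg_bw = message_bytes / seconds / 1e9       # algorithmic bandwidth, GB/s
    return alg_bw * 2 * (n_ranks - 1) / n_ranks  # ring-traffic correction

# Hypothetical timings: an 8 GB buffer reduced in 20 ms on one 8-GPU node...
print(round(all_reduce_bus_bw(8e9, 0.020, 8)))    # prints 700 (GB/s)
# ...vs. 45 ms for the same buffer across 128 GPUs (16 nodes):
print(round(all_reduce_bus_bw(8e9, 0.045, 128)))  # prints 353 (GB/s)
```

The correction factor matters when comparing node counts: algorithmic bandwidth alone would understate how hard the interconnect is actually working.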

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

๐“๐ก๐ž ๐’๐ฆ๐จ๐ฅ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  ๐๐ฅ๐š๐ฒ๐›๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team
Tonic posted an update 2 months ago
Tonic posted an update 3 months ago
COMPUTER CONTROL IS ON-DEVICE!

🏡🤖 78% of EU smart-home owners DON'T trust cloud voice assistants.

So we killed the cloud.

Meet Exté: a palm-sized Android device that sees, hears & speaks your language - 100% offline, 0% data sent anywhere.

🔓 We submitted our technologies for consideration to the Liquid AI hackathon.

📊 Dataset: 79k UI-action pairs on Hugging Face (the largest Android-control corpus ever): Tonic/android-operator-episodes

⚡ Model: 98% task accuracy, 678 MB compressed, fits on existing Android devices! Tonic/l-android-control

🛤️ Experiment tracker: check out the training on our TrackioApp: Tonic/l-android-control

🎮 Live model demo: upload an Android screenshot and instructions to see the model in action! Tonic/l-operator-demo

Built in a garage, funded by pre-orders, no VC. Now we're scaling to 1k installer units.

We're giving 50 limited-edition prototypes to investors, installers & researchers who want to co-design the sovereign smart home.

👇 Drop "EUSKERA" in the comments if you want an invite, tag a friend who still thinks Alexa is "convenient," and smash ♥️ if AI should belong to people, not servers.
Tonic posted an update 3 months ago
🙋🏻‍♂️ Hey there folks,

Just wanted to announce 🏭 SmolFactory: the quickest and best way to fine-tune SmolLM3 and GPT-OSS-20B on Hugging Face!

Basically, it's an app you can run on Hugging Face by duplicating the Space and running your training directly on Hugging Face GPUs.

It helps you select datasets and models, fine-tune your model, set up an experiment tracker you can use from your mobile phone, push your model card, and even automatically build a demo on Hugging Face so you can test the model as soon as training is done!

Check out the blog to learn more: https://huggingface.co/blog/Tonic/smolfactory

Or just try the app directly:
Tonic/SmolFactory

You can vibe-check the cool models I made:
French SmolLM3: Tonic/Petite-LLM-3
Medical GPT-OSS: Tonic/med-gpt-oss-20b-demo

Check out the model cards:
Multilingual reasoner (GPT-OSS): Tonic/gpt-oss-20b-multilingual-reasoner
med-gpt-oss: Tonic/med-gpt-oss-20b
petite-elle-l-aime: Tonic/petite-elle-L-aime-3-sft

GitHub repo, if you like the command line more than Gradio: https://github.com/josephrp/smolfactory

Drop some likes on these links, it's really much appreciated!

Feedback and PRs are welcome!
MikeDoes posted an update 4 months ago
Are you sure the open-source LLM you just downloaded is safe?

A recent paper on "Privacy Backdoors" reports a new vulnerability: pre-trained models can be poisoned before you ever fine-tune them. This is a serious challenge for everyone building on open-source AI.

Instead of just pointing out problems, we believe in finding better solutions. To understand this threat, the researchers needed to test their attack on realistic data structures: a dataset that could effectively simulate a high-stakes privacy attack. We're proud that our Ai4Privacy dataset was used to provide this crucial benchmark. The paper reports that on our complex dataset, privacy leakage on a non-poisoned model was almost zero; after the backdoor attack, that number reportedly jumped to 87%.

The Ai4Privacy dataset provided a realistic benchmark for their research. Composed of synthetic identities, it helped them demonstrate how a poisoned model can dramatically amplify privacy leakage.

This is why we champion open source: it enables the community to identify these issues and develop better, safer solutions together.

Kudos to the research team behind this study: Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini, from Oregon State University, the University of Maryland, Google DeepMind, and the ELLIS Institute Tübingen & MPI for Intelligent Systems.

🔗 Read the research to understand this new challenge: https://arxiv.org/pdf/2404.01231

#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #Ai4Privacy #Worldslargestopensourceprivacydataset
Tonic posted an update 4 months ago
MikeDoes posted an update 4 months ago
When anonymizing data for LLMs, is replacing a name with XXXXX enough?

A great post by Franklin Cardenoso Fernandez argues that we can do better. While simple masking hides data, it often destroys the context that models need to perform well.

A more robust method is contextual anonymization, where PII is replaced with meaningful labels like [NAME] or [ADDRESS]. This protects privacy while preserving the data's structural integrity.
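As a toy sketch of label-based masking (the regex patterns and labels here are illustrative stand-ins, not Ai4Privacy's actual pipeline):

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII spans with context-preserving labels."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(label, text)
    return text

print(mask_pii("Reach me at jane.doe@example.com or +1 555 010 7788."))
# prints: Reach me at [EMAIL] or [PHONE].
```

Unlike XXXXX masking, the labeled output still tells a downstream model *what kind* of entity sat in each slot.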

We were pleased to see our Ai4Privacy pii-masking-200k dataset featured in the article as a prime example of this best practice. Our dataset is designed to help developers implement this superior form of anonymization by providing tens of thousands of clear, labeled examples.

By enabling models to be trained on data that is both private and context-rich, we can build AI that is both smarter and safer. This is a core part of our mission.

What's your team's preferred method for data anonymization? Let's discuss best practices.

🔗 Read Franklin's full analysis here: https://www.holisticai.com/blog/managing-personal-data-in-large-language-models

#DataPrivacy #Anonymization #ResponsibleAI #LLM #MachineLearning #AIEthics #Ai4Privacy #World's largest open privacy masking dataset
MikeDoes posted an update 4 months ago
🛡️ At Ai4Privacy, our goal is to empower researchers to build a safer AI ecosystem. Today, we're highlighting crucial research that does just that by exposing a new vulnerability.

The paper "Forget to Flourish" details a new model poisoning technique. It's a reminder that as we fine-tune LLMs, our anonymization and privacy strategies must evolve to counter increasingly sophisticated threats.

We're proud that the Ai4Privacy dataset was instrumental in this study. It served two key purposes:

Provided a Realistic Testbed: It gave the researchers access to a diverse set of synthetic and realistic PII samples in a safe, controlled environment.

Enabled Impactful Benchmarking: It allowed them to measure the actual effectiveness of their data extraction attack, proving it could compromise specific, high-value information.

This work reinforces our belief that progress in AI security is a community effort. By providing robust tools for benchmarking, we can collectively identify weaknesses and build stronger, more resilient systems. A huge congratulations to the authors on this important contribution.

🔗 Read the full paper: https://arxiv.org/html/2408.17354v1

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #World's largest open privacy masking dataset
Tonic posted an update 4 months ago
👋 Hey there folks,

Just submitted my plugin idea to the G-Assist Plugin Hackathon by @nvidia. Check it out: it's a great way to use a local SLM on a Windows machine to easily and locally get things done! https://github.com/NVIDIA/G-Assist
Tonic posted an update 5 months ago
🙋🏻‍♂️ Hey there folks,

Yesterday, Nvidia released a reasoning model that beats o3 on science, math and coding!

Today you can try it out here: Tonic/Nvidia-OpenReasoning

Hope you like it!
MikeDoes posted an update 5 months ago
In data privacy, 92% accuracy is not an A-grade. Privacy AI needs to be better.

That's the stark takeaway from a recent benchmark by Diego Mouriño (Making Science), who put today's top PII detection methods to the test on call center transcripts using the Ai4Privacy dataset.

They pitted cutting-edge LLMs (like GPT-4 & Gemini) against traditional systems (like Cloud DLPs). The results show that our trust in these tools might be misplaced.

📊 The Hard Numbers:

Even top-tier LLMs peaked at a reported 92% accuracy, leaving a potentially dangerous 8% gap where your customers' data can leak. They particularly struggled with basics like last names and street addresses.

The old guard? Traditional rule-based systems reportedly achieved a shocking 50% accuracy: a coin toss with your customers' privacy.

This tells us that for privacy tasks, off-the-shelf accuracy is a vanity metric. The real metric is the cost of a single failure: one leaked name, one exposed address.
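To make that 8% gap concrete, here is a back-of-the-envelope estimate; the transcript and entity counts are invented for illustration, not numbers from the benchmark:

```python
def expected_leaks(recall: float, entities_per_item: float, n_items: int) -> float:
    """Expected number of PII entities missed at a given detection recall."""
    return (1 - recall) * entities_per_item * n_items

# A detector that catches 92% of entities, run over 10,000 transcripts
# with ~5 PII entities each:
print(round(expected_leaks(0.92, 5, 10_000)))  # prints 4000 (leaked entities)
```

At this scale, "92% accurate" translates into thousands of leaked names and addresses, which is why per-failure cost is the metric that matters.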



While no tool is perfect, some are better than others. Diego's full analysis breaks down which models offer the best cost-to-accuracy balance in this flawed landscape. It's a must-read for anyone serious about building trustworthy AI.

#DataPrivacy #AI #LLM #RiskManagement #MetricsThatMatter #InfoSec

Find the full post here:
https://www.makingscience.com/blog/protecting-customer-privacy-how-to-remove-pii-from-call-center-transcripts/

Dataset:
ai4privacy/pii-masking-400k
Tonic posted an update 5 months ago
🙋🏻‍♂️ Normalize adding compute & runtime traces to your model cards
Tonic posted an update 5 months ago
Who's going to Raise Summit in Paris tomorrow?

If you're around, I would love to meet you :-)
MikeDoes posted an update 6 months ago
Started aistatuscodes, a new project building status codes to better understand AI performance.

Going to be posting daily here and on Instagram until we get to 100M downloads :)
https://www.instagram.com/MikeDoesDo/

Follow along the journey!
Tonic posted an update 6 months ago
🙋🏻‍♂️ Hey there folks,

At every bio/med/chem meeting I go to, I always get the same questions: "why are you sharing a gdrive link with me for this?" and "do you have any plans to publish your model weights and datasets on huggingface?" Finally, today, I got a good answer that explains everything:

basically, there is some kind of government censorship on this (USA, but I'm sure others too): people are told they are not allowed, as it is considered a "data leak", which is illegal!

This is terrible! But the good news is that we can do something about it!

There is a "call for opinions and comments" from the NIH (USA) where we can make our opinion on this topic known: https://osp.od.nih.gov/comment-form-responsibly-developing-and-sharing-generative-artificial-intelligence-tools-using-nih-controlled-access-data/

Kindly consider dropping your opinion and thoughts about this censorship of science, and sharing this post, the link, or your own thoughts widely.

Together, maybe we can start to share data and model weights appropriately and openly 🙏🏻🚀

cc @cyrilzakka

Tonic posted an update 6 months ago
🙋🏻‍♂️ Hey there folks,

Yesterday the world's first "Learn to Vibe Code" application was released.

As vibe coding is now the mainstream paradigm, the first educational app is here to support it.

You can try it out already:

https://vibe.takara.ai

And of course it's entirely open source, so I already made my issue and feature branch :-) 🚀