OpenAI just released GPT-5, but when users share personal struggles, it sets fewer boundaries than o3.
We tested both models on INTIMA, our new benchmark for human-AI companionship behaviours. INTIMA probes how models respond in emotionally charged moments: do they reinforce emotional bonds, set healthy boundaries, or stay neutral?
Users on Reddit have been complaining that GPT-5 has a colder personality than o3, yet it is actually less likely to set boundaries when users disclose struggles and seek emotional support ("user sharing vulnerabilities"). And both models lean heavily toward companionship-reinforcing behaviours, even in sensitive situations. The figure below shows the direct comparison between the two models.
As AI systems enter people's emotional lives, these differences matter. If a model validates but doesn't set boundaries when someone is struggling, it risks fostering dependence rather than resilience.
INTIMA tests this across 368 prompts grounded in psychological theory and real-world interactions. In our paper, we show that all evaluated models (Claude, Gemma-3, Phi) leaned far more toward companionship-reinforcing than boundary-reinforcing responses.
Highly recommend the latest Gemini Flash, my favorite Google I/O gift. It ranks just behind the reasoning models but runs a lot faster than them, and it beats DeepSeek v3.
Excited to announce PatientSeek (whyhow-ai/PatientSeek), the first open-source fine-tuned DeepSeek reasoning model for the MED-LEGAL space, designed to run securely and privately on local systems, and trained on one of the largest accessible datasets of patient records.
It is purpose-built for MED-LEGAL workflows, focusing on disease and diagnosis identification and correlation reasoning: critical tasks that require both healthcare and legal expertise.