FARE Collection FARE are Salesforce AI Research's open multi-task evaluator models. • 4 items • Updated 28 days ago • 2
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains Paper • 2510.17793 • Published Oct 20 • 2
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math Paper • 2510.13744 • Published Oct 15 • 5
LiveResearchBench: A Live Benchmark for User-Centric Deep Research in the Wild Paper • 2510.14240 • Published Oct 16 • 11
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings Paper • 2503.15620 • Published Mar 19