When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity Paper • 2509.20293 • Published Sep 24 • 7
When Do Neural Nets Outperform Boosted Trees on Tabular Data? Paper • 2305.02997 • Published May 4, 2023
MARVIS: Modality Adaptive Reasoning over VISualizations Paper • 2507.01544 • Published Jul 2 • 13
LiveBench: A Challenging, Contamination-Free LLM Benchmark Paper • 2406.19314 • Published Jun 27, 2024 • 23
TuneTables: Context Optimization for Scalable Prior-Data Fitted Networks Paper • 2402.11137 • Published Feb 17, 2024
ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models Paper • 2310.18208 • Published Oct 27, 2023