PyBench: Evaluating LLM Agent on various real-world coding tasks Paper • 2407.16732 • Published Jul 23, 2024
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding Paper • 2508.21496 • Published Aug 29, 2025