Measuring what Matters: Construct Validity in Large Language Model Benchmarks • Paper • 2511.04703 • Published Nov 3
Training language models to be warm and empathetic makes them less reliable and more sycophantic • Paper • 2507.21919 • Published Jul 29
Clinical knowledge in LLMs does not translate to human interactions • Paper • 2504.18919 • Published Apr 26