arXiv:2511.00839

CodeClash: Benchmarking Goal-Oriented Software Engineering

Published on Nov 2
· Submitted by John on Nov 5
AI-generated summary

CodeClash evaluates language models' ability to iteratively develop code for open-ended objectives through competitive multi-round tournaments.

Abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.

Community

Paper author · Paper submitter

Code evaluations for LMs (e.g. HumanEval, SWE-bench) are heavily task-oriented - "implement a function", "fix a bug", "write a test".

We tell LMs exactly what we want them to do, and grade them on unit test correctness.

But this type of framing overlooks a significant aspect of software engineering. When we write code and build impressive software systems, we are driven by goals. High-level objectives (e.g., improve user retention, reduce costs, increase revenue) fundamentally motivate why we build in the first place.

What if we had a coding evaluation that captured this dynamism of real-world software development?

We're excited to share our attempt at benchmarking goal-driven software engineering - CodeClash!

In CodeClash, 2+ models compete to build the best codebase for achieving a high-level objective over the course of a multi-round tournament.


Each round has two phases:

  • Edit Phase: Models are allowed to change their codebases however they'd like.
  • Competition Phase: Their codebases compete head-to-head in a code arena.

The LM that wins the majority of rounds is declared the winner.
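To make that loop concrete, here is a minimal Python sketch of how a tournament of this shape could be orchestrated. Everything in it (`run_tournament`, the `agents` callables, the `arena` callable) is a hypothetical placeholder for illustration, not the actual CodeClash harness API.

```python
from collections import Counter

def run_tournament(agents, codebases, arena, num_rounds):
    """Hypothetical sketch of a CodeClash-style tournament loop.

    `agents` maps a model name to a callable that edits that model's codebase
    in place; `arena` is a callable that pits the codebases against each other
    and returns the name of the round winner. Neither reflects the real harness.
    """
    round_wins = Counter()

    for round_idx in range(num_rounds):
        # Edit phase: each model may change its codebase however it likes
        # (write notes, read docs, analyze logs from earlier rounds, add tests).
        for name, agent in agents.items():
            agent(codebases[name], round_idx)

        # Competition phase: the codebases compete head-to-head in the arena,
        # which decides the round based on its objective (score, resources, survival).
        winner = arena({name: codebases[name] for name in agents})
        round_wins[winner] += 1

    # The model that wins the most rounds is declared the tournament winner.
    tournament_winner, _ = round_wins.most_common(1)[0]
    return tournament_winner, dict(round_wins)
```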

Thanks for reading! If this is exciting, please check out the paper and our website for all the details!

https://codeclash.ai/

