New Java Benchmark for Coding LLMs puts GPT-5 at the top

  • August 18, 2025
  • 6 min read

Introducing the Brokk Power Ranking

The Brokk Power Ranking is a new open-source coding benchmark, featuring 93 tasks from large, real-world codebases. You can check out the current Power Ranking here.

Why a New Benchmark?

SWE-bench is the closest thing we have to a standard, objective benchmark for LLM coding performance, but it has a bunch of issues, the largest of which are that it’s Python-only and that it’s old enough that some labs are almost certainly now training to the test. (Epoch AI has a great writeup on the more subtle problems with SWE-bench if you want to go deeper.)

As Jack Morris put it:

But if the big labs have better benchmarks they haven’t released them, so it has fallen to small teams like Aider and now Brokk to step up and move the industry forward with independent evaluations.

Findings

GPT-5 is on top at every performance level and every price point

We’ve seen commentary elsewhere about GPT-5 underwhelming vs expectations and we couldn’t disagree more. OpenAI came out hard with the GPT-5 release. Not only does it have chart-topping performance, but also killer value. They even quietly improved their prefix cache discount from 4x to 10x.

This is a release worthy of the GPT name that puts OpenAI back on top in every possible category, dominating the Pareto frontier at the top:


And at the low end:

… but it’s no speed demon

The one chink in OpenAI’s armor is that, at least as of release week, the entire GPT-5 family is slow. The only slower model in the A or B tiers is Gemini 2.5 Pro. By contrast, Sonnet 4 is screaming fast. Even Opus 4.1 is faster than GPT-5 Mini.


And if we drop down to the Open Round, GPT-5 nano is both slower and dumber than Flash 2.5 – which, to be fair, is 5x more expensive. The big surprise is GPT OSS (as served by Fireworks) showing up as significantly faster than nano.

Performance by task length

The Power Ranking tasks are up to 108k tokens long. All the models get worse at larger tasks; the models that hold up best on large tasks (GPT-5 and Gemini 2.5 Pro) fall off more slowly than the rest, but they still fall off.


This also means that context length isn’t the primary reason that newer models are passing former value king DeepSeek’s offerings. Yes, V3 was only able to solve one task over 32k tokens long, but the other models in its class fall off sharply here, too. So if we look at scores counting only tasks under that threshold, V3 edges closer to Flash 2.5 but the relative rankings are unchanged.

Other observations

  • Unlike in SWE-bench, enabling thinking makes a meaningful difference in Opus 4.1’s performance in the Power Ranking. We speculate that this may be due to the larger, more complex tasks involved.
  • But almost no model benefited from high thinking over default or medium. o4 is the exception that does benefit; o3, Sonnet 4, Opus 4.1, Gemini Pro 2.5, and Gemini Flash 2.5 all saw negligible benefit, or even worse performance, from overthinking. (We were unable to get reasoning=high working with GPT-5 through the litellm proxy that Brokk uses in this initial test. We will update the results when we have solved the problem.)
  • The Chinese models (DeepSeek-V3, Kimi K2, and Qwen3 Coder) all did much worse than they did on SWE-bench and AiderBench. This difference is especially pronounced with K2, which was rewarded with a spot in the D tier. V3 is handicapped by a context window small enough that it can’t handle some Power Ranking tasks at all, but the newer models don’t have that excuse. Were they trained on the test?
  • Grok 3 mini is one of the top low-cost performers in AiderBench, but D-tier in the Power Ranking. This is probably because Brokk uses only a diff-based edit format; full-file replacement, which Grok 3 mini was configured to use in AiderBench, is too slow to be practical in the real world, or in the Power Ranking, where files often contain thousands of lines of code (versus the dozens in AiderBench tasks).
  • We were unable to get a reasonable API quota from xAI to test Grok 4.
  • Quantization matters. Qwen3 Coder fp8 scores significantly worse, on both average and best runs, than the native fp16 version. By default, you could get either one from OpenRouter, or even fp4, so be careful to configure this correctly!

Implications for builders

  • Build noise matters. No model did well with JGit in particular until we special-cased its Maven build to be less noisy. Besides the standard --quiet Maven flag, we added -DskipScriptExecution=true to the mvn invocation and a special BRK_SUPPRESS_STDERR flag to the test harness, keeping the noise low enough that it wasn’t overwhelming the build results with chaff (a minimal sketch of the invocation follows this list).
  • Test quality also matters. If you have a flaky test that sometimes passes and sometimes fails, it will confuse the hell out of your LLM assistant. It’s pretty confusing for humans, too. Fix your tests!
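
For concreteness, here is a minimal sketch of the kind of quiet Maven invocation described in the first bullet. Only --quiet is a standard Maven flag; -DskipScriptExecution=true and the BRK_SUPPRESS_STDERR environment variable are the benchmark-specific settings mentioned above, and the surrounding harness class is hypothetical.

import java.io.File;
import java.io.IOException;

// Hypothetical harness helper: run a project's tests with the noise-reduction settings
// described above. Only --quiet is standard Maven; the -D property and the environment
// variable are specific to this benchmark's setup.
public class QuietMavenBuild {

    /** Runs the build and returns the merged output so it can be fed back to the model. */
    public static String runTests(File projectDir) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "mvn", "--quiet",               // suppress Maven's INFO chatter
                "-DskipScriptExecution=true",   // JGit special case: skip noisy build scripts
                "test");
        pb.directory(projectDir);
        pb.environment().put("BRK_SUPPRESS_STDERR", "true"); // harness flag to drop stderr chaff
        pb.redirectErrorStream(true);           // merge stderr into stdout for one clean transcript

        Process process = pb.start();
        String output = new String(process.getInputStream().readAllBytes());
        int exit = process.waitFor();
        return "exit code " + exit + "\n" + output;
    }
}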

Under the hood

We wanted to build a benchmark that wasn’t saturated, that would have meaningful gaps between the models, even at the top. We also wanted to be able to tell the difference between models down a rung, at the “intelligence too cheap to measure” level or close to it. We were able to deliver on both counts.

You can think of the Power Ranking as halfway between AiderBench (“toy” problems, usually in a single file) and SWE-bench (“here’s the repo and the issue description, good luck”). Like SWE-bench, the Power Ranking uses real code from real repositories. But like AiderBench, the Power Ranking tells the model which files it will need to edit (though not which APIs outside those files it will need; the model is left to determine those using Brokk’s Scan Project).

Brokk’s tasks are significantly larger in scope than either AiderBench’s or SWE-bench’s:

To build the Power Ranking tasks, we looked at all the commits from the last six months in these five projects that had nontrivial changes AND test coverage, and reverse engineered them into problem descriptions (a rough sketch of this mining step follows the list):

  • Brokk
  • JGit
  • LangChain4j
  • Apache Cassandra
  • Apache Lucene
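
As a rough illustration of that mining step, and since JGit is already on the project list, here is how one might enumerate recent commits that touch both production and test code. The path heuristics and the 180-day window are assumptions for the sketch; the post does not describe the actual selection pipeline.

import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.diff.DiffEntry;
import org.eclipse.jgit.diff.DiffFormatter;
import org.eclipse.jgit.revwalk.RevCommit;
import org.eclipse.jgit.util.io.DisabledOutputStream;

import java.io.File;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Sketch: list commits from the last six months that change both production and test code,
// a first-pass filter for "nontrivial changes AND test coverage".
public class CommitMiner {
    public static void main(String[] args) throws Exception {
        Instant cutoff = Instant.now().minus(180, ChronoUnit.DAYS);
        try (Git git = Git.open(new File(args[0]));
             DiffFormatter diff = new DiffFormatter(DisabledOutputStream.INSTANCE)) {
            diff.setRepository(git.getRepository());
            for (RevCommit commit : git.log().call()) {
                // log output is roughly newest-first, so stop at the first commit past the cutoff
                if (Instant.ofEpochSecond(commit.getCommitTime()).isBefore(cutoff)) break;
                if (commit.getParentCount() != 1) continue;   // skip merges and root commits
                List<DiffEntry> changes = diff.scan(commit.getParent(0), commit);
                boolean touchesMain = changes.stream().anyMatch(c -> c.getNewPath().contains("src/main/"));
                boolean touchesTest = changes.stream().anyMatch(c -> c.getNewPath().contains("src/test/"));
                if (touchesMain && touchesTest) {
                    System.out.printf("%s %s%n", commit.getName(), commit.getShortMessage());
                }
            }
        }
    }
}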

We tested all models against the Brokk tasks (the "Open Round") and the top performers against all tasks ("Finalists").

Unlike SWE-bench, the Power Ranking does not rely on issue descriptions, which have the problem of being simultaneously too vague (“fix the race condition in the indexing system”) and too specific (giving away too much of the solution in the description, which has been a problem in SWE-bench).

The models tackle each task (example) in a single run of Brokk’s edit+test loop. That is, the model attempts to solve the task, Brokk runs the tests and gives any errors back to the model, and it gets to try again up to 5 times.
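
Sketched as Java, with hypothetical Model and BuildRunner interfaces standing in for Brokk’s real agent and build runner (which this post does not spell out), the loop looks something like this:

// Hypothetical outline of the edit+test loop: attempt, run tests, feed failures back, retry.
public class EditTestLoop {

    interface Model {
        /** Edit the task's files, given the task description and any build feedback. */
        void solveTask(String taskDescription, String buildFeedback);
    }

    interface BuildRunner {
        /** Compile and run the task's tests; null means success, otherwise the error log. */
        String runBuild();
    }

    static final int MAX_ATTEMPTS = 5;   // the "bend in the knee" discussed below

    /** Returns the number of failed builds before success, or -1 if the task was never solved. */
    static int run(Model model, BuildRunner build, String taskDescription) {
        String feedback = "";            // empty on the first attempt
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            model.solveTask(taskDescription, feedback);
            String errors = build.runBuild();
            if (errors == null) {
                return attempt;          // attempt == number of build failures so far
            }
            feedback = errors;           // errors go back to the model for the next try
        }
        return -1;                       // unsolved after 5 tries
    }
}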

5 is the “bend in the knee” past which we saw significantly less success. That is, when we experimented with letting models try until they self-evaluated as not making progress (or until they ran out of context window) we saw very few successes past 5, indicating that either the model did not know how to use an API correctly and was guessing, or simply wedged itself into a corner that it didn’t know how to get out of.

Interesting side note: the two models that were most stubborn about continuing to try to solve a hopeless task were, by far, o3 and Qwen3 Coder, adding some evidence to the speculation that Qwen3 was trained on o3 output.

To capture the difference between a model that succeeds on the first try and one that needs multiple iterations, we assign the score for each successful task as

score = 1.0 / log2(build_failures + 2)
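
Since java.lang.Math has no log2 function, the same rule expressed in Java looks like the following (the class and method names are just for illustration):

public class PowerRankScore {

    /** Score for a successful task: 1.0 on a first-try pass, lower when retries were needed. */
    static double score(int buildFailures) {
        return 1.0 / (Math.log(buildFailures + 2) / Math.log(2)); // 1 / log2(buildFailures + 2)
    }

    public static void main(String[] args) {
        System.out.println(score(0)); // 1.0   : solved on the first try
        System.out.println(score(1)); // ~0.63 : one failed build, then success
        System.out.println(score(4)); // ~0.39 : passed on the fifth and final attempt
    }
}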

The Brokk Power Ranking is open source at https://github.com/BrokkAI/powerrank.

A Note on Reasoning

A bare model name indicates that the model was run with its default reasoning budget. The exception is the Claude models, whose API defaults to no reasoning at all; those were set to “medium”. The -nothink variants indicate runs with thinking disabled.

What's next for the Power Ranking

We will update the Power Ranking every six months. If you know an actively maintained, open source Java repo that we should include in our task sources, give us a shout!
