Flaky Tests: a journey to beat them all

  • January 13, 2026
Table of Contents

  • What’s a flaky test?
  • First try: retry them all!
  • Second try: fix them all!
  • Third try: embrace the inevitability!
  • Conclusion

“Sleep is not a synchronization primitive.”

Every test engineer, eventually

What’s a flaky test?

A flaky test is a test that sometimes passes and sometimes fails without any code changes. They’re the by‑product of non‑determinism: timing, concurrency, eventual consistency, network hiccups, clock drift, resource contention, and (our favorite) tests leaking state across runs.

Kestra is an open-source declarative orchestration platform designed to run, coordinate, and monitor large-scale, event-driven workflows. It is built to handle parallelism, asynchronous execution, and distributed systems at scale, exactly the kind of environment where determinism is hard and flaky tests tend to emerge.

At Kestra, we run 6,000+ tests across our repositories. We add dozens every day. If only 1% of those are flaky with a 10% failure probability, you’ve got ~60 flaky tests. Expectation math says ~6 failures per CI run (6,000 × 1% × 10%); good luck spotting real regressions under that noise.

Because Kestra is an orchestration platform, many of our tests execute parallel, asynchronous workflows. Async is powerful but naturally tricky to test: ordering isn’t guaranteed, and “eventually consistent” is not a helpful assertion.

One of our top issues comes from our queuing system: a test may receive a message emitted by another test, or miss a message from the queue. We strive to properly close the queue and handle all messages so they don’t leak across tests, but that’s hard to guarantee.

Last year, CI was red often enough that we decided to go on a proper flake‑hunting journey.

First try: retry them all!

Our first attempt to beat them all was to retry the flaky tests.

Kestra is built in Java, and tests are written with the JUnit framework. The JUnit Pioneer extension provides an annotation that retries a test if it fails: @RetryingTest(5). We added this annotation to every test that frequently failed in our CI.
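For illustration, here is roughly what that looks like on a test; the class and method names are made up, not actual Kestra code:

```java
import org.junitpioneer.jupiter.RetryingTest;

class WorkflowExecutionTest {

    // @RetryingTest replaces @Test: JUnit Pioneer re-runs the method
    // up to 5 times and reports success as soon as one attempt passes.
    @RetryingTest(5)
    void flowExecutionEventuallySucceeds() {
        // launch a workflow and assert on its execution...
    }
}
```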

This helped… a bit. But it also inflated test times and masked real issues. Worse, some failures are structural (leaked resources, race conditions): once they fail, they keep failing, no matter how often you retry.
Verdict: good band‑aid, bad cure.

Second try: fix them all!

We then decided to put effort into fixing the failing tests! We removed all usages of the @RetryingTest(5) annotation and either fixed each test or disabled it.

Most of the flaky tests launch a workflow and assert on its execution, so we improved our testing framework in this area to make sure every test properly closes its resources and every workflow and execution created by a test gets deleted.

For that, we created a JUnit extension to manage test resource creation:

  • A @KestraTest annotation handles starting and closing the Kestra runner in the scope of a test class.
  • A @LoadFlows annotation handles loading and then removing flows in the scope of a test method.
  • An @ExecuteFlow annotation handles starting and then removing a flow execution in the scope of a test method.

Using this test framework everywhere gives us more control over resource allocation and deallocation, and lets us clean up any flow or execution created by a test, avoiding test pollution from unrelated resources.
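Put together, a test using these annotations looks roughly like the sketch below. The package names, flow paths, and assertion API are illustrative assumptions based on the description above, not necessarily the exact Kestra test framework API:

```java
import io.kestra.core.junit.annotations.ExecuteFlow;
import io.kestra.core.junit.annotations.KestraTest;
import io.kestra.core.junit.annotations.LoadFlows;
import io.kestra.core.models.executions.Execution;
import io.kestra.core.models.flows.State;

import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;

// Starts the Kestra runner before the test class and closes it afterwards.
@KestraTest
class HelloWorldFlowTest {

    // Loads the listed flows before the test method and removes them afterwards.
    @Test
    @LoadFlows({"flows/valids/hello-world.yaml"})
    void loadsTheFlow() {
        // assert on the loaded flow...
    }

    // Starts an execution of the flow, injects the result, and cleans it up afterwards.
    @Test
    @ExecuteFlow("flows/valids/hello-world.yaml")
    void executesTheFlow(Execution execution) {
        assertEquals(State.Type.SUCCESS, execution.getState().getCurrent());
    }
}
```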

But after weeks of effort, we had disabled too many tests. And even though the number of flaky tests decreased, some were still failing, however rarely; with the high number of tests we have, that was still enough to make our CI suffer.

Third try: embrace the inevitability!

So tests will fail; we had to accept that. Some fail pretty often, some rarely, but tests will fail.
We had to be pragmatic and embrace the inevitability of flaky tests.

We decided to flag flaky tests and allow them to fail in the CI! This was not an easy decision, as nobody wants to concede failure and accept it. But if we want a reliable CI without compromising test coverage or exploding test implementation time, we have to avoid disabling tests and accept that some will fail pretty often.

To flag a flaky test, we annotate it with @FlakyTest, a custom marker annotation that encapsulates JUnit's @Tag("flaky") annotation.
JUnit tags are a great fit for this use case: they let you target a group of tagged tests when running your test suite.
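Such a marker annotation is only a few lines. A minimal sketch, assuming the only job of @FlakyTest is to carry the "flaky" tag:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.junit.jupiter.api.Tag;

// Any test class or method carrying @FlakyTest is also tagged "flaky",
// so it can be included or excluded by tag when the suite runs.
@Documented
@Target({ElementType.TYPE, ElementType.METHOD})
@Retention(RetentionPolicy.RUNTIME)
@Tag("flaky")
public @interface FlakyTest {
}
```

A test keeps its regular @Test annotation and simply adds @FlakyTest next to it.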

Our CI now launches the tests in two steps:

  • First, the tests not tagged as flaky: those must pass for the CI run to be green.
  • Then, the tests tagged as flaky: those are allowed to fail.
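In practice this split lives in the build and CI configuration, but the underlying mechanism is plain JUnit Platform tag filtering. Here is a purely illustrative sketch of the two passes using the JUnit Platform Launcher API; the package name is a placeholder:

```java
import static org.junit.platform.engine.discovery.DiscoverySelectors.selectPackage;

import java.io.PrintWriter;

import org.junit.platform.launcher.Launcher;
import org.junit.platform.launcher.LauncherDiscoveryRequest;
import org.junit.platform.launcher.TagFilter;
import org.junit.platform.launcher.core.LauncherDiscoveryRequestBuilder;
import org.junit.platform.launcher.core.LauncherFactory;
import org.junit.platform.launcher.listeners.SummaryGeneratingListener;

class TwoStepTestRun {

    public static void main(String[] args) {
        // Step 1: everything except tests tagged "flaky"; in CI, failures here break the build.
        runAndReport(LauncherDiscoveryRequestBuilder.request()
                .selectors(selectPackage("io.kestra"))
                .filters(TagFilter.excludeTags("flaky"))
                .build());

        // Step 2: only tests tagged "flaky"; in CI, failures here are reported but tolerated.
        runAndReport(LauncherDiscoveryRequestBuilder.request()
                .selectors(selectPackage("io.kestra"))
                .filters(TagFilter.includeTags("flaky"))
                .build());
    }

    private static void runAndReport(LauncherDiscoveryRequest request) {
        Launcher launcher = LauncherFactory.create();
        SummaryGeneratingListener listener = new SummaryGeneratingListener();
        launcher.execute(request, listener);
        listener.getSummary().printTo(new PrintWriter(System.out));
    }
}
```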

We also improved our CI to report standard tests and flaky tests separately, with a test summary in a PR comment that directly contains the list of failing tests with their stack traces. This allows us to pinpoint test issues more easily.

Of course, flagging a test as flaky is easy to do, so we are careful to first try to fix the test and only tag it as flaky as a last resort.
We also have test observability in place to track flaky tests, so if their number grows significantly, we will know.

Conclusion

You won’t beat every flaky test. That’s fine. The goal is to get reliable signals back into CI so you can confidently merge and ship. Separate what must be green from what’s allowed to wobble, invest in deterministic test lifecycles, and keep an eye on the flaky set so it doesn’t quietly grow.

Flakes are inevitable. Letting flakes dictate your delivery is optional.

Want to try Kestra? You can get started in 5 minutes by following the quickstart guide.
