Three Key Elements to Incorporate into Your Flaky Test Remediation Approach

August 29, 2023

Table of Contents

1. Deploy best practice strategies
2. Align your process & resources
3. Understand the causes of flaky tests
Conclusion & next steps

Flaky tests are hard to deal with because they fail unpredictably and inconsistently. Addressing them takes a multi-faceted approach that combines mitigation strategies, aligned process and resources, and a deep understanding of what causes tests to be flaky. This post walks you through that approach.

Note! This post is part of a three-part series. If you’re not sure it’s worth remediating flaky tests, read Part 1: Seven Reasons You Should Not Ignore Flaky Tests. To understand the keys to identifying and tracking flaky tests, read Part 2: 5 Ways to Use Gradle Enterprise to Identify and Manage Flaky Tests. Now, let’s explore my multi-faceted approach to fixing flaky tests.

1. Deploy best practice strategies

Once you have identified which of your tests are flaky, you can use one of these strategies to mitigate the problems they cause.

Screenshot of Gradle Enterprise test failures dashboard

1. Quarantine Flaky Tests: Isolate flaky tests to prevent them from disrupting the development process and distracting developers from genuine failures. Once quarantined, these tests can be analyzed separately, freeing developers to focus on legitimate failures.

2. Improve Error Reporting: You may need more information in order to find the causes of the test failures. Enhancing your error reporting can significantly aid in handling flaky tests. This can be achieved by adding assertions, checking preconditions, and logging more details about the test environment and state.

3. Retry with Care: While retrying can be a useful tool in identifying flaky tests, it’s not a strategy for solving the problem. Retrying until a test passes masks the intermittency and wastes resources in CI and locally.

4. Commit to Fixing Flaky Tests: Once you’ve tracked down the flaky tests, and perhaps improved the error reporting and quarantined them so they don’t interfere with your team’s productivity, the goal should be to fix the tests themselves.
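To make the "Retry with Care" point concrete, here is a minimal sketch in plain Java of a retry helper that reports a pass-after-failure as flaky instead of hiding it. The names `RetryOutcome` and `FlakinessProbe` are hypothetical and invented for this illustration; they are not part of Gradle Enterprise or any test framework.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical outcome type: a retry that eventually passes is FLAKY, not PASSED. */
enum RetryOutcome { PASSED, FAILED, FLAKY }

final class FlakinessProbe {
    private FlakinessProbe() {}

    /**
     * Runs the check once, retrying up to maxRetries times if it fails.
     * A success after at least one failure is reported as FLAKY so the
     * intermittency stays visible instead of being masked by the retry.
     */
    static RetryOutcome runWithRetry(BooleanSupplier check, int maxRetries) {
        if (check.getAsBoolean()) {
            return RetryOutcome.PASSED;
        }
        for (int retry = 1; retry <= maxRetries; retry++) {
            if (check.getAsBoolean()) {
                return RetryOutcome.FLAKY;
            }
        }
        return RetryOutcome.FAILED;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated intermittent check: fails on the first call, passes afterwards.
        System.out.println(runWithRetry(() -> ++calls[0] > 1, 2));
    }
}
```

The design choice here is that a retry never upgrades a result to "passed": the build can still go green if you decide so, but the flaky result is recorded so someone can follow up on it.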

You can also read about how the Gradle Build Tool team handles flaky tests.

2. Align your process & resources

If you don’t want to rely on a few individuals with the discipline and determination to fix your flaky tests, you’ll need to implement some process changes to make sure time is allocated to fixing the problems.

When a developer commits a change that breaks a test, the developer or developers who worked on that change usually start working on fixing that test. This is a well-accepted approach to fixing breaking changes, but it equally applies to tests that start to fail intermittently.

If your application already has a number of flaky tests that aren’t owned by a developer, you may want to schedule regular Flaky Test Days. These dedicated sessions not only aim to decrease the number of flaky tests in your test suite, they also emphasize the importance of addressing test flakiness, and foster a culture of collective responsibility toward improving test reliability.

3. Understand the causes of flaky tests

The causes of test intermittency are varied and nuanced, as discussed by Dave Farley in his video, 5 Reasons Automated Tests Fail, and collated in a research paper on the impact and causes of intermittent tests. Each test may be a unique case, but you may also find that one cause of intermittency affects multiple tests.

Here are some common causes of test intermittency. Note that these categories can overlap, but considering each failure from one of these angles may lead to identifying a fix for the failure.

1. Concurrency, Asynchronous Programming, and Waiting: Asynchronous and concurrent programming pose specific challenges to testing. Tests often have to wait for events to happen before taking the next steps or may run into race conditions in either the test code or production code.

There may be environmental factors in these failures too, since tests may time out more frequently if the test environment is under a high load.

Screenshot of the time taken to run the test over time
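One common way to reduce timing-related flakiness is to wait for an explicit completion signal, bounded by a generous timeout, rather than sleeping for a fixed interval. The sketch below assumes a hypothetical asynchronous task (the `"order-processed"` result is a placeholder) and uses only `java.util.concurrent` classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

final class AwaitExample {
    private AwaitExample() {}

    /** Starts hypothetical async work and waits for its completion signal. */
    static String runAndAwait(long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();
        try {
            executor.execute(() -> {
                result.set("order-processed"); // placeholder for real async work
                done.countDown();              // signal completion explicitly
            });
            // Returns as soon as the work finishes; the timeout is only an upper bound,
            // so a slow, heavily loaded machine does not turn into a false failure.
            if (!done.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                        "Timed out after " + timeoutMillis + " ms waiting for async work");
            }
            return result.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for async work", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndAwait(5_000));
    }
}
```

Compared with a fixed `Thread.sleep`, this finishes quickly on a fast machine and tolerates a slow one, and the timeout message says what the test was waiting for.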

2. Environment, Network, and Resources: Variations in testing environments or network conditions, as well as insufficient compute resources, can result in inconsistent test behavior.

Gradle Enterprise can help you identify some of these issues: it shows details about the environment the tests ran in, so you can compare test results from different environments.

Screenshot of build scan's infrastructure page

3. Integration Points: Tests depending on external systems or services (integration points) may be flaky due to the unpredictable nature of these dependencies. This includes other services from inside your organisation, as well as third-party libraries and APIs or systems that are external to your organisation.

Integration tests against external systems are valuable since they can tell you if your assumptions about the system are correct. However, tests that are designed to run against systems that can change without warning should be kept separate from the main test suite due to their inherently uncertain behaviour.

The main test suite, in turn, should protect itself from these integration points by mocking and stubbing their expected behaviour.

Screenshot of results of integration test running
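As a rough illustration of stubbing an integration point, the sketch below hides a remote service behind an interface so the main test suite can supply a fixed response. `ExchangeRateClient` and `PriceConverter` are hypothetical names invented for this example, and the rate of 2.0 is an arbitrary stub value.

```java
/** Hypothetical integration point: in production this would call a remote service. */
interface ExchangeRateClient {
    double rateFor(String fromCurrency, String toCurrency);
}

/** Production code depends on the interface, not on the remote service itself. */
final class PriceConverter {
    private final ExchangeRateClient client;

    PriceConverter(ExchangeRateClient client) {
        this.client = client;
    }

    double convert(double amount, String fromCurrency, String toCurrency) {
        return amount * client.rateFor(fromCurrency, toCurrency);
    }

    public static void main(String[] args) {
        // In a test, a lambda stands in for the external system and always
        // answers with the same rate, so the result cannot vary between runs.
        ExchangeRateClient stub = (from, to) -> 2.0;
        System.out.println(new PriceConverter(stub).convert(10.0, "EUR", "USD"));
    }
}
```

A separate, clearly labelled suite can still exercise the real client to check that the stubbed assumptions hold.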

4. Setup/Teardown and Test Data: Test results can only be predictable if the start state of the test and end state of the test are also predictable. If the tests rely on shared state, shared data, or shared resources (like a database), this can be a contributor to intermittency in the tests. It’s key to make sure the tests run in isolation so they don’t impact the data from other tests.
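For file-based resources, one way to achieve this isolation is to give each test its own temporary directory and delete it during teardown. The `IsolatedWorkspace` class below is a hypothetical helper written for this illustration using only `java.nio.file`; if you use JUnit 5, its `@TempDir` support covers a similar need.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: a private scratch directory that one test owns. */
final class IsolatedWorkspace implements AutoCloseable {
    private final Path root;

    IsolatedWorkspace() {
        try {
            // Each instance gets a fresh, uniquely named directory.
            root = Files.createTempDirectory("flaky-test-demo-");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void write(String fileName, String content) {
        try {
            Files.write(root.resolve(fileName), content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean exists(String fileName) {
        return Files.exists(root.resolve(fileName));
    }

    /** Teardown: delete files before their parent directories. */
    @Override
    public void close() {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()).forEach(path -> {
                try {
                    Files.delete(path);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Used in a try-with-resources block, each test starts from an empty directory and leaves nothing behind for the next one.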

Even when the data is isolated from other tests, you may still run into unpredictable results if your test or production code depends on randomly generated values or on the current date and time. You may want to inject a custom provider of random values or date/time into your production code so that you can control these values from the test.
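For date and time, the JDK's `java.time.Clock` can serve as the injected provider. The sketch below assumes a hypothetical `TrialPeriod` class; production code would pass `Clock.systemUTC()`, while a test passes `Clock.fixed(...)` to pin "now" to a known instant.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Hypothetical production class whose behaviour depends on today's date. */
final class TrialPeriod {
    private final Clock clock;
    private final LocalDate lastDay;

    TrialPeriod(Clock clock, LocalDate lastDay) {
        this.clock = clock;
        this.lastDay = lastDay;
    }

    /** The trial has expired once the clock's date is after the last day. */
    boolean hasExpired() {
        return LocalDate.now(clock).isAfter(lastDay);
    }

    public static void main(String[] args) {
        LocalDate lastDay = LocalDate.of(2023, 8, 29);
        // A fixed clock makes the boundary case repeatable on any machine, at any time.
        Clock justBeforeMidnight =
                Clock.fixed(Instant.parse("2023-08-29T23:59:59Z"), ZoneOffset.UTC);
        Clock atMidnight =
                Clock.fixed(Instant.parse("2023-08-30T00:00:00Z"), ZoneOffset.UTC);
        System.out.println(new TrialPeriod(justBeforeMidnight, lastDay).hasExpired());
        System.out.println(new TrialPeriod(atMidnight, lastDay).hasExpired());
    }
}
```

The same idea applies to random values: accepting a `java.util.Random` in the constructor lets a test pass one created with a fixed seed.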

Screenshot of a test failure probably caused by two tests using the same resources

5. System Behavior: While it’s easy to assume it’s some problem with the environment or test data that’s causing an intermittent failure, sometimes the problems lie in the production code of your application.

For example, test environments can sometimes trigger genuine race conditions in concurrent code or uncover bugs in third-party libraries. These can sometimes be the most difficult issues to identify but are arguably the most important reason to address flaky tests.
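As a small, hedged illustration of the kind of production bug a flaky test can expose, consider a shared counter updated from several threads. The `HitCounter` class below is hypothetical; it shows the corrected version, with a comment describing the racy variant that would produce intermittently wrong totals.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical production class that is called from many threads. */
final class HitCounter {
    // With a plain `int hits` and `hits++`, two threads can read the same value
    // and both write back value + 1, losing an update. A test asserting the total
    // would then fail only occasionally, typically under load.
    private final AtomicInteger hits = new AtomicInteger();

    void record() {
        hits.incrementAndGet(); // atomic read-modify-write
    }

    int total() {
        return hits.get();
    }

    /** Stress helper: several threads each record the same number of hits. */
    static int hammer(int threads, int hitsPerThread) {
        HitCounter counter = new HitCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int t = 0; t < threads; t++) {
                pool.execute(() -> {
                    for (int i = 0; i < hitsPerThread; i++) {
                        counter.record();
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Worker threads did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for workers", e);
        } finally {
            pool.shutdownNow();
        }
        return counter.total();
    }
}
```

With the atomic counter, `HitCounter.hammer(4, 10_000)` always totals 40,000. With the racy variant the total would sometimes fall short, which is exactly the signal an intermittent test failure is giving you.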

Conclusion & next steps

Managing flaky tests efficiently requires a combination of strategic actions, process changes, and a deep understanding of root causes.

By weaving these elements together, your team can effectively navigate the challenges posed by flaky tests, ensuring the delivery of high-quality, reliable software.

Trisha Gee

Author

Engineer, author, keynote speaker, developer champion, catalyst. Developer Advocate @ Gradle.
