Why Mirroring Production in Dev Helps You Avoid Costly Mistakes

July 28, 2025

Table of Contents

The setup: A realistic aggregation scenario
The application behind the test
Testing on M0: The hidden risk
Taking it to production: Same query, different outcome
Real-time metrics: Detecting the bottleneck
Query insights: The detective tool
Don’t guess, let Performance Advisor show the way
Resilience under pressure: Testing primary failover

Replication
Failover scenario
Simulating application load
Triggering the test in Atlas

Ready for production?

Many developers start building their applications with MongoDB using a free M0 cluster or a local environment. While this is common and convenient, it can lead to issues that could easily be avoided by using a more robust setup, such as a development cluster that closely mirrors production. Problems like inefficient queries, missing indexes, or even costly mistakes often go unnoticed in limited environments like M0 or local setups.

In this article, we’ll explore why your development environment should closely reflect your production environment, and how using an M10+ cluster with tools like Query Profiler and Performance Advisor can help you catch performance issues early and build with confidence.

To demonstrate this, we’ve built a small Java application that simulates a real-world scenario: the application needs to retrieve movie details, including the title, the full plot, and the plot's semantic vector representation, by combining data across collections with a multi-stage aggregation pipeline. We'll use it to show how MongoDB Atlas tools can uncover problems that might otherwise go unnoticed until they impact production.

The setup: A realistic aggregation scenario

Let’s imagine your team is working on a new feature that queries movie data with some filtering and enrichment logic. The goal is to retrieve movie details, such as the title and full plot, and enrich them with vector embeddings stored in a related collection.

To represent this scenario, we built an aggregation query that joins the movies collection with embedded_movies to fetch embeddings, filters for records where the full plot mentions the word “snow,” and returns a projection with selected fields like title, year, fullplot, and the retrieved plot_embedding:

db.movies.aggregate([
 {
   $lookup: {
     from: "embedded_movies",
     localField: "title",
     foreignField: "title",
     as: "result"
   }
 },
 {
   $match: {
     fullplot: {
       $regex: "snow",
       $options: "i"
     }
   }
 },
 {
   $project: {
     title: 1,
     year: 1,
     fullplot: 1,
     plot_embedding: "$result.plot_embedding"
   }
 },
 {
   $sort: {
     year: -1
   }
 }
])

While this query works correctly, it's intentionally inefficient, designed to simulate a pattern that seems harmless in small datasets but can quickly become problematic as data grows. It includes multiple stages that increase resource usage and complexity, helping us observe how different environments respond under pressure.

One key detail is the use of a $regex filter on the fullplot field, a choice that can lead to slower performance in larger datasets. Although a more efficient solution would be to use MongoDB Atlas Search, we intentionally avoided it here. The goal is to highlight performance pitfalls that developers might face when relying on basic queries without deeper optimization.
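To make the cost of that filter more concrete, here is a rough in-memory analogy in plain Java. It is not MongoDB code: the `RegexScanSketch` class and the sample plots are hypothetical, and the sketch only illustrates that an unanchored, case-insensitive pattern has to be tested against every document, so the work grows with the size of the collection rather than with the number of matches.

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexScanSketch {

    /** How many plots were tested and how many of them matched. */
    public record ScanResult(int examined, int matched) {}

    // Tests every plot against an unanchored, case-insensitive pattern,
    // similar in spirit to { $regex: "snow", $options: "i" } on an unindexed field.
    public static ScanResult scan(List<String> plots, String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        int examined = 0;
        int matched = 0;
        for (String plot : plots) {
            examined++; // every document is touched, match or not
            if (pattern.matcher(plot).find()) {
                matched++;
            }
        }
        return new ScanResult(examined, matched);
    }

    public static void main(String[] args) {
        // Hypothetical sample data, standing in for the fullplot field
        List<String> plots = List.of(
                "A snowstorm traps a family in a cabin.",
                "Two friends open a bakery.",
                "SNOW covers the city overnight.",
                "A detective chases a thief.");
        ScanResult result = scan(plots, "snow");
        // prints "examined=4, matched=2"
        System.out.println("examined=" + result.examined() + ", matched=" + result.matched());
    }
}
```

With four plots the difference is invisible; with hundreds of thousands, the examined count is what drives CPU usage and latency.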

The application behind the test

To turn this into a more practical scenario, we created a small Java application with an HTTP endpoint /enriched-details that triggers a method called getMovies. This endpoint executes the same aggregation we discussed earlier, allowing us to simulate how a real application would interact with the database and measure how long the query takes to run.

@GetMapping("/enriched-details")
public ResponseEntity<List<Document>> search(
      @RequestParam String plot
) {
   return ResponseEntity.ok(movieService.getMovies(plot));
}

The controller delegates to a service method where the aggregation is executed. The execution time is logged to help evaluate the impact of this query under different environments:

public List<Document> getMovies(String plot) {
   var start = System.currentTimeMillis();
   MongoCollection<Document> collection = mongoDatabase.getCollection("movies");

   ArrayList<Document> result = collection.aggregate(List.of(
         // Join each movie with its embedding document by title
         new Document("$lookup", new Document("from", "embedded_movies")
               .append("localField", "title")
               .append("foreignField", "title")
               .append("as", "result")),
         // Case-insensitive regex filter on the full plot (intentionally unoptimized)
         new Document("$match", new Document("fullplot", new Document("$regex", plot).append("$options", "i"))),
         // Keep only the fields the endpoint returns
         new Document("$project", new Document("title", 1)
               .append("year", 1)
               .append("fullplot", 1)
               .append("plot_embedding", "$result.plot_embedding")),
         // Newest movies first
         new Document("$sort", new Document("year", -1))
   )).into(new ArrayList<>());

   // Log how long the whole aggregation took, as seen by the application
   long duration = System.currentTimeMillis() - start;
   logger.info("Duration: {} ms", duration);
   return result;
}

Additionally, we implemented another endpoint that performs a simpler query by title and year. It helps us later demonstrate how even basic queries can benefit from proper indexing, especially in larger datasets:

@GetMapping("/by-title-year")
public ResponseEntity<List<Document>> findByTitleAndYear(
      @RequestParam String title,
      @RequestParam int year
) {
   return ResponseEntity.ok(movieService.findByTitleAndYear(title, year));
}

And then, the service code that performs the search:

public List<Document> findByTitleAndYear(String title, int year) {
   return getMoviesCollection()
         .find(new Document("title", title).append("year", year))
         .into(new ArrayList<>());
}

This simple setup makes it easy to test different queries in a controlled way, including both the aggregation with $lookup and the direct find by title and year.

The full source code is available on GitHub.

Testing on M0: The hidden risk

When tested against an M0 cluster, the application behaves normally. The response returns without errors, latency is acceptable, and from the app’s perspective, everything looks fine.

To try it yourself, first make sure to run the application and ensure your database contains the sample_mflix dataset (available in MongoDB Atlas as a preloaded sample dataset you can import with one click). Then, point the application to an M0 cluster using your connection string and call the following endpoints:

### Enriched movie details
GET http://localhost:8080/movies/enriched-details?plot=love

### Find movie by title and year
GET http://localhost:8080/movies/by-title-year?title=Titanic&year=1903

Technically speaking, both queries execute relatively fast, mainly because the dataset is still small. But here's the catch:

M0 clusters have a 512MB storage limit, which directly impacts the amount of data we’re querying.

In our test environment, the M0 cluster contains just over 21,000 movie documents, staying well within the size limit. Because the dataset is this small, even a complex aggregation runs fast enough to appear "safe," which gives a false sense of performance stability.

Now, you might be wondering: “How much data do I actually need to get meaningful test results?”

While there's no magic number, testing on a nearly empty database won't reveal much. As a practical rule of thumb, try to load at least 1-10% of your real dataset, or somewhere around 1GB or more, depending on your workload.

M0 clusters, with their 512MB cap, are perfect for getting started. But once your app is doing real work, you'll need more space and more visibility to catch what actually matters.

Taking it to production: Same query, different outcome

Let’s continue our scenario by imagining that the application has now been deployed to production. To simulate this environment more accurately, where the dataset is significantly larger and where most applications typically run on more powerful clusters, the application was moved to an M10 cluster.

Instead of the 21,000 movie documents we had on M0, the M10 environment now holds over 520,000 documents, which is more than 20x the original volume.

The exact same endpoints were executed, but this time, performance issues quickly surfaced. Latency increased, and response times became inconsistent.

What could be causing this? One of the first clues is the significant difference in data volume. With over 520,000 documents, the M10 cluster is processing far more data than the M0 environment, which naturally increases the query’s cost.

Real-time metrics: Detecting the bottleneck

To better understand the system’s behavior, the /movies/enriched-details endpoint was executed again, but this time, with MongoDB Atlas's Real-Time Performance panel open. That’s when a red flag appeared: CPU usage spiked to 100% during the request.

Atlas real-time metrics

On the bottom-right, we also see the slowest operations pointing to the movies collection, a strong indication that something in our aggregation was overloading the system. However, this alone doesn’t explain exactly what caused the spike.

Query insights: The detective tool

Following the clues, the next logical step is to open Query Insights, a tool that helps investigate performance issues in more detail. During the same time window, we can access the Query Profiler tab to view which operations took the longest to execute. There, we can often identify the query responsible for the high resource usage.

Query Profiler

As we can observe, between 18:00 and 19:00, the chart shows a clear spike in operation execution time. During this window, the duration of read operations on the movies collection increased quickly, starting from just a few milliseconds and reaching up to one minute.

Query Profiler

This sudden escalation confirms that the query began consuming significantly more resources, which aligns with the symptoms we observed earlier in the real-time metrics. The Query Profiler provided more concrete details: It shows the exact aggregation that was running, the total execution time (1.03min), and the number of documents examined:

Query Profiler

With this information in hand, we can move toward a proper optimization solution or even make code adjustments to prevent the issue.

Don’t guess, let Performance Advisor show the way

Continuing our analysis, we have the /by-title-year endpoint, which returns movies filtered by title and year. Once the application is live in production, this endpoint starts receiving several requests to look up specific movies. That’s when we notice the query isn’t optimized, and we might not even know exactly how to improve it.

This is where the Performance Advisor comes in. M10+ clusters offer this feature under the Performance menu, providing valuable suggestions based on real-world usage. Select your cluster and click on the Performance Advisor tab:

Performance Advisor recommendations

In our scenario, the Performance Advisor is suggesting the creation of an index for our cluster. By selecting View Recommendations, we can view the specific index suggestion that may help improve query performance:

Performance Advisor index suggestion

Atlas is suggesting the creation of the { title: 1, year: 1 } index on the movies collection, as it directly matches the queries executed by our endpoint.

The query is quite simple, and the need for an index might be easy to spot manually. But in more complex scenarios, the Performance Advisor becomes a powerful ally during development.
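As a rough illustration of why an index whose keys match the query shape helps, the sketch below contrasts a full scan with a lookup through a map keyed on (title, year). It is an in-memory analogy in plain Java, not how MongoDB implements its indexes (a hash map cannot serve range or prefix queries, for example). The `CompoundIndexSketch` class, the `Movie` record, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompoundIndexSketch {

    // Hypothetical, simplified movie document
    public record Movie(String title, int year, String plot) {}

    // Composite key mirroring the { title: 1, year: 1 } key pattern
    private record Key(String title, int year) {}

    private final List<Movie> movies;
    private final Map<Key, List<Movie>> byTitleAndYear = new HashMap<>();

    public CompoundIndexSketch(List<Movie> movies) {
        this.movies = List.copyOf(movies);
        // Build the lookup structure once, up front, much like creating an index
        for (Movie movie : this.movies) {
            byTitleAndYear
                    .computeIfAbsent(new Key(movie.title(), movie.year()), k -> new ArrayList<>())
                    .add(movie);
        }
    }

    // No index: every movie is examined, however few of them match
    public List<Movie> findByScan(String title, int year) {
        List<Movie> matches = new ArrayList<>();
        for (Movie movie : movies) {
            if (movie.title().equals(title) && movie.year() == year) {
                matches.add(movie);
            }
        }
        return matches;
    }

    // With the composite key: only the matching entries are touched
    public List<Movie> findByIndex(String title, int year) {
        return List.copyOf(byTitleAndYear.getOrDefault(new Key(title, year), List.of()));
    }

    public static void main(String[] args) {
        CompoundIndexSketch sketch = new CompoundIndexSketch(List.of(
                new Movie("Titanic", 1997, "A ship sinks."),
                new Movie("Titanic", 1953, "An earlier telling."),
                new Movie("Alien", 1979, "A crew meets a creature.")));
        // Both calls return the same single movie; only the amount of work differs
        System.out.println(sketch.findByScan("Titanic", 1997));
        System.out.println(sketch.findByIndex("Titanic", 1997));
    }
}
```

On the real cluster, the index can be created straight from the recommendation in Atlas. With the Java driver, it would typically be a single call along the lines of `collection.createIndex(Indexes.ascending("title", "year"))`.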

Resilience under pressure: Testing primary failover

A cluster configured in MongoDB Atlas operates as a replica set with three nodes. By default, there are two secondary nodes and one primary node, as shown in the image below:

MongoDB replica set

Replication

The primary node is responsible for handling all write operations. After each write, the data is automatically replicated to the secondaries, which may be deployed across different geographic regions.

This setup is especially valuable in scenarios involving backup and regional failures. Imagine, for instance, that the primary node is hosted in the São Paulo region, and that region becomes unavailable. In that case, the two secondary nodes in other regions still maintain up-to-date copies of the data.

Failover scenario

But what happens if the primary node fails? MongoDB will automatically initiate an election process among the secondaries to choose a new primary. Once elected, that node takes over all write operations. And the old primary? Once it recovers, it re-joins the replica set as a secondary node.

Although MongoDB handles this entire process automatically, it's equally important to ensure that your application also behaves as expected during a failover. That’s where M10+ clusters come in with the Primary Failover feature, which allows you to safely simulate a failover and observe how your application responds in real time.

The idea behind this feature is to force the current primary node to step down, triggering an election where a new primary is chosen. This simulates the failure of a node, helping you validate how well your application handles such an event.

Simulating application load

To ensure the application can handle a primary switch without interruptions, all we need is a simple method that performs both write and read operations to the database. These operations should continue running seamlessly, even during a primary failover. Here is the code:

int counter = 0;
while (true) {
   long startTime = System.currentTimeMillis();
   try {
       // Write one document, then read back the one with the highest counter
       Document doc = new Document("counter", counter)
               .append("timestamp", new java.util.Date());
       collection.insertOne(doc);
       collection.find().sort(new Document("counter", -1)).first();
       long duration = System.currentTimeMillis() - startTime;
       logger.info("{}", String.format(
               "Attempt #%d → Write & Read completed in %dms%s",
               counter, duration,
               duration > 5000 ? " (This is slower than expected)" : ""));
       counter++;
   } catch (Exception e) {
       // Operations may fail briefly during an election; log the failure and keep going
       long duration = System.currentTimeMillis() - startTime;
       System.out.printf("FAIL #%d - %dms - %s%n", counter, duration, e.getMessage());
   }
   // Pause one second between attempts
   Thread.sleep(1000);
}

This loop runs continuously, inserting and reading documents until the application is manually stopped. It is a temporary setup for experimentation, not production-ready code, but it is enough to confirm that the application keeps writing and reading data during a failover.

Triggering the test in Atlas

While the loop is running, open your MongoDB Atlas dashboard. Find the cluster you want to test, click the ⋮ (three dots) menu, and select Test Resilience. This will initiate the failover process and let you observe how your application behaves in real time.

MongoDB Test Primary Failover

Once activated, simply observe your application logs to check for any anomalies or unexpected failures during the process.

Application log

Throughout the entire process, your application should continue performing reads and writes normally. The goal is to ensure that your application handles replica set transitions correctly, without requiring restarts or manual intervention.

What to look for:

No unexpected exceptions in the logs.
Read and write operations keep working. They may slow down at some point, but they should recover automatically.
No need to restart connections or the app.

If you want to go one step further, implement basic error handling around your MongoDB operations and add retry logic to validate that the app stays resilient in edge cases.
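One way to approach that retry logic is a small generic helper that wraps any operation and tries it again after a short pause. The sketch below is a minimal version under stated assumptions: the `RetrySketch` class, the `withRetry` method, and its parameters are hypothetical names, the pause is fixed rather than exponential, and the failing operation in `main` only simulates a transient error instead of talking to a database.

```java
import java.util.concurrent.Callable;

public class RetrySketch {

    // Runs the operation, retrying up to maxAttempts times with a fixed pause in between.
    // Rethrows the last failure if every attempt fails.
    public static <T> T withRetry(Callable<T> operation, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is asked to stop
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated operation: fails twice, then succeeds
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated transient error");
            }
            return "ok";
        }, 5, 10);
        // prints "ok after 3 attempts"
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the load loop, a call such as `collection.insertOne(doc)` could be passed in as the operation. Keep in mind that blindly retrying a write that is not idempotent can insert the same document twice, so a helper like this fits reads and idempotent writes best.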

Tests like these help confirm that the application keeps operating as expected when a real failover happens.

Ready for production?

This experiment highlights the importance of simulating real-world conditions during development. Tools like the Real-Time Performance Panel, Query Profiler, and Performance Advisor aren’t just nice to have. They’re essential for building with confidence.

The main lesson is clear: It’s not enough for your application or queries to just “work.” They need to be observable and validated in environments that reflect production. Identifying issues early can prevent costly surprises later, making the investment in a stronger dev environment well worth it.

If you have any questions, feel free to leave them in the comments. The complete code is available on the mongo-developer GitHub.


Ricardo Mello

Author
