From Zero to Vector Hero – Locally!

September 18, 2025

In the previous issue, I explained how to run a local MongoDB Atlas cluster with the Atlas CLI, no cloud account required. If you missed it, read it here 👉 Run an Atlas cluster locally in minutes. Now let’s see how to use Vector Search in that local environment.

🕒 Reading time: 3-4 min

🧠 Explaining the Embedding Workflow

After launching your local MongoDB Atlas cluster and running the show dbs command in mongosh, you’ll see only the default system databases: admin, config, and local. These are used internally by MongoDB and contain no user data or vector embeddings at this point.

To understand how embeddings come into play, take a look at the diagram below. It illustrates how they are generated and stored in MongoDB together with application data.

Image by MongoDB 2024. Process of generating embeddings from data and using them for similarity search.

Raw data is processed by an embedding model, which produces a high-dimensional vector. This vector is stored in a MongoDB collection and used to support semantic queries via the $vectorSearch aggregation pipeline stage.

You can generate embeddings with a model from a provider such as OpenAI or Voyage AI (🔁 read 👉 How to Create Vector Embeddings), and store them along with any relevant metadata.

👉 If you want to better understand how Vector Search works in MongoDB, check out this article: Power your AI application with Vector Search

🔢 Loading embeddings into MongoDB

Alternatively, you can use mongorestore to load a sample MongoDB dataset that already contains pre-generated vector embeddings. First, make sure the MongoDB Database Tools are installed. Then, download the sample dataset with curl, as shown in the example below:

curl https://atlas-education.s3.amazonaws.com/sampledata.archive -o sampledata.archive

Find the connection string for your local MongoDB Atlas cluster with:

atlas deployments connect --connectWith connectionString

You will get a connection string similar to "mongodb://localhost:55015/?directConnection=true". Then, load the sample dataset using mongorestore and the connection string:

mongorestore --archive=sampledata.archive --uri "mongodb://localhost:55015/?directConnection=true"

After reconnecting to your local Atlas cluster, run show dbs to confirm that the new sample_mflix database has been added. It includes the embedded_movies collection with pre-generated vector embeddings from the MongoDB sample dataset.

🔎 Finding embeddings

To retrieve a document from the embedded_movies collection within this database, run the following command:

db.getSiblingDB("sample_mflix").embedded_movies.findOne()

This command queries the sample_mflix.embedded_movies namespace and returns a single document containing standard movie metadata such as title, cast, genres, and release date. It also includes one or more vector embeddings of the plot field, stored as BSON binary values built from Float32Array data. Here is a simplified example of the returned document:

{
  "_id": ObjectId("573a1392f29313caabcd9ca6"),
  "title": "Scarface",
  "plot": "An ambitious and near insanely violent gangster climbs the ladder of success...",
  "plot_embedding": Binary.fromFloat32Array(new Float32Array([
    -0.0155, -0.0342, 0.0152, -0.0426, -0.0208, 0.0263,
    // ... 1530 more values ...
  ])),
  "plot_embedding_voyage_3_large": Binary.fromFloat32Array(new Float32Array([
    -0.0300, 0.0311, -0.0156, -0.0366, 0.0248, 0.0085,
    // ... 2042 more values ...
  ]))
}

The example includes two different embeddings of the same plot: plot_embedding holds a 1536-dimensional vector generated with OpenAI’s text-embedding-ada-002 model, while plot_embedding_voyage_3_large holds a 2048-dimensional vector from Voyage AI’s voyage-3-large model.

These vectors enable semantic comparison. For example, they allow you to find movies with similar narrative content, tone, or themes, even if the descriptions don't share the same words.
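To make the idea of semantic comparison concrete, here is a minimal sketch in plain JavaScript that ranks a few documents by cosine similarity, the same measure chosen for the index later in this article. The titles, the 3-dimensional vectors, and the helper names cosineSimilarity and rankBySimilarity are invented for illustration; real embeddings in embedded_movies have 1536 or 2048 values, and MongoDB performs this comparison through its index rather than in application code.

```javascript
// Toy 3-dimensional "embeddings" (assumption: invented numbers, not real model output).
const movies = [
  { title: "Gangster drama", vector: [0.9, 0.1, 0.0] },
  { title: "Crime thriller", vector: [0.8, 0.3, 0.1] },
  { title: "Romantic comedy", vector: [0.0, 0.2, 0.9] },
];

// Cosine similarity: dot product divided by the product of the vector lengths.
// Note: a zero-length vector would produce NaN; this sketch does not guard against it.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every document against the query vector and sort by descending similarity.
function rankBySimilarity(queryVector, docs) {
  return docs
    .map((doc) => ({ title: doc.title, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical query vector that points in roughly the same direction as the crime movies.
const ranked = rankBySimilarity([1.0, 0.2, 0.0], movies);
console.log(ranked.map((r) => r.title));
```

The two crime-themed entries rank ahead of the comedy because their vectors point in nearly the same direction as the query, even though no text is compared at all.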

Now you just need to create a vector index on the embedding field, and you’ll be ready to perform semantic search. The $vectorSearch aggregation stage needs this index to run queries against the field.

🛠️ Creating a Vector Search index

Use the createSearchIndex method to define a vector index on the plot_embedding_voyage_3_large field. This enables fast similarity search over 2048-dimensional vectors.

db.getSiblingDB("sample_mflix").embedded_movies.createSearchIndex({
  name: "plot_embedding_voyage_index",
  definition: {
    mappings: {
      dynamic: false,
      fields: {
        plot_embedding_voyage_3_large: {
          type: "knnVector",
          dimensions: 2048,
          similarity: "cosine"
        }
      }
    }
  }
})

The plot_embedding_voyage_3_large field is indexed as a knnVector, a specialized field type designed for indexing high-dimensional numeric data. The cosine setting means that similarity is based on the angle between two vectors: the smaller the angle, the higher the similarity, regardless of their magnitude.
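For comparison, MongoDB's documentation also describes a dedicated vectorSearch index type and treats knnVector as a legacy field type. The sketch below builds such a definition as a plain JavaScript object so it can be inspected before use. The index name is hypothetical, and whether your local deployment accepts this form is an assumption worth verifying against the documentation for your version.

```javascript
// Hypothetical index name, chosen so it does not clash with the index created above.
const vectorIndexName = "plot_embedding_voyage_vector_index";

// A vectorSearch-type definition lists indexed fields in an array instead of using "mappings".
const vectorIndexDefinition = {
  fields: [
    {
      type: "vector",                        // marks this entry as a vector field
      path: "plot_embedding_voyage_3_large", // document field that holds the embedding
      numDimensions: 2048,                   // must match the length of the stored vectors
      similarity: "cosine",                  // same similarity function as in the article
    },
  ],
};

// In mongosh, the object could then be passed along these lines (not executed here):
// db.getSiblingDB("sample_mflix").embedded_movies.createSearchIndex(
//   vectorIndexName, "vectorSearch", vectorIndexDefinition
// );
console.log(vectorIndexName, JSON.stringify(vectorIndexDefinition, null, 2));
```

Keeping the definition in a variable makes it easy to reuse the same path and dimension count when checking query vectors later.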

To confirm the index exists, run:

db.getSiblingDB("sample_mflix").embedded_movies.getSearchIndexes()

You're now ready to run similarity queries against this field using the $vectorSearch stage.

★ The query must include a vector input with exactly 2048 float values to match the index dimensions. This vector must also be generated by the same embedding model used to produce the stored vectors, ensuring that semantic meaning is comparable. This allows MongoDB to compare the input vector with indexed vectors using cosine similarity.
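A small guard in application code can catch a dimension mismatch before a query is sent. The following is a minimal sketch, not part of any MongoDB API: validateQueryVector and INDEX_DIMENSIONS are hypothetical names, and the check cannot tell which embedding model produced the vector, so that part of the requirement still has to be ensured by your own pipeline.

```javascript
// Assumption: mirrors the "dimensions: 2048" value from the index definition above.
const INDEX_DIMENSIONS = 2048;

// Throws if the vector cannot match the index; returns the vector unchanged otherwise.
function validateQueryVector(vector, expectedDimensions = INDEX_DIMENSIONS) {
  if (!Array.isArray(vector) && !(vector instanceof Float32Array)) {
    throw new TypeError("Query vector must be an array or a Float32Array");
  }
  if (vector.length !== expectedDimensions) {
    throw new RangeError(
      `Expected ${expectedDimensions} values but received ${vector.length}`
    );
  }
  for (let i = 0; i < vector.length; i++) {
    if (!Number.isFinite(vector[i])) {
      throw new TypeError(`Value at position ${i} is not a finite number`);
    }
  }
  return vector;
}

// Placeholder vector of zeros with the right length; a real query would use model output.
const queryVector = new Float32Array(INDEX_DIMENSIONS);
validateQueryVector(queryVector);
console.log(`Query vector accepted with ${queryVector.length} dimensions`);
```

Passing a 1536-value vector from a different model to this helper raises a RangeError, which is easier to diagnose than an unexpected query result.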

📙 What’s Next

In the next episode, you’ll learn how to run similarity queries using the $vectorSearch stage. We’ll use the vector index you just created to search for documents with similar plot embeddings in your local Atlas environment.

📘 More tips like this

Want more hands-on examples, best practices, and deep dives into MongoDB 8.0 and the Atlas platform? Check out 👉 MongoDB in Action: Building on the Atlas Data Platform, published by Manning Publications Co.

Arek Borucki

Author

Senior Platform Engineer | Kubernetes | GCP | MongoDB
