JVector 1.0

October 02, 2023

JVector is a pure Java embedded vector search engine that powers DataStax Astra and is being added to Apache Cassandra.

Vector search is a critical part of today’s generative AI applications: it lets developers quickly retrieve the most relevant context, so that the large language model has enough information to answer accurately and without hallucinating. However, innovation in this space has mostly happened outside the Java ecosystem. JVector gives enterprises an easy way to capitalize on their investment in the powerful Java platform, and gives Java developers a state-of-the-art solution that is easy to embed in their applications.

JVector’s closest relative is Apache Lucene’s vector search. Lucene implements the HNSW vector search algorithm, which is known to be fast but memory-hungry. JVector is based on the more sophisticated DiskANN algorithm, which makes it over 10x faster than Lucene for large datasets, holding other things equal. One example is searching the Deep100M dataset (about 35GB of vectors and 20GB of index data) with Lucene and with JVector.

[Figure: Lucene vs. JVector search performance on the Deep100M dataset]

JVector is fast, memory-efficient, disk-aware, concurrent, easy to embed, and incremental.

Incremental means that you can start searching your JVector index immediately. There are no batches or microbatches or “commit” stages to wait for.
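To make the idea concrete, here is a minimal sketch of what "incremental" means from the caller's point of view. It is not JVector's API or its graph index: `IncrementalIndexSketch` is a hypothetical toy class that uses a brute-force scan, chosen only to show that a vector becomes searchable as soon as `add` returns.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for an incremental index: brute-force scan, NOT JVector's graph index. */
final class IncrementalIndexSketch {
    private final List<float[]> vectors = new ArrayList<>();

    /** Adds a vector and returns its ordinal; it is searchable as soon as this returns. */
    int add(float[] v) {
        vectors.add(v.clone());
        return vectors.size() - 1;
    }

    /** Returns the ordinal of the stored vector nearest to the query, or -1 if the index is empty. */
    int nearest(float[] query) {
        int best = -1;
        float bestDist = Float.POSITIVE_INFINITY;
        for (int i = 0; i < vectors.size(); i++) {
            float d = squaredDistance(query, vectors.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /** Squared Euclidean distance between two vectors of equal length. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        IncrementalIndexSketch index = new IncrementalIndexSketch();
        float[] query = {0.9f, 0.9f};

        index.add(new float[] {0f, 0f});
        System.out.println(index.nearest(query)); // prints 0: only one vector exists so far

        index.add(new float[] {1f, 1f});
        System.out.println(index.nearest(query)); // prints 1: the new vector is found with no commit step
    }
}
```

A real index replaces the linear scan with a graph traversal, but the contract the caller sees is the same: there is no separate flush or commit between adding and searching.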

Concurrent means that you can build and search a JVector index with multiple threads simultaneously. Doubling the number of threads adding vectors cuts build time in half, out to 32 threads.

[Figure: index build time vs. number of threads adding vectors; X and Y axes are both logarithmic]
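The shape of a multi-threaded build can be sketched with standard `java.util.concurrent` classes. This is an illustration under assumptions, not JVector's implementation: `ConcurrentBuildSketch` and its `build` helper are hypothetical names, and the "index" is a plain concurrent map searched by brute force. It only shows several threads adding vectors to one shared structure that stays safe to search while they do so.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy concurrent index: a thread-safe map plus brute-force search, NOT JVector's graph index. */
final class ConcurrentBuildSketch {
    private final Map<Integer, float[]> vectors = new ConcurrentHashMap<>();
    private final AtomicInteger nextOrdinal = new AtomicInteger();

    /** Safe to call from many threads at once; each vector gets a unique ordinal. */
    void add(float[] v) {
        vectors.put(nextOrdinal.getAndIncrement(), v.clone());
    }

    int size() {
        return vectors.size();
    }

    /** Safe to call while other threads are still adding; returns null if the index is empty. */
    float[] nearest(float[] query) {
        float[] best = null;
        float bestDist = Float.POSITIVE_INFINITY;
        for (float[] candidate : vectors.values()) {
            float d = 0f;
            for (int i = 0; i < query.length; i++) {
                float diff = query[i] - candidate[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = candidate;
            }
        }
        return best;
    }

    /** Adds every vector in data using a fixed pool of worker threads, then waits for completion. */
    static ConcurrentBuildSketch build(float[][] data, int threads) {
        ConcurrentBuildSketch index = new ConcurrentBuildSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (float[] v : data) {
            pool.execute(() -> index.add(v));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("build did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while building", e);
        }
        return index;
    }

    public static void main(String[] args) {
        float[][] data = new float[1000][];
        for (int i = 0; i < data.length; i++) {
            data[i] = new float[] {i, 2f * i};
        }
        ConcurrentBuildSketch index = build(data, 4);
        System.out.println(index.size()); // prints 1000
        System.out.println(Arrays.toString(index.nearest(new float[] {10.1f, 20.1f})));
    }
}
```

Ordinals are assigned in whatever order the worker threads run, so the sketch returns the matching vector rather than its ordinal.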

JVector is designed to be straightforward to embed while preserving high performance. Here is the code to compute the index for the SIFT dataset shown above. In under 100 lines, it:

Computes product quantization for the vectors (a kind of compression)
Loads the vectors into the index, in parallel
Saves the index to disk
Conducts searches in parallel, against both in-memory and on-disk indexes
Computes recall vs ground truth and reports performance numbers
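The last step in the list, recall against ground truth, is simple enough to sketch on its own. The class below is a hypothetical helper, not code from JVector: it assumes the ground truth and the search results are given as arrays of vector ordinals per query, and that each result list contains no duplicates.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of recall@k: the fraction of true top-k neighbors that a search actually returned. */
final class RecallSketch {
    /**
     * groundTruth[q] holds the exact nearest neighbors of query q, best first (at least k entries).
     * results[q] holds the ordinals returned by the approximate search (assumed distinct).
     */
    static double recall(int[][] groundTruth, int[][] results, int k) {
        long hits = 0;
        for (int q = 0; q < groundTruth.length; q++) {
            Set<Integer> truth = new HashSet<>();
            for (int i = 0; i < k; i++) {
                truth.add(groundTruth[q][i]);
            }
            int limit = Math.min(k, results[q].length);
            for (int i = 0; i < limit; i++) {
                if (truth.contains(results[q][i])) {
                    hits++;
                }
            }
        }
        return (double) hits / ((long) groundTruth.length * k);
    }

    public static void main(String[] args) {
        int[][] truth = {{1, 2, 3}, {4, 5, 6}};
        int[][] found = {{1, 3, 9}, {6, 5, 4}};
        // Query 0 recovers 2 of 3 true neighbors, query 1 recovers 3 of 3: 5 hits out of 6.
        System.out.println(recall(truth, found, 3));
    }
}
```

Order within the top k does not matter for this measure; only membership does, which is why the second query scores a full 3 of 3.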

JVector runs on JDK 11+ and takes advantage of Panama SIMD acceleration on JDK 20+. JVector is available under the Apache License 2.0.

Try it out today and let us know what you think!


Jonathan Ellis

Author

Jonathan is the founder of Brokk (https://brokk.ai). Brokk keeps LLMs on-task in million-line codebases by adding compiler-grade understanding of your code's structure and semantics. Jonathan is also the author of JVector, co-founder of DataStax, and the founding project chair of Apache Cassandra.
