Duplicate Finder for Documentation

May 08, 2024

This post is about the development of the duplicate finder tool. For downloads and usage instructions, see the Download page.

Anyone who has worked on technical documentation in a large team is familiar with the content duplication issue. Even with the best tools and practices at hand, duplication is fundamentally difficult to overcome.

As a project grows, duplicated content starts to appear. This is especially true for large projects that cover many similar products or features.

Good:

define once:

<p> 
    If you encounter any issues, refer to the troubleshooting guide
    or contact support. 
</p>

reuse elsewhere:

<TroubleshootingNote/>

Bad:

<p>
    If you encounter any issues, refer to the troubleshooting
    guide or contact support.
</p>

<!-- same meaning, slightly different wording -->
<p>
    In case of problems, consult the troubleshooting guide
    or contact support
</p>

The idea that advocates against duplication is commonly known as the DRY (Don't Repeat Yourself) principle. Though it is primarily associated with programming, the same property is highly valued in documentation.

Project intro

Modern authoring tools typically have features for content reuse, making technical constraints less of a concern. The real problem, on the other hand, lies in spotting duplicates. Before you extract something to a reusable chunk, you need to know what to extract.

If you are a programmer, your IDE might highlight duplicate code for you:

IntelliJ IDEA highlights duplicated code

Unfortunately, the same feature is not suitable for documentation, because it relies on comparing abstract syntax trees (ASTs). Natural-language text doesn't parse into such a tree the way code does, so this approach doesn't work well with text.

One of my ongoing projects is to implement a duplicate finder for documentation. The tool will be capable of quickly finding non-exact, or fuzzy, matches, such as the bad example above.
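To give a feel for what a fuzzy match means here, below is a minimal sketch that scores two passages by the overlap of their word pairs (Jaccard similarity over word shingles). This is a generic textbook technique chosen for illustration, not the algorithm used in the tool; the function names `shingles` and `jaccard` and the sample strings are assumptions made for this example.

```python
import re


def shingles(text: str, size: int = 2) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in the text."""
    # Lowercase and keep only alphanumeric runs, so punctuation and
    # capitalization don't affect the comparison.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 2) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# The two paragraphs from the "Bad" example, plus an unrelated sentence
# (the unrelated sentence is made up for this illustration).
first = "If you encounter any issues, refer to the troubleshooting guide or contact support."
second = "In case of problems, consult the troubleshooting guide or contact support"
unrelated = "Install the plugin from the marketplace and restart the editor."

print(jaccard(first, second))     # near-duplicates share several word pairs
print(jaccard(first, unrelated))  # unrelated text shares none
```

The near-duplicate pair scores well above the unrelated pair even though the wording differs, which is the kind of signal an exact-match search would miss. Comparing every pair of passages this way would be too slow for a large project, so a practical tool would need something smarter on top of it.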

Current status

As of this writing, the project is a work in progress, but there is already a working prototype:

The UI of the duplicate finder tool prototype showing several detected duplicates in a dummy project

The algorithm takes under 30 seconds to analyze a project with ~6k source files on my M1 MacBook Pro, and I'm planning to improve it so that it highlights duplicates instantly, right as you type in the editor.

The prototype has already helped me and my colleagues find a lot of duplicates in real projects, so I'm quite enthusiastic about the results and future improvements.

What's next

In the following posts, I will lay out the algorithm step-by-step and perform benchmarks to evaluate its performance. If you are into programming, you are welcome to code along.

Alternatively, you can keep an eye on the progress and use the final deliverable when the project is complete. Once finished, this feature will be available in Writerside, a great authoring tool made by my colleagues.

I hope that the project description resonates with you, and that you'll find the walkthrough useful. You won't miss the future chapters of this series if you regularly check out Foojay, but it's still a good idea to subscribe to my blog and follow me on Twitter.

See you in the next posts!

Igor Kulakov

Author

Technical writer at JetBrains, hobbyist developer, author at flounder.dev
