SQL Best Practices Every Java Engineer Must Know

October 01, 2024
11785 Unique Views
8 min read

Table of Contents

2. Avoid Using SELECT *
3. Use Proper Joins
4. Use WHERE Clauses to Filter Data
5. Limit the Number of Rows Returned
6. Use EXISTS Instead of IN
7. Avoid Functions in WHERE Clauses
8. Use JOINs Instead of Subqueries
9. Optimize Group By and Order By Clauses
10. Use Appropriate Data Types
11. Analyze Query Execution Plans
12. Use Connection Pooling
13. Use Batch Processing
14. Optimize Joins
15. Optimize Subqueries
16. Optimize Aggregations
17. Use Summary Columns
18. Use Materialized Views
19. Monitor and Tune Database Settings
20. Regularly Review and Refactor SQL Code
Summary

In the world of software development, best practices for SQL optimization are a critical skill that every Java engineer must master. Efficient database interactions can significantly enhance the performance of your applications, leading to faster response times and a better user experience.

This comprehensive guide dives into the best practices for SQL query optimization, tailored specifically for Java engineers. From understanding the importance of indexes to mastering joins and leveraging connection pooling, this article covers all the essential techniques you need to write efficient and performant SQL queries and best practices for SQL.

<h3>1. Use Indexes

Indexes can significantly improve query performance by allowing the database to quickly locate and access the data.

Tips:

Create indexes on columns used in WHERE, JOIN, ORDER BY, and GROUP BY clauses.
Use covering indexes to include all columns needed by a query.

⛔ Avoid Practice:

SELECT * FROM users WHERE email = '[email protected]';

🟢 Good Practice:

CREATE INDEX idx_users_email ON users (email);
SELECT name, email FROM users WHERE email = '[email protected]';

This creates an index on the email column of the users table, speeding up searches based on email.

Leverage Function-Based Indexes

Function-based indexes can significantly improve query performance when you frequently search or sort by the result of a function or expression.

Tips:

Create function-based indexes for frequently used expressions in WHERE, ORDER BY, or JOIN conditions.
Use function-based indexes to optimize queries that involve case-insensitive searches or date/time manipulations.

⛔ Avoid Practice:

-- no function-based index applied.
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';

🟢 Good Practice:

CREATE INDEX idx_upper_last_name ON employees (UPPER(last_name));
SELECT * FROM employees WHERE UPPER(last_name) = 'SMITH';

This creates a function-based index on the uppercase version of the last_name column, speeding up case-insensitive searches.

In PostgreSQL, these are called expression indexes. Here's an example:

CREATE INDEX idx_lower_email ON users (LOWER(email)); 
SELECT * FROM users WHERE LOWER(email) = '[email protected]';

This creates an expression index on the lowercase version of the email column, optimizing case-insensitive email searches.

Function-based or expression indexes are handy when:

You frequently search on transformed column values (e.g., UPPER, LOWER, substring operations).
It would help if you index computed values or expressions.
You want to optimize queries involving date/time manipulations.

Remember that function-based indexes can significantly improve query performance but also increase storage requirements and slow down data modification operations. Use them judiciously based on your specific query patterns and performance needs.

2. Avoid Using SELECT *

Using SELECT * retrieves all columns from the table, which can be inefficient and lead to unnecessary data transfer.

Tips:

Specify only the columns you need in your SELECT statement.

⛔ Avoid Practice:

SELECT * FROM users;

🟢 Good Practice:

SELECT name, email FROM users;

This query retrieves only the name and email columns, reducing the amount of data transferred.

3. Use Proper Joins

Improper joins can lead to performance issues. Use the correct type of join for your query.

Tips:

Use INNER JOIN for matching rows in both tables.
Use LEFT JOIN to include all rows from the left table and matching rows from the right table.

⛔ Avoid Practice:

SELECT u.name, o.order_date
FROM users u, orders o
WHERE u.id = o.user_id;

🟢 Good Practice:

SELECT u.name, o.order_date
FROM users u
JOIN orders o ON u.id = o.user_id;

This query uses an INNER JOIN to combine data from the users and orders tables.

4. Use WHERE Clauses to Filter Data

Filtering data as early as possible in your query can help reduce the amount of data processed.

Tips:

Use WHERE clauses to filter data efficiently.

⛔ Avoid Practice:

SELECT name, email FROM users;

🟢 Good Practice:

SELECT name, email FROM users WHERE active = true;

This query retrieves only active users, reducing the amount of data processed.

5. Limit the Number of Rows Returned

When you don't need all rows, use the LIMIT clause to restrict the number of rows returned.

Tips:

Use the LIMIT clause to fetch a subset of rows.

⛔ Avoid Practice:

SELECT name, email FROM users WHERE active = true;

🟢 Good Practice:

SELECT name, email FROM users WHERE active = true LIMIT 10;

This query retrieves the first 10 active users, reducing the amount of data processed and transferred.

6. Use EXISTS Instead of IN

Using EXISTS can be more efficient than using IN, especially for large datasets.

Tips:

Use EXISTS for subqueries to check for the existence of rows.

7. Avoid Functions in WHERE Clauses

Using functions in `WHERE` clauses can prevent the use of indexes, leading to slower queries.

Tips:

Avoid using functions on indexed columns in `WHERE` clauses.

⛔ Avoid Practice:

SELECT name, email FROM users WHERE DATE_PART('year', created_at) = 2023;

🟢 Good Practice:

SELECT name, email FROM users WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01';

This query filters on the `created_at` column without using a function, allowing the use of an index.

8. Use JOINs Instead of Subqueries

JOINs are often more efficient than subqueries, especially for large datasets.

Tips:

Use `JOIN` instead of subqueries when possible.

⛔ Avoid Practice:

SELECT name, (
  -- Subquery to get order date for each user
  SELECT order_date 
  FROM orders 
  WHERE user_id = users.id
) AS order_date 
FROM users;

🟢 Good Practice:

SELECT u.name, o.order_date FROM users u JOIN orders o ON u.id = o.user_id;

This query uses a `JOIN` instead of a subquery, improving performance.

9. Optimize Group By and Order By Clauses

Using `GROUP BY` and `ORDER BY` clauses can be resource-intensive. Optimize them to improve performance.

Tips:

Use indexes on columns used in `GROUP BY` and `ORDER BY` clauses.
Reduce the number of columns specified in these clauses.

⛔ Avoid Practice:

SELECT user_id, COUNT(*), MAX(order_date) FROM orders GROUP BY user_id, order_date ORDER BY order_date;

🟢 Good Practice:

SELECT user_id, COUNT(*) FROM orders GROUP BY user_id ORDER BY user_id;

This query groups and orders by indexed columns, improving performance.

10. Use Appropriate Data Types

Choosing the correct data types for your columns can significantly impact performance and storage efficiency.

Tips:

Use appropriate data types for your columns.
Avoid using `TEXT` or `BLOB` unless necessary.

⛔ Avoid Practice:

-- Using TEXT for name and email which may be inefficient
 CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name TEXT,
  email TEXT,
  created_at TIMESTAMP
);

🟢 Good Practice:

-- Using more appropriate data types for better performance and storage efficiency


 CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR 100,
  email VARCHAR 100,
  created_at TIMESTAMP
);

This schema uses appropriate data types, improving performance and storage efficiency.

11. Analyze Query Execution Plans

Use tools like `EXPLAIN` to analyze your query execution plans and identify performance issues.

Tips:

Use `EXPLAIN` to understand how your queries are executed.
Identify and optimize slow parts of your queries.

⛔ Avoid Practice:

SELECT name, email FROM users WHERE active = true;

🟢 Good Practice:

EXPLAIN SELECT name, email FROM users WHERE active = true;

This command provides an execution plan for the query, helping identify potential performance issues.

12. Use Connection Pooling

For Java applications, using connection pooling can reduce the overhead of establishing database connections and improve performance.

Tips:

Use a connection pooling library like HikariCP or Apache DBCP.
Configure the pool size based on your application's needs and the database's capabilities.

⛔ Avoid Practice:

Connection conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/mydatabase", "user", "password");
// Use connection here
conn.close();

🟢 Good Practice:

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://localhost:3306/mydatabase");
config.setUsername("user");
config.setPassword("password");
config.setMaximumPoolSize(10);
HikariDataSource dataSource = new HikariDataSource(config);

This sets up a connection pool with a maximum of 10 connections, reducing connection overhead.

13. Use Batch Processing

Using batch processing can significantly improve performance when performing multiple insert, update, or delete operations.

Tips:

Batch inserts/updates to reduce database round-trips.
Use prepared statements for batch operations.

⛔ Avoid Practice:

Connection conn = dataSource.getConnection();
Statement stmt = conn.createStatement();
for (User user : userList) {
  stmt.executeUpdate("INSERT INTO users (name, email) VALUES ('" + user.getName() + "', '" + user.getEmail() + "')");
}
stmt.close();
conn.close();

🟢 Good Practice:

Connection conn = dataSource.getConnection(); 
PreparedStatement pstmt = conn.prepareStatement("INSERT INTO users (name, email) VALUES (?, ?)"); 
for (User user : userList) { 
   pstmt.setString(1, 
   user.getName()); 
   pstmt.setString(2, 
   user.getEmail()); 
   pstmt.addBatch(); 
} 
pstmt.executeBatch(); 
pstmt.close(); 
conn.close();

This Java code uses batch processing to insert multiple users efficiently.

14. Optimize Joins

Properly optimizing joins can significantly impact query performance, especially for large datasets.

Tips:

Ensure that columns used in join conditions are indexed.
Start with the smallest table when joining multiple tables.

⛔ Avoid Practice:

SELECT u.name, o.order_date FROM orders o JOIN users u ON u.id = o.user_id WHERE u.active = true;

🟢 Good Practice:

SELECT u.name, o.order_date FROM users u JOIN orders o ON u.id = o.user_id WHERE u.active = true;

This query joins the `users` table with the `orders` table on an indexed column, improving performance.

15. Optimize Subqueries

Subqueries can often be replaced with joins or other more efficient query constructs.

Tips:

Use joins instead of subqueries whenever possible.
Use Common Table Expressions (CTEs) for complex queries to improve readability and sometimes performance.

⛔ Avoid Practice:

SELECT name, email FROM users WHERE id 
IN SELECT user_id FROM orders WHERE order_date > '2023-01-01';

🟢 Good Practice:

WITH RecentOrders AS 
SELECT user_id FROM orders WHERE order_date > '2023-01-01'

SELECT u.name, u.email FROM users u JOIN RecentOrders ro ON u.id = ro.user_id;

This query uses a CTE to improve readability and performance.

16. Optimize Aggregations

When performing aggregation queries, use efficient techniques to minimize the computational load. Tips:

Ensure columns used in `GROUP BY` clauses are indexed.
Consider using summary tables for frequently aggregated data.

⛔ Avoid Practice:

SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_amount FROM orders GROUP BY user_id, order_date;

🟢 Good Practice:

SELECT user_id, COUNT(*) AS order_count FROM orders GROUP BY user_id;

This query is grouped by the `user_id` column, which should be indexed for optimal performance.

17. Use Summary Columns

Summary columns store pre-computed aggregate values, reducing the need for expensive computations during query execution. Tips:

Use summary columns to store frequently queried aggregate data.

⛔ Avoid Practice:

SELECT user_id, SUM(amount) AS total_amount 
 FROM orders 
GROUP BY user_id;

🟢 Good Practice:

ALTER TABLE users ADD total_order_amount DECIMAL(10, 2);
UPDATE users u SET total_order_amount = (SELECT SUM(amount) FROM orders o WHERE o.user_id = u.id);

This approach adds a summary column to store the total order amount for each user.

18. Use Materialized Views

Materialized views cache the results of complex queries, improving performance for read-heavy operations. Tips:

Use materialized views to store the results of expensive queries.

⛔ Avoid Practice:

SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
 FROM orders
GROUP BY user_id;

🟢 Good Practice:

CREATE MATERIALIZED VIEW user_order_summary AS
SELECT user_id, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM orders
GROUP BY user_id;

This creates a materialized view that stores the pre-computed summary of user orders.

19. Monitor and Tune Database Settings

Regularly monitor and tune your database settings to ensure optimal performance. Tips:

Adjust memory settings like buffer pool size and cache size based on your workload.
Use tools like `EXPLAIN`, `ANALYZE`, and database-specific monitoring tools to identify and address performance bottlenecks.

⛔ Avoid Practice:

-- Default settings not tuned for workload

🟢 Good Practice:

-- according to PostgreSQL, adjusting the shared_buffers:
-- If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system.


ALTER SYSTEM SET shared_buffers = '2GB';

This command adjusts the buffer pool size in PostgreSQL, which can improve performance for read-heavy workloads.

Refer to the docs: https://www.postgresql.org/docs/9.1/runtime-config-resource.html

20. Regularly Review and Refactor SQL Code

Regularly reviewing and refactoring your SQL code can help identify and address performance issues. Tips:

Conduct regular code reviews to ensure SQL queries are optimized.
Break down complex queries into simpler, more efficient parts.

⛔ Avoid Practice:

-- Original complex query
SELECT u.name,
 (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
FROM users u;

🟢 Good Practice:

-- Refactored for better performance
SELECT u.name, COUNT(o.id) AS order_count 
 FROM users u 
 LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name;

The refactored query joins `users` and `orders` and uses a `GROUP BY` clause, improving performance.

For more best practices guide download my free book here: https://abosaadmuaath.activehosted.com/f/1

Summary

Optimizing SQL queries is essential for modern software development, particularly for Java engineers who work with relational databases. By adhering to best practices such as utilizing indexes, avoiding SELECT *, mastering joins, and optimizing WHERE clauses, you can guarantee efficient queries, shorten load times, and enhance overall application performance. In addition, incorporating connection pooling and selecting the appropriate data types can significantly improve database interactions. Consistently reviewing and refining SQL code, as well as analyzing execution plans is key to maintaining optimal queries and upholding high-performance standards.

Don’t Forget to Share This Post!

Muaath Bin Ali

Author

Muaath Ali is a principal software engineer with 15+ years of Java experience. He runs MezoCode blog, helping Java developers design flexible and robust systems.

Foojay Podcast #75: JCON Report, Part 4 – Tips and Tricks for Java Devs

Testing an OpenRewrite Recipe

Creating Scalable OpenAI GPT Applications in Java

Data Modeling for Java Developers: Structuring With PostgreSQL and MongoDB

Clean and Modular Java: A Hexagonal Architecture Approach

Building a Real-Time AI Fraud Detection System with Spring Kafka and MongoDB

Dissection of Joeffice: Open Source Office Suite in Java

Prime Time: The High Performance Java Event

Project Panama for Newbies (Part 1)

How I Improved Zero-Shot Classification in Deep Java Library (DJL) OSS

foojay: A Place for Friends of OpenJDK

Dashboard for OpenJDK Update Release Details

JDK14: New Features and Enhancements

Fun with Flags: My Top 10 Resources for JVM Flags

Performance of Modern Java on Data-Heavy Workloads: Real-Time Streaming

Performance of Modern Java on Data-Heavy Workloads: Batch Processing

How does Java handle different Images and ColorSpaces – Part 1

How does Java handle different Images and ColorSpaces – Part 2

How does Java handle different Images and ColorSpaces – Part 3

How does Java handle different Images and ColorSpaces – Part 4

Indexing all of Wikipedia, on a laptop

Working with Multiple Carets in IntelliJ IDEA

Clean Shutdown of Spring Boot Applications

Java 17 on the Raspberry Pi

How to Create Mobile Apps with JavaFX (Part 1)

Project Panama for Newbies (Part 1)

Foojay Slack: bit.ly/join-foojay-slack

Beginning JavaFX Applications with IntelliJ IDE

SpringBoot 3.2 + CRaC

Debugging Java on the Command Line

Apache Kafka Performance on Azul Platform Prime vs Vanilla OpenJDK

Learn about a number of experiments that have been conducted with Apache Kafka performance on Azul Platform Prime, compared to vanilla OpenJDK. Roughly 40% improvements in performance, both throughput and latency, are achieved.

Stable, Secure, and Affordable Java

Azul Platform Core is the #1 Oracle Java alternative, offering OpenJDK support for more versions (including Java 6 & 7) and more configurations for the greatest business value and lowest TCO.

Comments (4)

alex

9 months ago

DImension tables are much better for date filtering (especially years). That is, have a table that has every date in it + columns that have the "calendar year" "calendar month" "calendar quarter" etc. So if you're filtering on a year, you don't do > or < you do an Inner Join on the lu.DimDate table on [created_at] = [dimDate.date] and calendarYear = '2023' Inner Join filtering is significantly faster than "where" (order of operations for the query engine, you're significantly reducing the dataset earlier in the process) And, the = on the "calendarYear" is significantly faster than ranges

Bartek

In my opinion, the advice to use joins instead of subqueries whenever possible is not good. I had many cases where a query with many joins was much worse than a query with subqueries. A good piece of advice would be: always check your query with explain, because sometimes a query with subqueries can be much more efficient than a query with joins.

Juan Farizo

I agree with @Bartek, not always is the best to avoid subqueries instead of joins. If we are talking about performance, the premise has to be measuring the throughput and the resources used on the sql server side.

Cristian

8 months ago

The advice about using VARCHAR 100 instead of TEXT is bad. You should reverse the advice to recommend using TEXT instead of VARCHAR, always, at least for Postgres: * VARCHAR and TEXT have the same exact performance in Postgres * VARCHAR 100 will be slightly **less** performant than TEXT, since it needs to enforce the maximum length * adding artificial maximum lengths is very likely to cause application bugs. Your example has 2 bugs: (1) you don't allow people with full names more than 100 characters, but there are such persons in the world - are they not allowed to use your application? (2) according to RFC3696, emails are allowed to be have up to 320 characters, but you're limiting it to 100.

Jakarta EE 11: Beyond the Era of Java EE

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Stable, Secure, and Affordable Java

Do you want your ad here?

SQL Best Practices Every Java Engineer Must Know

Leverage Function-Based Indexes

2. Avoid Using SELECT *

3. Use Proper Joins

4. Use WHERE Clauses to Filter Data

5. Limit the Number of Rows Returned

6. Use EXISTS Instead of IN

7. Avoid Functions in WHERE Clauses

8. Use JOINs Instead of Subqueries

9. Optimize Group By and Order By Clauses

10. Use Appropriate Data Types

11. Analyze Query Execution Plans

12. Use Connection Pooling

13. Use Batch Processing

14. Optimize Joins

15. Optimize Subqueries

16. Optimize Aggregations

17. Use Summary Columns

18. Use Materialized Views

19. Monitor and Tune Database Settings

20. Regularly Review and Refactor SQL Code

Summary

Muaath Bin Ali

Muaath Bin Ali

Thanks to our Sponsors!

Azul

Redis

CodeRabbit

Reo

Zencoder

Payara

Digma

adesso

Trending

Apache Kafka Performance on Azul Platform Prime vs Vanilla OpenJDK

Stable, Secure, and Affordable Java

Step up your coding with the Continuous Feedback Udemy Course: Additional coupons are available

Jakarta EE 11: Beyond the Era of Java EE

Stable, Secure, and Affordable Java

Do you want your ad here?

Related Articles

Comments (4)

alex

Bartek

Juan Farizo

Cristian

Set Event Reminder

Subscribe to foojay updates:

Share with