The Anatomy of SQL Subqueries: A Data Scientist’s Guide

DataGrad
5 min readOct 22, 2023

--

In our previous blogs in this SQL series, we’ve covered SQL essentials and Advanced techniques for data scientists. Today, we’re going to deep dive into a topic that often intimidates beginners but is extremely powerful once understood — SQL Subqueries. Subqueries allow for more complex operations and can simplify your SQL statements, all while making them more efficient and maintainable.

The Anatomy of SQL Subqueries: A Data Scientist’s Guide

What you will learn:

What is a SQL Subquery?

When to Use Subqueries?

Correlated Subqueries

Common Pitfalls and How to Avoid Them

Subqueries vs Joins

Understanding the Dataset

Before we dig into the world of subqueries, let’s get familiar with the dataset we are using for this learning. This dataset contains cricket match data, including details like match ID, season, start date, venue, batting team, bowling team, and so on.

Here is a sample from our dataset:

Cricket Data Snapshot
Cricket Data Snapshot
Cricket Data Snapshot

Note: The table is truncated to show only a subset of columns and rows.

Section 1: What is a SQL Subquery?

A SQL Subquery, also known as a nested or inner query, is a query within another SQL query. Subqueries enable you to perform more complex tasks and answer multi-layered questions, all within a single SQL command. Let’s start by classifying subqueries into different types based on their use-cases and the kind of data they return.

Types of Subqueries

  1. Scalar Subqueries: Returns a single value.
  2. Row Subqueries: Returns a single row with multiple columns.
  3. Column Subqueries: Returns a single column but with multiple rows.
  4. Table Subqueries: Returns entire tables.

Scalar Subqueries

Suppose you want to find out who scored the most runs in a particular season from our sample cricket dataset. A scalar subquery can help you find this information in a single SQL command.

SELECT striker, MAX(runs)
FROM cricket_data
WHERE season = '2007/08'
AND runs =
(
SELECT MAX(runs)
FROM cricket_data
WHERE season = '2007/08'
);

Row Subqueries

Let’s say you want to find out which players scored more than the average runs in the ‘2007/08’ season.

SELECT striker, runs
FROM cricket_data
WHERE season = '2007/08'
AND runs >
(
SELECT AVG(runs)
FROM cricket_data
WHERE season = '2007/08'
);

Column Subqueries

What if you want to find out all the matches where players from ‘Kolkata Knight Riders’ participated?

SELECT match_id, season
FROM cricket_data
WHERE striker IN
(
SELECT striker
FROM cricket_data
WHERE batting_team = 'Kolkata Knight Riders'
);

Table Subqueries

If you want to find the average runs scored in matches involving ‘Kolkata Knight Riders’, you might use a table subquery in the FROM clause.

SELECT AVG(runs)
FROM
(
SELECT runs
FROM cricket_data
WHERE striker IN
(
SELECT striker
FROM cricket_data
WHERE batting_team = 'Kolkata Knight Riders'
)
) AS KKR_Matches;

Section 2: When to Use Subqueries?

In SQL, there are often multiple ways to achieve the same result. Subqueries are most effective when:

  1. You need to retrieve a single value for further calculations: Scalar subqueries are perfect for this.
  2. You are dealing with hierarchical or relational data: Subqueries can simplify complex joins.
  3. You want to perform calculations that involve multiple steps: Subqueries can break down the problem into smaller, more manageable tasks.

Example:

  • Finding the Top Scorer’s Average Runs in a Season

Here, we first find out who the top scorer in the ‘2007/08’ season is and then calculate their average runs. We’ll use a scalar subquery to find the top scorer and a main query to find their average runs.

SELECT striker, AVG(runs)
FROM cricket_data
WHERE season = '2007/08'
AND striker =
(
SELECT striker
FROM cricket_data
WHERE season = '2007/08'
GROUP BY striker
ORDER BY SUM(runs) DESC
LIMIT 1
);

Section 3: Correlated Subqueries

A correlated subquery is a subquery that uses values from the outer query. Essentially, the inner query depends on the outer query.

Example:

  • Finding Players Who Scored More Than Their Team’s Average

In this example, we want to find out the players who have scored more than the average runs scored by their team in the ‘2007/08’ season.

SELECT striker, runs, batting_team
FROM cricket_data AS outer_query
WHERE season = '2007/08'
AND runs >
(
SELECT AVG(runs)
FROM cricket_data AS inner_query
WHERE inner_query.batting_team = outer_query.batting_team
AND inner_query.season = '2007/08'
);

In this case, the inner query is correlated with the outer query, as it uses batting_team from the outer query to calculate the team's average runs.

Section 4: Common Pitfalls and How to Avoid Them

Subqueries can be powerful but also tricky. Here are some common pitfalls:

  1. Nesting too Deep: The deeper you nest your subqueries, the harder they are to read and maintain. Try to limit the nesting levels.
  2. Performance: Subqueries can sometimes lead to inefficient queries. Always test the performance of your queries.
  3. Debugging: Debugging nested subqueries can be difficult. It’s a good practice to test each subquery individually.

Example:

  • Debugging a Complex Subquery

If you encounter issues with a complex subquery, you can isolate each subquery and run it independently to verify its output. For instance, in the correlated subquery example above, you could first run the inner query to ensure it gives you the average runs for a specific team.

Section 5: Subqueries vs Joins

Subqueries and joins often serve similar purposes, but they aren’t always interchangeable. Here’s how to decide when to use which:

Use Subqueries When:

  • You need to perform a calculation before the main query.
  • You want to pull data from multiple tables that aren’t directly related.

Use Joins When:

  • You’re combining rows from different tables based on a related column.
  • You don’t need to perform calculations before the main query.

Example: Using a Join Instead of a Subquery

  • Here’s how you could find players who scored more than their team’s average runs in the ‘2007/08’ season using a JOIN:
SELECT o.striker, o.runs, o.batting_team
FROM cricket_data AS o
JOIN
(
SELECT batting_team, AVG(runs) AS team_avg
FROM cricket_data
WHERE season = '2007/08'
GROUP BY batting_team
) AS i
ON o.batting_team = i.batting_team
WHERE o.runs > i.team_avg
AND o.season = '2007/08';

Conclusion

Subqueries are an essential tool for data scientists who aim to write versatile, efficient, and clean SQL code. Mastering subqueries can significantly streamline your data manipulation and extraction tasks, making you more proficient in dealing with complex datasets like our cricket series example. Whether you’re calculating aggregates, filtering data, or dealing with hierarchical data, subqueries offer a robust way to simplify your SQL queries while also making them more powerful.

As always, keep querying and keep exploring!

#datascience #datascientist #SportsAnalytics #SQL

Please read the other blogs of this series:

1. SQL Essentials for Future Data Scientists,

2. Advanced SQL Techniques for Data Scientists: A Deep Dive

Let’s connect on Linkedin.com/in/jhakamal to collaborate on projects!

https://www.linkedin.com/in/jhakamal/
https://www.linkedin.com/in/jhakamal/

--

--

DataGrad
DataGrad

Written by DataGrad

A Technical Consultant, with expertise in Data Analysis and Visualization. Open to work!

No responses yet