Advanced SQL Techniques for Data Scientists: A Deep Dive

DataGrad
12 min read · Oct 21, 2023


Welcome back to our data science series! In the previous installment (SQL Essentials for Future Data Scientists), we laid the groundwork by exploring SQL essentials, the building blocks for anyone stepping into the world of data science. As you may recall, we used a relatable cricket series dataset to demonstrate how SQL serves as the linchpin in data extraction, manipulation, and analysis.

As we move forward, you’ll realize that the foundational skills you’ve acquired are just the beginning. Data science is not merely about fetching rows or making simple aggregations; it’s about diving deep into the data to extract meaningful insights, make predictions, and drive decision-making. In that context, SQL offers a plethora of advanced techniques that can drastically simplify and optimize these intricate tasks.

Dealing with large datasets is a common scenario in data science, especially with the increasing prevalence of big data. In such environments, every millisecond counts, and optimizing your queries can translate to significant resource savings. This is where advanced SQL techniques come into play. They provide the means to write more efficient queries, allowing you to handle more complex data manipulation and analysis tasks without straining your computational resources.

Today, we’ll embark on a journey to uncover these advanced SQL techniques. From the power of subqueries to the efficiency of indexing, from the elegance of Common Table Expressions to the robustness of stored procedures and triggers, we’ll explore how these advanced features can elevate your SQL game.

So, take a deep breath as we dive deeper into the SQL ocean, revealing its hidden capabilities and showing how you can leverage them to become a more proficient data scientist.


What You’ll Learn

Subqueries

Window Functions

Common Table Expressions (CTEs)

Indexing

Stored Procedures and Triggers

Optimization Tips for Complex Queries

Section 1: Subqueries

Subqueries, often termed inner queries or nested queries, let you answer multiple questions in a single SQL command. Essentially, a subquery is a query nested inside another SQL query. They are particularly useful for running multiple operations in a single statement, making your data manipulation and extraction tasks more efficient.
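Before we get to the types, here is a minimal sketch of the cricket schema that the examples in this post assume. The table and column names are inferred from the queries below, so treat them as assumptions and adapt them to your own copy of the dataset:

-- Minimal assumed schema (names inferred from the queries in this post)
CREATE TABLE Players (
    player_id   INT PRIMARY KEY,
    player_name VARCHAR(100),
    team        VARCHAR(50)
);

CREATE TABLE Matches (
    match_id    INT,
    series_id   INT,
    player_id   INT,
    runs_scored INT
);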

Types of Subqueries

  1. Scalar Subqueries: These return a single value and can appear almost anywhere a literal can, most commonly in SELECT and WHERE clauses.
  2. Row Subqueries: These return a single row of multiple columns and are compared against row constructors, e.g. WHERE (col1, col2) = (SELECT x, y ...).
  3. Column Subqueries: These return a single column of multiple rows and are typically used in the WHERE clause with operators like IN and NOT IN.
  4. Table Subqueries: These return a full result set (multiple rows and columns) and are generally used in the FROM clause as derived tables. EXISTS and NOT EXISTS, by contrast, pair naturally with correlated subqueries of any shape.

Scalar Subqueries: Finding the Highest Scorer in a Series

Imagine you want to find out who scored the most runs in the ‘Ashes 2023’ series from our sample cricket series dataset. A scalar subquery can help you find this information in a single SQL command.

Here’s how:

SELECT player_id, runs_scored
FROM Matches
WHERE series_id = 1 AND runs_scored = (
    SELECT MAX(runs_scored)
    FROM Matches
    WHERE series_id = 1
);

Row Subqueries: Finding Players Who Scored More Than Average in ‘Ind vs Eng 2023’

Suppose you want to find which players scored more than the series average in ‘Ind vs Eng 2023’. You can compare each row against a subquery. (Strictly speaking, the subquery below is another scalar one, since AVG() returns a single value; a true row subquery compares several columns at once, as sketched after this example.)

SELECT player_id, runs_scored
FROM Matches
WHERE series_id = 2 AND runs_scored > (
    SELECT AVG(runs_scored)
    FROM Matches
    WHERE series_id = 2
);
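For contrast, here is what a true row subquery looks like. A sketch (row constructors of this form are supported in MySQL and PostgreSQL) that finds the row matching the (series, top score) pair for the ‘Ind vs Eng 2023’ series:

SELECT match_id, player_id, runs_scored
FROM Matches
WHERE (series_id, runs_scored) = (
    -- Returns one row of two columns: the series and its top score
    SELECT series_id, MAX(runs_scored)
    FROM Matches
    WHERE series_id = 2
    GROUP BY series_id
);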

Column Subqueries: Finding Matches With Players from a Specific Team

If you want to find the matches in which players from ‘India’ participated, you can use a column subquery.

SELECT match_id, series_id
FROM Matches
WHERE player_id IN (
    SELECT player_id
    FROM Players
    WHERE team = 'India'
);

Table Subqueries: Aggregating Data from a Filtered Set of Rows

If you want to find the average runs scored in matches involving ‘Australia,’ you might use a table subquery in the FROM clause.

SELECT AVG(runs_scored)
FROM (
    SELECT runs_scored
    FROM Matches
    WHERE player_id IN (
        SELECT player_id
        FROM Players
        WHERE team = 'Australia'
    )
) AS AustralianMatches;

Subqueries offer a robust mechanism to conduct complex queries in a simpler and more organized manner. Whether you’re running statistical analyses or just looking to filter data in a particular way, understanding how to use subqueries effectively will significantly enhance your SQL toolkit.

By using subqueries, you can answer complex questions using our cricket series dataset that otherwise would have required multiple SQL queries or additional data manipulation steps in Python or R. Mastering subqueries is, therefore, a crucial skill for data scientists who aim to write efficient and clean SQL code.

Section 2: Window Functions

Window functions are a family of SQL operations that perform a calculation across a specific range of rows that are somehow related to the current row within the result set. Unlike regular SQL functions that treat each row independently, window functions have access to more than just the current row, allowing for more complex and flexible calculations. They are often used for tasks that require comparative data analysis, such as ranking or running totals.

Types of Window Functions

There are several types of window functions, and we can broadly categorize them as follows:

  1. Ranking Functions: These functions provide a ranking to each row within a result set. Examples include RANK(), DENSE_RANK(), and ROW_NUMBER().
  2. Aggregate Functions: These are similar to standard SQL aggregate functions like SUM(), AVG(), and COUNT(), but used as window functions.
  3. Navigational Functions: These functions are used to fetch data from a different row than the current row without changing the layout of the result set. Examples include FIRST_VALUE(), LAST_VALUE(), LEAD(), and LAG().

Ranking Functions: Identifying Top Scorers in ‘Ashes 2023’

Let’s say you want to rank players based on the runs they scored in the ‘Ashes 2023’ series. You can use the RANK() function to do this.

SELECT player_id, runs_scored,
    RANK() OVER (ORDER BY runs_scored DESC) AS ranking
FROM Matches
WHERE series_id = 1;

Aggregate Functions: Calculating Running Average in ‘Ind vs Eng 2023’

If you’re interested in knowing the running average of runs scored in the ‘Ind vs Eng 2023’ series, you can use the AVG() function as a window function.

SELECT match_id, runs_scored,
    AVG(runs_scored) OVER (ORDER BY match_id) AS running_avg
FROM Matches
WHERE series_id = 2;
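Both examples above compute a single ordering over the whole result set. Often you’ll want the calculation to restart for each group, which is exactly what PARTITION BY does. Here’s a sketch (window functions require MySQL 8.0+ or any recent PostgreSQL) that computes a separate running average for each player in the series:

SELECT player_id, match_id, runs_scored,
    -- The average restarts for each player thanks to PARTITION BY
    AVG(runs_scored) OVER (
        PARTITION BY player_id
        ORDER BY match_id
    ) AS player_running_avg
FROM Matches
WHERE series_id = 2;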

Navigational Functions: Finding Previous and Next Match Performance

Navigational functions like LAG() and LEAD() are useful for comparing a player's performance in a match with their previous or next one.

SELECT match_id, player_id, runs_scored,
    LAG(runs_scored) OVER (ORDER BY match_id) AS prev_run,
    LEAD(runs_scored) OVER (ORDER BY match_id) AS next_run
FROM Matches
WHERE player_id = 1;
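Window functions also accept an explicit frame clause that restricts the calculation to a sliding window of rows. As a sketch, here’s a three-match moving average for a single player:

SELECT match_id, runs_scored,
    -- The frame clause limits the average to the current row
    -- and the two rows before it
    AVG(runs_scored) OVER (
        ORDER BY match_id
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3
FROM Matches
WHERE player_id = 1;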

Window functions open up a plethora of options for complex data manipulation and are integral for data scientists who work on analytics reports, dashboards, and data transformations. They provide a way to perform calculations across sets of table rows that are related to the current row within an SQL query. By understanding how to use these functions effectively, you can elevate your data analysis capabilities to a whole new level.

By using these window functions, you’ll be able to gain more insights from our sample cricket series dataset. This will allow you to not only ask more complex questions of your data but also to answer them in an efficient manner, thereby making you a more proficient data scientist.

Section 3: Common Table Expressions (CTEs)

Common Table Expressions, or CTEs, provide a way to simplify complex queries by breaking them down into smaller, more manageable pieces. A CTE essentially creates a named temporary result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. CTEs improve the readability of your SQL code, making it easier to understand and maintain. They are particularly useful for recursive queries, hierarchical data, or when you need to use the same subquery multiple times within the main query.

Anatomy of a CTE

A Common Table Expression is defined using the WITH clause followed by the CTE name and the query it encapsulates. Once defined, the CTE can be used in the main query as if it were a regular table.

Here is the basic syntax:

WITH cte_name (column_name1, column_name2, ...) AS (
    -- CTE query here
)
-- Main query using the CTE
SELECT * FROM cte_name;

Using CTEs: Calculating Average Runs per Player

Let’s say you want to find the average runs scored by each player in the ‘Ashes 2023’ series. A CTE can help make this task straightforward.

WITH PlayerAverageRuns (player_id, avg_runs) AS (
    SELECT player_id, AVG(runs_scored)
    FROM Matches
    WHERE series_id = 1
    GROUP BY player_id
)
SELECT p.player_name, par.avg_runs
FROM PlayerAverageRuns AS par
JOIN Players AS p ON par.player_id = p.player_id
ORDER BY par.avg_runs DESC;
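A single WITH clause can also define several CTEs, with later ones free to reference earlier ones, which is handy for multi-step transformations. A sketch that first averages runs per player and then keeps only the players above the series-wide average:

WITH PlayerAverageRuns AS (
    -- Step 1: average runs per player in the series
    SELECT player_id, AVG(runs_scored) AS avg_runs
    FROM Matches
    WHERE series_id = 1
    GROUP BY player_id
),
SeriesAverage AS (
    -- Step 2: the average of those per-player averages
    SELECT AVG(avg_runs) AS series_avg
    FROM PlayerAverageRuns
)
SELECT par.player_id, par.avg_runs
FROM PlayerAverageRuns AS par
CROSS JOIN SeriesAverage AS sa
WHERE par.avg_runs > sa.series_avg;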

Using CTEs for Recursive Queries: Finding Cumulative Runs

Suppose you want to find the cumulative runs scored by a player in the ‘Ind vs Eng 2023’ series. You can use a recursive CTE to solve this problem.

WITH RECURSIVE CumulativeRuns (match_id, player_id, runs_scored, total_runs) AS
(
    -- Anchor member: each player's first match in the series
    SELECT match_id, player_id, runs_scored, runs_scored
    FROM Matches
    WHERE series_id = 2 AND match_id = 1
    UNION ALL
    -- Recursive member: add the next match's runs to the running total
    -- (assumes match_id values are consecutive within the series)
    SELECT m.match_id, m.player_id, m.runs_scored, m.runs_scored + cr.total_runs
    FROM Matches AS m
    JOIN CumulativeRuns AS cr
        ON m.match_id = cr.match_id + 1
        AND m.player_id = cr.player_id
    WHERE m.series_id = 2
)
SELECT * FROM CumulativeRuns;
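One caveat worth knowing: for plain running totals, a window function is usually simpler and faster than recursion; recursive CTEs really earn their keep on hierarchical data such as org charts or category trees. The same cumulative runs, as a window-function sketch:

SELECT match_id, player_id, runs_scored,
    -- With ORDER BY inside OVER, SUM() becomes a running total per player
    SUM(runs_scored) OVER (
        PARTITION BY player_id
        ORDER BY match_id
    ) AS total_runs
FROM Matches
WHERE series_id = 2;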

Common Table Expressions offer a way to make your SQL queries more modular and readable. They are especially useful when you need to decompose complex queries into simpler parts, allowing for a more structured and understandable SQL codebase. Mastering CTEs will enable you to write efficient and clean SQL code, making your data manipulation and extraction tasks much easier.

By leveraging CTEs, you can answer complicated questions using our sample cricket series dataset that would otherwise require cumbersome query structures. Understanding CTEs is an essential skill for data scientists aiming to write complex yet maintainable SQL code.

Section 4: Indexing

Indexing is a database optimization technique that speeds up data retrieval operations. Just as a book index helps you find content quickly without reading the entire book, a database index allows the database engine to retrieve data without scanning the entire table. This is crucial in a data science context where you often deal with large datasets and need to fetch or filter data rapidly. Indexing not only reduces the data retrieval time but also enhances the performance of JOIN, WHERE, and ORDER BY clauses.

Types of Indexes

There are multiple types of indexes in SQL, but the most commonly used are:

  1. Single-Column Index: Created on a single table column.
  2. Composite Index: Created on two or more columns of a table.
  3. Unique Index: Ensures that all the values in a column are different.
  4. Full-text Index: Used for full-text searches.

Creating Indexes: Single-Column and Composite Index

Let’s consider our ‘Matches’ table from the cricket series dataset. If you frequently query this table to find records based on series_id or player_id, creating an index on these columns could be beneficial.

Here’s how to create a single-column index on series_id:

CREATE INDEX idx_series 
ON Matches(series_id);

And a composite index on series_id and player_id:

CREATE INDEX idx_series_player 
ON Matches(series_id, player_id);

Using Indexes: Speeding up Queries

Once the indexes are in place, queries that filter data based on series_id or player_id will be faster. The database engine will automatically use these indexes, so no changes are required in your SQL queries.

Here’s a query that would benefit from our newly created index:

SELECT player_id, AVG(runs_scored)
FROM Matches
WHERE series_id = 1
GROUP BY player_id;
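If you want to confirm that the index is actually being used, prefix the query with EXPLAIN (covered further in Section 6). In MySQL, check that idx_series shows up in the key column of the output:

-- The key column of the EXPLAIN output should list idx_series
EXPLAIN SELECT player_id, AVG(runs_scored)
FROM Matches
WHERE series_id = 1
GROUP BY player_id;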

Maintaining Indexes: Update and Delete

While indexes speed up data retrieval, they can slow down data modification operations like INSERT, UPDATE, and DELETE. Therefore, it’s crucial to maintain a balance. Indexes should be updated or deleted as per the evolving needs of your data workload.

To drop an index in MySQL (some databases, such as PostgreSQL, accept a plain DROP INDEX idx_series on its own):

DROP INDEX idx_series ON Matches;

Indexing is a powerful technique that can significantly speed up data retrieval operations, a crucial factor when dealing with large datasets. As a data scientist, knowing when and how to use indexes can optimize your SQL queries and make your data processing tasks more efficient.

By understanding how to implement and use indexes effectively, you’ll be better prepared to deal with performance issues in your SQL queries, especially when working with extensive databases like our cricket series dataset.

Section 5: Stored Procedures and Triggers

Stored Procedures

Stored procedures are named batches of SQL statements saved in the database itself and executed on demand. They are highly useful for encapsulating logic, improving code reusability, and enhancing security. In a data science context, you might use stored procedures to automate common data transformation or analysis tasks.

Creating Stored Procedures: Automating Data Analysis

Suppose you often need to find the average runs scored by players in different series in our sample cricket dataset. Instead of writing the SQL query every time, you can encapsulate it in a stored procedure.

Here’s how you could create such a stored procedure:

DELIMITER //

CREATE PROCEDURE GetAverageRunsBySeries(IN seriesID INT)
BEGIN
    SELECT player_id, AVG(runs_scored) AS average_runs
    FROM Matches
    WHERE series_id = seriesID
    GROUP BY player_id;
END //

DELIMITER ;

To call this stored procedure:

CALL GetAverageRunsBySeries(1);
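Procedures can also hand results back through OUT parameters, which is convenient when a calling script needs a single number rather than a full result set. A sketch (GetSeriesTotalRuns is a hypothetical helper, not part of the original dataset):

DELIMITER //

CREATE PROCEDURE GetSeriesTotalRuns(IN seriesID INT, OUT totalRuns INT)
BEGIN
    -- SELECT ... INTO writes the aggregate into the OUT parameter
    SELECT SUM(runs_scored) INTO totalRuns
    FROM Matches
    WHERE series_id = seriesID;
END //

DELIMITER ;

-- Usage: the result lands in a session variable
CALL GetSeriesTotalRuns(1, @total);
SELECT @total AS total_runs;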

Triggers

Triggers are special types of stored procedures that run automatically when an event occurs in the database, such as INSERT, UPDATE, or DELETE. Triggers can be used for enforcing business rules, validating input data, or keeping audit trails.

Creating Triggers: Maintaining Data Integrity

Imagine you want to ensure that no player can have runs scored below zero in our sample ‘Matches’ table. You could use a trigger to enforce this rule.

Here’s how to create such a trigger:

DELIMITER //

CREATE TRIGGER CheckRunsScored
BEFORE INSERT ON Matches
FOR EACH ROW
BEGIN
    IF NEW.runs_scored < 0 THEN
        SET NEW.runs_scored = 0;
    END IF;
END //

DELIMITER ;

Now, if you try to insert a record with negative runs, the trigger will automatically set it to zero:

INSERT INTO Matches (match_id, series_id, player_id, runs_scored)
VALUES (4, 1, 1, -10);

After this insert, the runs_scored will be stored as 0, thanks to the trigger.
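Triggers are equally useful for the audit trails mentioned earlier. As a sketch, assuming a hypothetical MatchAudit table, an AFTER UPDATE trigger can record every change to runs_scored:

-- MatchAudit is a hypothetical audit table for this sketch
CREATE TABLE MatchAudit (
    match_id   INT,
    player_id  INT,
    old_runs   INT,
    new_runs   INT,
    changed_at DATETIME
);

DELIMITER //

CREATE TRIGGER LogRunsUpdate
AFTER UPDATE ON Matches
FOR EACH ROW
BEGIN
    -- Log only rows where the score actually changed
    IF NEW.runs_scored <> OLD.runs_scored THEN
        INSERT INTO MatchAudit (match_id, player_id, old_runs, new_runs, changed_at)
        VALUES (OLD.match_id, OLD.player_id, OLD.runs_scored, NEW.runs_scored, NOW());
    END IF;
END //

DELIMITER ;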

Stored Procedures and Triggers are powerful features in SQL that can greatly assist in automating repetitive tasks and maintaining data integrity, respectively. In a data science environment, these can be particularly useful for automating data analysis pipelines and ensuring the quality of the data you’re working with.

By leveraging these advanced SQL features, you can bring more automation and reliability into your data manipulation and analysis tasks, using our cricket series dataset as a practical example.

Section 6: Optimization Tips for Complex Queries

Query optimization is the process of improving query performance to minimize various costs such as CPU time, memory usage, and I/O operations. For data scientists, especially those dealing with big data, optimizing queries is critical for efficient data processing. Whether you are running a simple SELECT query or a complex join involving multiple tables and conditions, performance tuning should be a priority to make the most out of your computational resources.

Using EXPLAIN: Understanding Query Execution

The EXPLAIN keyword in SQL helps you understand how the database engine plans to execute your query. This is invaluable for identifying bottlenecks or inefficiencies.

For example, consider a query to find the average runs scored in the ‘Ashes 2023’ series from our sample dataset:

EXPLAIN SELECT AVG(runs_scored) 
FROM Matches WHERE series_id = 1;

The output of this command will give you insights into whether the query will utilize indexes, how many rows will be scanned, and other valuable information.

Leveraging Subqueries: Complex Aggregations

Sometimes you need to perform aggregations on top of other aggregations. Subqueries can be highly useful in such scenarios.

For instance, if you want to find the highest average runs scored by players in the ‘Ind vs Eng 2023’ series, you can use a subquery like so:

SELECT MAX(avg_runs)
FROM (
    SELECT player_id, AVG(runs_scored) AS avg_runs
    FROM Matches
    WHERE series_id = 2
    GROUP BY player_id
) AS subquery;

Avoid SELECT *: Be Specific

When you use SELECT *, you’re asking the database to pull every column from the table. If you only need a subset of the columns, specify them explicitly. This reduces I/O costs and speeds up data retrieval.

For example, if you only need player_id and runs_scored from the 'Matches' table, query just those columns:

SELECT player_id, runs_scored 
FROM Matches;

Use WHERE Wisely: Filter Early

When you’re dealing with large datasets, it’s beneficial to reduce the size of the data as early as possible. By applying filters in the WHERE clause effectively, you can minimize the amount of data that needs to be scanned.

For example, if you’re only interested in matches from the ‘Ashes 2023’ series, filter it right in the WHERE clause:

SELECT player_id, COUNT(*) 
FROM Matches
WHERE series_id = 1
GROUP BY player_id;
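A related habit is keeping WHERE predicates index-friendly (often called "sargable"). Wrapping an indexed column in a function forces the engine to scan every row. A sketch, assuming a hypothetical match_date column on the Matches table:

-- Avoid: the function call on the column defeats an index on match_date
SELECT match_id FROM Matches
WHERE YEAR(match_date) = 2023;

-- Prefer: a plain range predicate lets the engine use the index
SELECT match_id FROM Matches
WHERE match_date >= '2023-01-01' AND match_date < '2024-01-01';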

Query optimization is essential in the realm of data science, especially when dealing with large or complex datasets like our sample cricket series. Whether it’s utilizing the EXPLAIN keyword to understand the query execution plan, leveraging subqueries for complex aggregations, or being smart about your SELECT and WHERE clauses, each technique can contribute to faster and more efficient data processing.

By applying these optimization strategies, you’ll be well-equipped to handle the complexities that come with big data and advanced analytics, making you a more proficient data scientist.

Conclusion

Advanced SQL techniques offer a powerful toolkit for data scientists to perform intricate data manipulation and extraction tasks efficiently. From subqueries to window functions, from CTEs to indexing, mastering these advanced SQL features will significantly up your data science game. As we continue our journey in the data science realm, remember: complexity is just simplicity waiting to be understood. So keep querying and keep exploring!


#datascience #datascientist #SportsAnalytics #SQL

Thanks for reading this till the end! Don’t forget to add your feedback in the comments!

Please read the first blog of this series: SQL Essentials for Future Data Scientists

Let’s connect on LinkedIn to collaborate on projects: https://www.linkedin.com/in/jhakamal/



Written by DataGrad

A Technical Consultant, with expertise in Data Analysis and Visualization. Open to work!
