Introduction
Welcome to the first installment in our series designed to turn you into a data science maestro! But before we dive into the nitty-gritty of SQL, let’s take a moment to understand what data science is and what a data scientist does.
Data science is an interdisciplinary field that leverages a variety of techniques, algorithms, processes, and systems to extract knowledge and insights from structured and unstructured data. It’s like being a detective for the digital age, where your clues are buried in rows and columns of data, and your deductions could lead to actionable insights or transformative business strategies.
Data scientists come in many flavors but generally are adept at mathematics, statistics, and programming. They utilize these skills to analyze data, discover patterns, and make data-driven decisions. They are the architects behind recommendation systems, predictive models for healthcare, real-time analytics for finance, and even strategies for optimizing various functionalities in diverse sectors.
So, whether you aim to build machine learning models or focus on analytic reports, SQL is a foundational skill you’ll need. Today, we’ll begin our journey into the world of data science by diving deep into SQL through the lens of an engaging subject — cricket series data.
What You’ll Learn
Basics of SQL queries
Filtering and Sorting Data
Aggregating Data
SQL Joins
Section 1: Why SQL for Data Science?
Structured Query Language (SQL) is more than just a tool; it’s a linchpin for anyone working in data science. Here’s why:
Data Retrieval
- One of the most common tasks you’ll perform as a data scientist is data extraction. SQL is the industry standard for fetching data from databases, where most of the world’s digital information is stored. You could be dealing with transaction records, customer data, or even larger datasets like social network activities; SQL is the go-to for data retrieval.
Data Manipulation
- SQL isn’t just for fetching data; it’s also highly effective for data manipulation tasks. You can update records, delete data points, or even rearrange entire tables, all through concise SQL commands.
Data Analysis
- SQL comes packed with numerous built-in functions for data analysis. From simple calculations like averages and sums to more complex statistical measures, SQL allows you to run these analyses directly on the database, reducing the need to pull massive datasets into external software, saving both time and computational resources.
Versatility
- SQL skills are highly transferable. Whether you’re working in the cloud, utilizing a NoSQL database, or even operating within big data platforms like Hadoop, a solid understanding of SQL principles will stand you in good stead.
Data Integration
- Data scientists often work with disparate data sources, and SQL excels at joining these through complex queries. Whether you’re merging tables or combining multiple data sources, SQL simplifies this process, making it easier to create holistic data views for analysis.
Big Data and Real-time Analytics
- As the world increasingly moves towards real-time data analytics and big data, SQL remains at the core. Many big data solutions like Hive and Spark SQL are modeled around SQL. It’s a skill that scales with the technology landscape, making it future-proof to a large extent.
By mastering SQL, you’re not just adding another skill to your toolkit; you’re laying the foundation for effective and efficient data handling. For data scientists, SQL is not an option; it’s a necessity.
Section 2: Setting Up the Sample Dataset
Creating a sample dataset is a crucial step in any data science tutorial. It serves as a sandbox, allowing you to get hands-on experience with SQL queries and operations. For this blog, we’re going with a cricket series dataset to make learning SQL more engaging and relatable.
The Tables
Our dataset will consist of three tables: Series
, Players
, and Matches
.
- Series Table: This table captures different cricket series, like ‘Ashes 2023’ or ‘Ind vs Eng 2023’. It will help us understand how to link data across tables.
- Players Table: Information about individual players, such as names and teams, will be stored here. This table serves to showcase basic querying techniques.
- Matches Table: This table stores the performance details of players in different matches within a series, enabling us to learn about aggregation and joins.
Why These Tables?
The reason for choosing these tables is to cover a broad spectrum of SQL operations that data scientists frequently perform. They range from basic SELECT queries and WHERE clauses to more advanced operations like JOINs and aggregations.
Creating Tables
To create these tables, execute the following SQL commands:
CREATE TABLE Series(
series_id INT,
series_name TEXT);
CREATE TABLE Players(
player_id INT,
player_name TEXT,
team TEXT);
CREATE TABLE Matches(
match_id INT,
series_id INT,
player_id INT,
runs_scored INT);
Populating Tables
Once the tables are created, populate them with sample data:
-- Series Data
INSERT INTO Series VALUES (1, 'Ashes 2023');
INSERT INTO Series VALUES (2, 'Ind vs Eng 2023');
-- Players Data
INSERT INTO Players VALUES (1, 'Virat Kohli', 'India');
INSERT INTO Players VALUES (2, 'Ben Stokes', 'England');
INSERT INTO Players VALUES (3, 'Steve Smith', 'Australia');
INSERT INTO Players VALUES (4, 'Joe Root', 'England');
-- Matches Data
INSERT INTO Matches VALUES (1, 1, 3, 55);
INSERT INTO Matches VALUES (1, 1, 4, 45);
INSERT INTO Matches VALUES (2, 1, 3, 20);
INSERT INTO Matches VALUES (2, 1, 4, 60);
INSERT INTO Matches VALUES (3, 2, 1, 30);
INSERT INTO Matches VALUES (3, 2, 2, 50);
Creating and populating these tables set the stage for us to explore various SQL operations. By the end of this tutorial series, you’ll not only understand SQL syntax but also gain insights into how to perform complex queries efficiently.
Section 3: Basic SQL Queries
Mastering the basics is the first step towards SQL proficiency. In this section, we’ll go through some of the fundamental SQL queries that every data scientist should know, using our cricket series dataset as an example.
Fetching All Records
The SELECT *
query allows you to retrieve all columns and records from a table. It's useful for quickly examining your dataset, though it can be resource-intensive on large tables.
SELECT *
FROM Players;
Fetching Specific Columns
If you are interested in specific columns, you can specify them in the SELECT
statement. This reduces the amount of data fetched and improves performance.
SELECT player_name, team
FROM Players;
Filtering Records with WHERE
The WHERE
clause enables you to filter records based on certain conditions, making your queries more focused.
SELECT *
FROM Players
WHERE team = 'India';
Combining Filters with AND & OR
You can combine multiple conditions using AND
and OR
to fetch more specific data.
SELECT *
FROM Players
WHERE team = 'India' AND player_name = 'Virat Kohli';
Using IN and BETWEEN
To filter records that belong to a set of values, you can use the IN
keyword. Similarly, BETWEEN
allows you to select values within a range.
SELECT *
FROM Players
WHERE player_id IN (1, 3);
SELECT *
FROM Matches
WHERE runs_scored BETWEEN 30 AND 60;
Limiting Results
If you want to limit the number of results returned, use the LIMIT
keyword. This is particularly useful for queries on large datasets.
SELECT *
FROM Players
LIMIT 3;
Counting Records The COUNT
function allows you to get the number of records that match a certain condition.
SELECT COUNT(*)
FROM Players
WHERE team = 'India';
By understanding these basic SQL queries, you lay the groundwork for more complex operations and analyses. Each type of query serves as a building block that will be crucial for your future work in data science.
Section 4: Filtering and Sorting
Data seldom comes in the exact form you need; often, you’ll want to filter and sort it to better fit your analytical requirements. This section will deepen your understanding of how to refine your queries for more targeted results, using our cricket series dataset.
Filtering with Multiple Conditions
Filtering is not limited to single conditions. You can use logical operators like AND
and OR
in combination with relational operators to refine your filters. Here's how you can get records of players who scored more than 50 runs in the 'Ashes 2023' series.
SELECT *
FROM Matches
WHERE series_id = 1 AND runs_scored > 50;
Using LIKE for Partial Matching
The LIKE
operator allows you to perform partial string matching, useful for text-based searches. For instance, to find all players with names starting with 'S', you can do:
SELECT *
FROM Players
WHERE player_name LIKE 'S%';
Sorting with ORDER BY
When your data needs to be arranged in a certain order, ORDER BY
is your go-to SQL statement. By default, it sorts in ascending order but you can specify DESC
for descending.
SELECT *
FROM Players
ORDER BY player_name DESC;
Combining Filtering and Sorting Filtering and sorting can be combined for highly specific queries. For example, if you want to find all players from ‘India’ and sort them by their names:
SELECT *
FROM Players
WHERE team = 'India'
ORDER BY player_name;
Using DISTINCT for Unique Values
If you want to remove duplicate records and get a list of unique items, the DISTINCT
keyword is quite useful. For instance, to get the unique teams from the Players table:
SELECT DISTINCT team
FROM Players;
Aggregate Functions in Filtering
You can also incorporate aggregate functions like SUM
, AVG
, and MAX
in your queries to derive statistical insights directly from your data. For example, to find the highest runs scored in the 'Ashes 2023' series:
SELECT MAX(runs_scored)
FROM Matches
WHERE series_id = 1;
Understanding how to effectively filter and sort data is essential for data manipulation and subsequent analysis. As you progress in your data science career, these SQL techniques will be invaluable in efficiently extracting meaningful information from vast datasets.
Section 5: Aggregating Data
Aggregation is the process of performing calculations on a set of values to get a single, summarized result. In data science, this is invaluable for understanding trends, making comparisons, or drawing insights from raw data. We’ll explore key aggregation functions using our cricket series dataset.
Counting Records with COUNT
The COUNT
function allows you to determine the number of records that meet specific conditions.
SELECT COUNT(*)
FROM Matches
WHERE runs_scored > 50;
Summation with SUM
To find the total of a numeric column, you can use the SUM
function. This is particularly useful in finance, statistics, or sports analytics.
SELECT SUM(runs_scored)
FROM Matches
WHERE series_id = 1;
Averaging Values with AVG
Finding the mean value of a set of numbers is a common statistical operation, accomplished with the AVG
function.
SELECT AVG(runs_scored)
FROM Matches
WHERE series_id = 2;
Finding Minimum and Maximum with MIN and MAX
Identifying the smallest or largest value in a dataset is a breeze with MIN
and MAX
.
SELECT MIN(runs_scored), MAX(runs_scored)
FROM Matches;
Grouping Data with GROUP BY
Often, you’ll want to apply aggregation functions to distinct categories or groups in your data. The GROUP BY
clause lets you do this effortlessly.
SELECT series_id, AVG(runs_scored)
FROM Matches
GROUP BY series_id;
Combining Aggregation with Filtering
You can combine WHERE
clauses with your aggregation functions to focus your summaries on specific subsets of data.
SELECT series_id, COUNT(*)
FROM Matches
WHERE runs_scored > 30
GROUP BY series_id;
Utilizing HAVING for Filtered Aggregation
When you need to filter based on the result of an aggregation, use the HAVING
clause in conjunction with GROUP BY
.
SELECT series_id, AVG(runs_scored)
FROM Matches
GROUP BY series_id
HAVING AVG(runs_scored) > 40;
Aggregation is an indispensable tool for data analysis, enabling you to summarize complex datasets and derive actionable insights. As you continue to delve deeper into SQL and data science, these aggregation methods will become fundamental components of your data manipulation and analytics toolbox.
Section 6: SQL Joins
SQL joins allow you to combine rows from two or more tables based on a related column between them. This is indispensable in scenarios where the data is distributed across different tables but needs to be analyzed together. We’ll explore the most commonly used types of joins using our cricket series dataset.
Inner Join
An inner join fetches rows that have matching values in both tables. This is the most commonly used type of join.
SELECT Players.player_name, Matches.runs_scored
FROM Players
INNER JOIN Matches
ON Players.player_id = Matches.player_id;
Left (Outer) Join
A left join retrieves all records from the left table, and the matching records from the right table. If no match is found, NULL values are returned for right table’s columns.
SELECT Players.player_name, Matches.runs_scored
FROM Players
LEFT JOIN Matches
ON Players.player_id = Matches.player_id;
Right (Outer) Join
A right join is similar to a left join but retrieves all records from the right table.
SELECT Players.player_name, Matches.runs_scored
FROM Players
RIGHT JOIN Matches
ON Players.player_id = Matches.player_id;
Full (Outer) Join
A full join combines rows from both tables, returning NULL for non-matching rows from either table.
SELECT Players.player_name, Matches.runs_scored
FROM Players
FULL JOIN Matches
ON Players.player_id = Matches.player_id;
Cross Join
A cross join returns the Cartesian product of both tables, meaning each row from the first table is combined with each row from the second table.
SELECT Players.player_name, Series.series_name
FROM Players
CROSS JOIN Series;
Self Join
A self join allows you to join a table with itself. This is particularly useful for hierarchical or ordered data stored within the same table.
SELECT A.player_name, B.player_name
FROM Players A, Players B
WHERE A.team = B.team AND A.player_id != B.player_id;
Using Joins in Aggregated Queries
Combining joins with aggregate functions can produce highly insightful results, particularly when analyzing data from multiple tables.
SELECT Players.team, AVG(Matches.runs_scored)
FROM Players
INNER JOIN Matches
ON Players.player_id = Matches.player_id
GROUP BY Players.team;
Understanding how to execute and leverage different types of SQL joins is pivotal for complex data analysis and transformation tasks. Mastering joins will significantly widen your SQL toolkit, allowing you to tackle more intricate questions and problems in your data science career.
Conclusion
In the realm of data science, SQL stands as an invaluable skill that goes beyond mere data retrieval; it is the cornerstone for data manipulation, transformation, and analysis. Throughout this blog, we’ve traversed a journey from understanding why SQL is indispensable for data science, to setting up a sample cricket series dataset, and then diving into the fundamentals like basic queries, filtering, sorting, and data aggregation. We also explored the power of SQL joins, a crucial element for handling complex queries across multiple tables.
The use of a sample cricket dataset aimed to add a real-world touch, showing that the applications of SQL in data science are endless. Whether it’s sports analytics, healthcare, finance, or any other sector, SQL enables you to ask complex questions and get answers that drive decisions.
Remember, the journey of mastering SQL is ongoing. Each query you write will not only fetch you the data you seek but also add to your understanding of how databases think. As you become more accustomed to thinking in SQL, you’ll find it an increasingly powerful tool in your data science arsenal.
So keep querying, keep exploring, and remember: in the data world, your curiosity is your best asset. As we’ve seen, SQL is a robust tool that can answer a multitude of questions, limited only by the queries you’re willing to ask.
#datascience #datascientist #SportsAnalytics #SQL
Thanks for reading this till the end! Don’t forget to add your feedback in the comment!
Read next piece of this series “Advanced SQL Techniques for Data Scientists: A Deep Dive”
Let’s connect on Linkedin.com/in/jhakamal to collaborate on projects!