Hello everyone! I’m Dhyuthidhar Saraswathula. If you’re new here, I write blogs on Data Science and Computer Science topics.
So, buckle up! Today, we’re diving into an essential data science concept: Exploratory Data Analysis (EDA). In SQL, EDA means using queries to explore, summarize, and understand data. It's the first critical step in any data project, giving us insights and clarity before deeper analysis or modeling.
Why Do We Need EDA?
EDA helps us answer several important questions about our data:
Data Structure: What’s the structure of our data?
Data Quality: Are there any missing values or outliers?
Value Distribution: What’s the distribution of values in each column?
Relationships: What correlations or relationships exist between variables?
Fun Fact: The term "Exploratory Data Analysis" was popularized by John Tukey in his 1977 book Exploratory Data Analysis, where he emphasized the importance of visualizations like boxplots and scatterplots to understand data.
Why Is EDA Growing So Fast?
A few reasons drive the rapid growth of EDA:
Technological Advancements: Computing tools have evolved, making data analysis more accessible.
Big Data: We now have vast amounts of data across fields that require effective analysis.
Cross-Disciplinary Demand: EDA is valuable in finance, healthcare, marketing, and beyond.
Data Sources in SQL
Most SQL data comes in structured tables. In real-world scenarios, however, data often starts unstructured—like sensor measurements, images, or text—and needs to be organized into a structured format to perform SQL-based analysis.
Types of Data in SQL
Understanding data types is crucial as they impact the analyses and visualizations we can perform. In SQL, we usually work with:
Numerical Data
Continuous: Values that can take any number within a range (e.g., salary, age).
Discrete: Countable values (e.g., number of orders).
Categorical Data
Binary: Data with two categories (e.g., 0/1 for yes/no).
Ordinal: Data with a specific order (e.g., ratings like 1-5).
Knowing these types allows us to choose the right SQL functions and analysis techniques.
Basic SQL EDA Techniques
Let's explore some SQL functions for EDA:
COUNT()
UseCOUNT()
to understand the quantity of data. If we want to count the number of students in the Computer Science department, we’d write:SELECT COUNT(*) AS Count_of_Students FROM Students WHERE Department = 'Computer Science';
SUM()
UseSUM()
to get the total of a numeric column. To find the total revenue in a Transactions table:SELECT SUM(Amount) AS Total_Revenue FROM Transactions;
AVG()
Calculate averages withAVG()
. To find the average age of students in the Computer Science department:SELECT AVG(Age) AS Average_Age FROM Students WHERE Department = 'Computer Science';
MIN()
andMAX()
These functions return the smallest and largest values. To find the youngest and oldest students in the Computer Science department:SELECT MIN(Age) AS Youngest_Age, MAX(Age) AS Oldest_Age FROM Students WHERE Department = 'Computer Science';
Grouping Data with GROUP BY
GROUP BY
allows us to aggregate data by specific groups. To get the number of students in each department:
SELECT Department, COUNT(*) AS Student_Count
FROM Students
GROUP BY Department;
Filtering Aggregated Data with HAVING
The HAVING
clause filters data after grouping. To find departments with more than 50 students:
SELECT Department, COUNT(*) AS Student_Count
FROM Students
GROUP BY Department
HAVING COUNT(*) > 50;
Note: WHERE
filters rows before grouping, while HAVING
filters grouped results.
Summary of EDA Insights
Using SQL, we can quickly gain critical insights like:
Total Records: Using
COUNT()
Summing Values: Using
SUM()
Averages: Using
AVG()
Ranges: Using
MIN()
andMAX()
These insights lay the foundation for deeper analysis and help us understand our data’s characteristics before applying complex models.
Final Thoughts
That wraps up our introduction to EDA in SQL! EDA in SQL is a powerful way to explore structured data and gain essential insights. In my next blog, I’ll dive into using advanced SQL functions and joins for even richer data exploration. Stay tuned!