Practical Data Science: Rectangular Data and Estimation Techniques in SQL
Hey everyone, welcome back! If you’re new here, I’m Dhyuthidhar Saraswathula. I write blogs focused on computer science, with a special emphasis on data science and analytics.
Today, let’s dive into Rectangular Data and the SQL techniques used to analyze it effectively. Whether you’re a data science student or a corporate employee, mastering SQL to analyze structured data is a critical skill. Let’s dive into the topic.
Rectangular Data in SQL
Rectangular Data refers to data structured into rows and columns, commonly stored in relational databases.
Rows: Represent records or observations (e.g., students or customers).
Columns: Represent features or variables (e.g., age, grades, or purchases).
Examples of Rectangular Data:
Exam Results:
Rows: Students
Columns: Marks for different subjects.
Customer Analytics:
Rows: Customers
Columns: Purchases or demographic details.
Key Concept:
Data is not always in a rectangular format. You may need to preprocess and transform raw, unstructured data into a structured table using SQL.
Key Terminology in SQL Context
Features → Columns in a table.
Records → Rows in a table.
Indexes → Unique identifiers like
Primary Keys
for rows. It makes the performance of querying efficient.
SQL dialects like MySQL, PostgreSQL, and SQLite are perfect for managing rectangular data. Here's how you can analyze such data with SQL.
Central Tendency Analysis in SQL
It is also called the Estimation of Location, which refers to the methods used to find the data's central tendency(estimate value).
SQL provides built-in functions to calculate measures of central tendency like mean, median, weighted mean, and more.
1. Mean Calculation
SELECT AVG(column_name) AS Mean_Value
FROM table_name;
- Example: Find the average marks in an exam.
2. Median Calculation
SQL doesn't have a built-in MEDIAN
function, but you can calculate it using a subquery:
SELECT AVG(middle_values) AS Median_Value
FROM (
SELECT column_name AS middle_values
FROM table_name
ORDER BY column_name
LIMIT 2 - (SELECT COUNT(*) % 2) OFFSET (SELECT (COUNT(*) - 1) / 2)
) AS SubQuery;
There are lot of ways to find median, you can also use window functions.
Example: Find the median salary of employees.
3. Weighted Mean
Use SQL with a calculated weight
column:
SELECT SUM(column_name * weight_column) / SUM(weight_column) AS Weighted_Mean
FROM table_name;
- Example: Calculate weighted average grades, where weights are credit hours.
4. Trimmed Mean
You can exclude outliers by trimming top and bottom values using LIMIT
and OFFSET
:
SELECT AVG(column_name) AS Trimmed_Mean
FROM (
SELECT column_name
FROM table_name
ORDER BY column_name
LIMIT x OFFSET y
) AS TrimmedValues;
- Example: Analyze exam scores after excluding outliers.
Identifying Outliers with SQL
Outliers can skew your analysis. Use SQL to detect them:
Find Outliers Using Standard Deviation
SELECT column_name
FROM table_name
WHERE column_name > (SELECT AVG(column_name) + 2 * STDDEV(column_name) FROM table_name)
OR column_name < (SELECT AVG(column_name) - 2 * STDDEV(column_name) FROM table_name);
- Example: Detect unusually high or low sales figures.
Replace Outliers with NULL
UPDATE table_name
SET column_name = NULL
WHERE column_name > (SELECT AVG(column_name) + 2 * STDDEV(column_name) FROM table_name)
OR column_name < (SELECT AVG(column_name) - 2 * STDDEV(column_name) FROM table_name);
- Example: Replace anomalous customer ages with
NULL
.
SQL Queries for Practical Implementation
Here’s a practical dataset:
Student_ID | Subject | Marks | Weight |
1 | Math | 85 | 3 |
2 | Science | 90 | 4 |
3 | English | 75 | 2 |
4 | History | 80 | 3 |
5 | Art | 100 | 1 |
1. Mean Marks
SELECT AVG(Marks) AS Average_Marks
FROM ExamResults;
2. Median Marks
SELECT AVG(middle_values) AS Median_Value
FROM (
SELECT Marks AS middle_values
FROM ExamResults
ORDER BY Marks
LIMIT 2 - (SELECT COUNT(*) % 2) OFFSET (SELECT (COUNT(*) - 1) / 2)
) AS SubQuery;
3. Weighted Mean
SELECT SUM(Marks * Weight) / SUM(Weight) AS Weighted_Mean
FROM ExamResults;
4. Trimmed Mean
Exclude the highest and lowest 10%:
SELECT AVG(Marks) AS Trimmed_Mean
FROM (
SELECT Marks
FROM ExamResults
ORDER BY Marks
LIMIT 80 OFFSET 10
) AS TrimmedValues;
Final Thoughts
I hope you understand how rectangular data analysis works in SQL. Understanding SQL techniques to analyze rectangular data is fundamental for effective data analysis. From calculating measures of central tendency to handling outliers, SQL offers powerful tools to make your data speak. So now you guys try to practice this same analysis using other data you can find in Kaggle and you can practice queries in HackerRank,w3school etc. Can you write a query to calculate the mode of a dataset? Write your answer in the comments.