Practical Data Science: Rectangular Data and Estimation Techniques in SQL

Practical Data Science: Rectangular Data and Estimation Techniques in SQL

Hey everyone, welcome back! If you’re new here, I’m Dhyuthidhar Saraswathula. I write blogs focused on computer science, with a special emphasis on data science and analytics.

Today, let’s dive into Rectangular Data and the SQL techniques used to analyze it effectively. Whether you’re a data science student or a corporate employee, mastering SQL to analyze structured data is a critical skill. Let’s dive into the topic.


Rectangular Data in SQL

Rectangular Data refers to data structured into rows and columns, commonly stored in relational databases.

  • Rows: Represent records or observations (e.g., students or customers).

  • Columns: Represent features or variables (e.g., age, grades, or purchases).

Examples of Rectangular Data:

  1. Exam Results:

    • Rows: Students

    • Columns: Marks for different subjects.

  2. Customer Analytics:

    • Rows: Customers

    • Columns: Purchases or demographic details.

Key Concept:
Data is not always in a rectangular format. You may need to preprocess and transform raw, unstructured data into a structured table using SQL.


Key Terminology in SQL Context

  • Features → Columns in a table.

  • Records → Rows in a table.

  • Indexes → Unique identifiers like Primary Keys for rows. It makes the performance of querying efficient.

SQL dialects like MySQL, PostgreSQL, and SQLite are perfect for managing rectangular data. Here's how you can analyze such data with SQL.


Central Tendency Analysis in SQL

It is also called the Estimation of Location, which refers to the methods used to find the data's central tendency(estimate value).

SQL provides built-in functions to calculate measures of central tendency like mean, median, weighted mean, and more.

1. Mean Calculation

SELECT AVG(column_name) AS Mean_Value
FROM table_name;
  • Example: Find the average marks in an exam.

2. Median Calculation

SQL doesn't have a built-in MEDIAN function, but you can calculate it using a subquery:

SELECT AVG(middle_values) AS Median_Value
FROM (
    SELECT column_name AS middle_values
    FROM table_name
    ORDER BY column_name
    LIMIT 2 - (SELECT COUNT(*) % 2) OFFSET (SELECT (COUNT(*) - 1) / 2)
) AS SubQuery;
  • There are lot of ways to find median, you can also use window functions.

  • Example: Find the median salary of employees.

3. Weighted Mean

Use SQL with a calculated weight column:

SELECT SUM(column_name * weight_column) / SUM(weight_column) AS Weighted_Mean
FROM table_name;
  • Example: Calculate weighted average grades, where weights are credit hours.

4. Trimmed Mean

You can exclude outliers by trimming top and bottom values using LIMIT and OFFSET:

SELECT AVG(column_name) AS Trimmed_Mean
FROM (
    SELECT column_name
    FROM table_name
    ORDER BY column_name
    LIMIT x OFFSET y
) AS TrimmedValues;
  • Example: Analyze exam scores after excluding outliers.


Identifying Outliers with SQL

Outliers can skew your analysis. Use SQL to detect them:

Find Outliers Using Standard Deviation

SELECT column_name
FROM table_name
WHERE column_name > (SELECT AVG(column_name) + 2 * STDDEV(column_name) FROM table_name)
   OR column_name < (SELECT AVG(column_name) - 2 * STDDEV(column_name) FROM table_name);
  • Example: Detect unusually high or low sales figures.

Replace Outliers with NULL

UPDATE table_name
SET column_name = NULL
WHERE column_name > (SELECT AVG(column_name) + 2 * STDDEV(column_name) FROM table_name)
   OR column_name < (SELECT AVG(column_name) - 2 * STDDEV(column_name) FROM table_name);
  • Example: Replace anomalous customer ages with NULL.

SQL Queries for Practical Implementation

Here’s a practical dataset:

Student_IDSubjectMarksWeight
1Math853
2Science904
3English752
4History803
5Art1001

1. Mean Marks

SELECT AVG(Marks) AS Average_Marks
FROM ExamResults;

2. Median Marks

SELECT AVG(middle_values) AS Median_Value
FROM (
    SELECT Marks AS middle_values
    FROM ExamResults
    ORDER BY Marks
    LIMIT 2 - (SELECT COUNT(*) % 2) OFFSET (SELECT (COUNT(*) - 1) / 2)
) AS SubQuery;

3. Weighted Mean

SELECT SUM(Marks * Weight) / SUM(Weight) AS Weighted_Mean
FROM ExamResults;

4. Trimmed Mean

Exclude the highest and lowest 10%:

SELECT AVG(Marks) AS Trimmed_Mean
FROM (
    SELECT Marks
    FROM ExamResults
    ORDER BY Marks
    LIMIT 80 OFFSET 10
) AS TrimmedValues;

Final Thoughts

I hope you understand how rectangular data analysis works in SQL. Understanding SQL techniques to analyze rectangular data is fundamental for effective data analysis. From calculating measures of central tendency to handling outliers, SQL offers powerful tools to make your data speak. So now you guys try to practice this same analysis using other data you can find in Kaggle and you can practice queries in HackerRank,w3school etc. Can you write a query to calculate the mode of a dataset? Write your answer in the comments.