The GROUP BY Statement

tags: #sql/basic_statements

The GROUP BY statement groups together rows in a table based on a specified column or columns, and then the aggregate function can be applied to the values in each group.

SELECT column_name(s), aggregate_function(column_name)
FROM table_name
WHERE condition
GROUP BY <grouping attribute>
ORDER BY <ordering attribute>;

This groups the result rows into partitions, based on their values in one or several columns.

To group by multiple attributes:

...
GROUP BY column1, column2,...
...

The result of a query using a GROUP BY statement contains one row for each group.

Example:

SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department;

This computes the average salaries of employees for each department.

Filtering using `HAVING`

Aggregate Function Required

The clause only works with the GROUP BY that also involve an aggregate function (e.g., SUM, AVG, COUNT, MAX, MIN) specified in the SELECT clause.

We can filter groups using the HAVING clause based on the specified condition(s):

SELECT column1, column2, ..., aggregate_function(column) AS result
FROM table
WHERE condition
GROUP BY column1, column2, ...
HAVING condition; // condition based on aggregate function

You can combine GROUP BY with MIN() / MAX() and filter groups using HAVING :

SELECT
  category,
  MIN(price) AS min_price,
  MAX(price) AS max_price
FROM products
GROUP BY category
HAVING MIN(price) >= 10
   AND MAX(price) <= 100;

This keeps only categories where the cheapest item is at least 10 and the most expensive item is at most 100.

Extensions to the `Group By` Clause

What is this?

In SQL, GROUP BY is used for aggregating data. However, sometimes, we need more advanced summarization across multiple dimensions. This is where CUBE and ROLLUP come in.

Taking this dataset as an example:

         Date     Category  Sales
0  2024-03-01  Electronics   1000
1  2024-03-01    Clothing    500
2  2024-03-02  Electronics   1200
3  2024-03-02    Clothing    700

1. `Rollup()` (Hierarchical Aggregation)

Generates subtotal and grand total along a hierarchy.
Computes cumulative aggregation from detailed rows up to the highest level (i.e., at each level in the hierarchy, the aggregation (e.g., sum, average) includes the values from the lower levels)

Example:

SELECT 
	Date
	, Category
	, SUM(Sales) AS TotalSales
FROM sales_data
GROUP BY ROLLUP(Date, Category);

What Happens?

Groups data by (Date, Category):

This is the standard GROUP BY behaviour where the data is grouped first by Date and then by Category.
No difference from what you'd expect with a normal group by Date, Category

Generate subtotals by aggregating across categories within each date.

For instance, if we take 2024-03-01, the data is grouped into: Electronics and Clothing
SUM(Sales) will be computed for both of these, and then a subtotal (the sum of these sums) will be generated for the entire date 2024-03-01

Cumulative Aggregation

The grand total is the final aggregation across all of the data, regardless of Date or Category. It's a total across everything.

Result set:

| Date       | Category    | TotalSales |
|------------|-------------|------------|
| 2024-03-01 | Electronics | 1000       |
| 2024-03-01 | Clothing    | 500        |
| 2024-03-01 | **NULL**    | **1500**   ← (*Subtotal for 2024-03-01*) |
| 2024-03-02 | Electronics | 1200       |
| 2024-03-02 | Clothing    | 700        |
| 2024-03-02 | **NULL**    | **1900**   ← (*Subtotal for 2024-03-02*) |
| **NULL**   | **NULL**    | **3400**   ← (*Grand Total*) |

2. `Cube()` (Aggregation across all combinations of columns)

The CUBE() function is an extension of the GROUP BY clause in SQL that computes aggregations for all possible combinations of the specified columns in the clause, including individual dimensions.

SELECT 
	Date
	, Category
	, SUM(Sales) AS TotalSales
FROM sales_data
GROUP BY CUBE(Date, Category);

| Date       | Category    | TotalSales |
|------------|-------------|------------|
| 2024-03-01 | Electronics | 1000       |
| 2024-03-01 | Clothing    | 500        |
| 2024-03-01 | **NULL**    | **1500**   ← (*Subtotal for 2024-03-01*) |
| 2024-03-02 | Electronics | 1200       |
| 2024-03-02 | Clothing    | 700        |
| 2024-03-02 | **NULL**    | **1900**   ← (*Subtotal for 2024-03-02*) |
| **NULL**   | Electronics | **2200**   ← (*Subtotal for Electronics*) |
| **NULL**   | Clothing    | **1200**   ← (*Subtotal for Clothing*) |
| **NULL**   | **NULL**    | **3400**   ← (*Grand Total*) |

All Possible Combinations

Aggregations for each Date + Category combination.
Aggregations for each Date alone (ignoring Category).
Aggregations for each Category alone (ignoring Date).
A grand total (ii.e., total sum of the entire dataset)

3. Grouping Sets (for custom aggregation)

When you need multiple levels of aggregation across various columns, GROUPING SETS allows you to perform these calculations in one query.

SELECT Date, Category, SUM(Sales) AS TotalSales
FROM sales_data
GROUP BY GROUPING SETS (
    (Date, Category),
    (Date),
    (Category),
    () -- Grand Total
);

Result set:

			| Date       | Category    | TotalSales |
			|------------|-------------|------------|
D,C			| 2024-03-01 | Electronics | 1000       |
			| 2024-03-01 | Clothing    | 500        |
			| 2024-03-02 | Electronics | 1200       |
			| 2024-03-02 | Clothing    | 700        |
D			| 2024-03-01 | **NULL**    | 1500       ← (*Subtotal for 2024-03-01*) |
			| 2024-03-02 | **NULL**    | 1900       ← (*Subtotal for 2024-03-02*) |
C			| **NULL**   | Electronics | 2200       ← (*Subtotal for Electronics*) |
			| **NULL**   | Clothing    | 1200       ← (*Subtotal for Clothing*) |
			| **NULL**   | **NULL**    | 3400       ← (*Grand Total*) |

Filtering using HAVING

Extensions to the Group By Clause

1. Rollup() (Hierarchical Aggregation)

2. Cube() (Aggregation across all combinations of columns)

3. Grouping Sets (for custom aggregation)

Filtering using `HAVING`

Extensions to the `Group By` Clause

1. `Rollup()` (Hierarchical Aggregation)

2. `Cube()` (Aggregation across all combinations of columns)