Hello everyone. We have used GROUP BY operation to perform aggregations in our queries. Consider the case where we have data with retail store inventory. Every month, we have shipped products to different stores with different product types like clothing, home appliances, etc. Now we want to calculate how many products have we shipped to each store according to product types as well as total products shipped irrespective of its type. This will not be possible in a single group by statement. This is where GROUPING SETS, ROLLUP, and CUBE come in to picture.

GROUPING SETS :

Consider we have data in the following format.

Existing data

Now we need to calculate total products shipped to store in each category as well as total products across all product types. We can use GROUPING SETS for this.

This query will give us output in the following format.

GROUPING SET output

We have calculated the number of products shipped to each store for each product type as well as the total number of products shipped to that store. Where the product_type column is null we have the total sum of products sent to that store across all product types. For achieving the same result with GROUP BY, we would need to write the following query. We can see that this query is clearly inefficient as it reads an entire table twice to get us the result which GROUPING SET can do it one scan.

Consider another example,

To get the same result with GROUP BY we will need the following query.

We can use GROUPING_SET to get all products shipped from inventory to all stores and of all product types. If we provide blank set in GROUPING SET clause we will get total products shipped from this inventory.

Grouping set output

Here, the first row shows us the total number of products shipped to all stores and of all product types. We can pivot required values to another column instead of a row in the output. To know how we can pivot rows to columns you can check Pivot rows to columns in Hive

GROUPING__ID:

As we can see, while using GROUPING SETS we have columns with null values. It may be confusing sometimes to know which group is used to perform aggregation. We can use GROUPING__ID function to know which group is used for that aggregation results.

Grouping_id output

In this case,

0 = no group

1 = aggregation performed over one column i.e stored_id

3= aggregation performed over both column i.e store_id, product_type

ROLLUP and CUBE:

ROLLUP AND CUBE are functions built on GROUPING SETS.

In the case of ROLLUP,

This is equivalent to the following GROUPING SET query.

CUBE considers all possible combination of columns present in GROUP BY clause.

This is equivalent to the following GROUPING SET query.

This is how we can perform enhanced aggregations in Hive which are optimized than normal GROUP BY clause.

Leave a Reply

Close Panel