Hello everyone. We have used GROUP BY operation to perform aggregations in our queries. Consider the case where we have data with of retail store inventory. Every month, we have shipped products to different stores with different product types like clothing, home appliances etc. Now we want to calculate that how many products have we shipped to each store according to product types as well as total products shipped irrespective of its type. This will not be possible in a single group by statement. This is where GROUPING SETS, ROLLUP and CUBE come in to picture.

GROUPING SETS :

Consider we have data in the following format.

 

Existing data

 

Now we need to calculate total products shipped to store in each category as well as total products across all product types. We can use GROUPING SETS for this.

This query will give us output in the following format.

 

GROUPING SET output

 

We have calculated the number of products shipped to each store for each product type as well as the total number of products shipped to that store. Where the product_type column is null we have the total sum of products sent to that store across all product types. For achieving the same result with GROUP BY, we would need to write following query. We can see that this query is clearly inefficient as it reads entire table twice to get us the result which GROUPING SET can do it one scan.

Consider another example,

To get the same result with GROUP BY we will need the following query.

We can use GROUPING_SET to get all products shipped from inventory to all stores and of all product types. If we provide blank set in GROUPING SET clause we will get total products shipped from this inventory.

 

Grouping set output

 

Here, the first row shows us the total number of products shipped to all stores and of all product types. We can pivot required values to another column instead of a row in the output. To know how we can pivot rows to columns you can check Pivot rows to columns in Hive

 

GROUPING__ID:

As we can see, while using GROUPING SETS we have columns with null values. It may be confusing sometimes to know which group is used to perform aggregation. We can use GROUPING__ID function to know which group is used for that aggregation results.

 

Grouping__id output

 

In this case,

0 = no group

1 = aggrigation performed over one column i.e stored_id

3= aggrigation performed over both column i.e store_id, product_type

 

ROLLUP and CUBE:

ROLLUP AND CUBE are functions built on GROUPING SETS.

In case of ROLLUP,

This is equivalent to the following GROUPING SET query.

 

CUBE considers all possible combination of columns present in GROUP BY clause.

This is equivalent to the following GROUPING SET query.

 

This is how we can perform enhanced aggregations in Hive which are optimized than normal GROUP BY clause.

 

I am passionate about data analytics, machine learning, and artificial intelligence. Recently I have started blogging about my experience while learning these exciting technologies.

Leave a Reply

Close Menu