Hello everyone, in the last article on Sqoop Import we saw how to use this command to transfer a single table to HDFS. There are cases when we do not want all the data from a table, or not all of its columns. In such cases, we can use a filter and/or a free-form query. There is also the possibility that a table has no primary key; in that case, we need to tell Sqoop how to split the data between mappers. Let us see how to use these options in practice.

Sqoop Import with filter:

Consider a case where we want to filter rows from a table while importing it to HDFS, or where we do not want to bring all of its columns. We can use Sqoop arguments to get this job done.

For this demonstration, I am using 1 mapper for simplicity. Following are the contents of the departments table in MySQL.

[Image: contents of the departments table in MySQL]

Now we want to bring only the first 5 departments. We can specify this with a where clause in Sqoop Import, as the sketch below shows.
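A minimal sketch of such a command, assuming a local MySQL database named employees, placeholder credentials, and dept_no values running from d001 to d009 (so the first 5 departments are d001 through d005):

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username myuser \
  --password mypassword \
  --table departments \
  --where "dept_no <= 'd005'" \
  --num-mappers 1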

In the output, we can see that only the first 5 departments have been transferred to HDFS, because of the where clause we specified in the import command.

[Image: HDFS output showing only the first 5 departments]

Sqoop Import also has a --columns argument which we can use to select the columns that we need to transfer to HDFS. This is useful when we do not want the entire set of table columns transferred.
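A sketch with the same assumed connection details; --columns takes a comma-separated list of column names:

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username myuser \
  --password mypassword \
  --table departments \
  --columns "dept_no" \
  --num-mappers 1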

The above command will bring only the dept_no column from the departments table.


Sqoop Import with free-form queries:

We can also give a query instead of a table name to Sqoop Import. This is very useful when we want to perform some aggregation or a join before transferring data to HDFS. Consider the case where we want to bring employees together with their respective departments. We need to join the employees, departments and dept_emp tables.

When giving a free-form query, we need to give a target directory so that Sqoop knows where to keep the data in HDFS. We also need to include a 'WHERE $CONDITIONS' clause in the query. This is Sqoop syntax, and it helps Sqoop split our data across different mappers. Note that we are using only one mapper in this case; if we try to use more than one, the Sqoop job will fail. A sketch of this import follows, and then we will see why that is and how to get around it.
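This sketch assumes the same connection details as above, a made-up target directory, and the usual emp_no/dept_no column names in the three tables:

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username myuser \
  --password mypassword \
  --query 'SELECT e.emp_no, e.first_name, e.last_name, d.dept_name
           FROM employees e
           JOIN dept_emp de ON e.emp_no = de.emp_no
           JOIN departments d ON de.dept_no = d.dept_no
           WHERE $CONDITIONS' \
  --target-dir /user/hduser/emp_dept \
  --num-mappers 1

The query is wrapped in single quotes so that the shell does not expand $CONDITIONS as a variable; if it were wrapped in double quotes, it would have to be escaped as \$CONDITIONS.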

Sqoop Import with split-by:

As we mentioned in the first article, Sqoop uses the table's primary key to split data across mappers. But what if the table does not have a primary key, or has a composite primary key? And in a free-form query there is no concept of a primary key at all. In all these cases, Sqoop cannot split the data across mappers, and if we use more than 1 mapper we will get an error.

But using 1 mapper is not efficient, so in such cases we have to tell Sqoop which column it should use to split the data. This is done with the --split-by argument. We should choose a column that has a fair distribution of values so that Sqoop can split the data evenly. Let us take the above example and run it with the --split-by argument.
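A sketch of the same import with 4 mappers; the connection details and directory are again assumptions. Because emp_no appears in more than one of the joined tables, the split column is qualified as e.emp_no, and an explicit --boundary-query is supplied so that Sqoop can compute the minimum and maximum split values without referencing the table alias outside the query:

sqoop import \
  --connect jdbc:mysql://localhost/employees \
  --username myuser \
  --password mypassword \
  --query 'SELECT e.emp_no, e.first_name, e.last_name, d.dept_name
           FROM employees e
           JOIN dept_emp de ON e.emp_no = de.emp_no
           JOIN departments d ON de.dept_no = d.dept_no
           WHERE $CONDITIONS' \
  --split-by e.emp_no \
  --boundary-query 'SELECT MIN(emp_no), MAX(emp_no) FROM employees' \
  --target-dir /user/hduser/emp_dept_split \
  --num-mappers 4

Note that Sqoop fails if the target directory already exists, which is why this run writes to a new directory.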

Here is the output of this query. We can see that our data is now split into 4 files, because we told Sqoop to split the data using the emp_no column of the employees table.

[Image: HDFS output split into 4 part files]

This is how we can use the Sqoop Import command to transfer data to HDFS, using different arguments as per our needs. In the next article, we will see how we can use Sqoop to import tables directly into Hive.

I am passionate about data analytics, machine learning, and artificial intelligence. Recently I have started blogging about my experience while learning these exciting technologies.
