Installing and using Apache sqoop to export/Import large datasets (MySQL, S3) (CSV/TSV..) on EMR cluster

Often times the export/import activity may be limited on several performance bottlenecks. So, the activity may be faster if a distributed transfer is used instead of normal transfer. Some of the bottlenecks include Read Throughput , Write throughput , how the code parses the data(Inline or Batch etc,. Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

http://sqoop.apache.org/

This guide shows you > To Install sqoop and > Export/Import MySQL tables (from S3 to RDS) ,(from RDS to S3) respectively.

Considering a sample MySQL Table in RDS

> mysql -h myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com -u mannem -p

mysql> describe dailyStockPrices;

+----------+--------------+------+-----+-------------------+-------+
| Field    | Type         | Null | Key | Default           | Extra |
+----------+--------------+------+-----+-------------------+-------+
| symbol   | varchar(10)  | YES  |     | NULL              |       |
| name     | varchar(100) | YES  |     | NULL              |       |
| ask      | varchar(10)  | YES  |     | NULL              |       |
| bid      | varchar(10)  | YES  |     | NULL              |       |
| daysLow  | varchar(10)  | YES  |     | NULL              |       |
| daysHigh | varchar(10)  | YES  |     | NULL              |       |
| ts       | timestamp    | YES  |     | CURRENT_TIMESTAMP |       |
+----------+--------------+------+-----+-------------------+-------+
7 rows in set (0.00 sec)

mysql> SELECT table_name AS “Table”, -> round(((data_length + index_length) / 1024 / 1024), 2) “Size in MB” -> FROM information_schema.TABLES -> WHERE table_schema = “pipetest” -> AND table_name = “dailyStockPrices”;

+--------------------+------------+
| Table            | Size in MB   |
+--------------------+------------+
| dailyStockPrices | 5373.00      |
+--------------------+------------+
1 row in set (0.00 sec)

My MySQL table dataset has lot of commas in the fields, so I choose TSV format instead of CSV to import/export. If I used CSV format, Sqoop will get confused parsing data.

Sqoop on EMR 4.4.0 + is pre-installed

Starting from EMR AMI version 4.4.0 , Sqoop 1.4.6 is available as sandbox. This can be installed by simply selecting this option while provisioning the EMR cluster. By default, Sqoop on EMR has a MariaDB and PostgresSQL driver installed. To install an alternate set of JDBC connectors for Sqoop, you need to install them in /usr/lib/sqoop/lib.

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-sandbox.html#emr-sqoop

Sqoop on EMR 3.x.x can be installed with the following script:

# Install Apache sqoop on Hadoop Cluster's Master node

#!/bin/bash
#You may put this script in your s3 bucket: s3://<BUCKET_NAME>/install_sqoop.sh
# This script works on all EMR 3.x.x AMI's and can easily be extended for EMR 4.x.x AMI's based on directory strutcture. 
# There is not need to run this script if EMR AMI is 4.4.0 or greater. 

# Download scoop from one of the mirrors.
#A sample tar for hadoop 2 is "sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz" from mirror :
# http://apache.arvixe.com/sqoop/1.4.6/
# And place this tarball in your s3 bucket

hadoop fs -copyToLocal s3://<BUCKET_NAME>/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz .
# wget http://apache.arvixe.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz

tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
cp sqoop-1.4.6.bin__hadoop-2.0.4-alpha /home/hadoop/

# Download mysql connector library and copy this to your s3 bucket
# A sample library can be found in mirror
# http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz

hadoop fs -copyToLocal s3://< bucket_name >/mysql-connector-java-5.1.36.tar.gz .
# wget http://mirrors.dotsrc.org/mysql/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz

tar -xzf mysql-connector-java-5.1.36.tar.gz
cp mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib/

# Add symlink, so that sqoop can be directly invoked from shell instead of absolute path.
sudo ln -s /home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/bin/sqoop /usr/bin/sqoop


#end

Import Commands:

Import sqoop: This command copies MySQL table from RDS to S3. The S3 file content type is TSV & File name will be in "part-m-00000" format.; Note that with -m 1 , I am using single mapper task to run in parallel.; sqoop import --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --target-dir s3://mannem/sqoopmatrix -m 1 --fields-terminated-by '\t' --lines-terminated-by '\n'
Check S3 contents: hadoop fs -cat s3://mannem/sqoopmatrix/part-m-00000
Sqoop command usage:: http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

Export commands :

Before export, The Destination MySQL/PSQL table should already be created with a similar schema.

Export to RDS(MySQL): This command copies TSV file from S3 to MySQL Table.; sqoop export --connect jdbc:mysql://myrds.crezaaaruhfx.us-west-2.rds.amazonaws.com/pipetest --username mannem --password Password123 --table dailyStockPrices --input-fields-terminated-by '\t' --input-lines-terminated-by '\n' --export-dir s3://mannem/sqoopmatrix/part-m-00000
Export to Redshift(PSQL): sqoop export --connect jdbc:redshift://$MYREDSHIFTHOST:5439/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver com.amazon.redshift.jdbc41.Driver --username master --password Mymasterpass1
Export commands with mariadb connection string: sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1
Export with using Secure Socket Layer encryption: sqoop export --connect jdbc:mariadb://$HOSTNAME:3306/mydb?verifyServerCertificate=false&useSSL=true&requireSSL=true --table mysqoopexport --export-dir s3://mybucket/myinputfiles/ --driver org.mariadb.jdbc.Driver --username master --password Mymasterpass1