Sqoop Data Migration Tool
1. Introduction to Sqoop
Sqoop is a tool in the Hadoop ecosystem for efficiently transferring bulk data between relational databases and Hadoop. It supports importing data from relational databases into Hadoop storage systems such as HDFS, Hive, and HBase, and exporting data from those storage systems back into relational databases.
1.1 Sqoop Design Goals
- Efficiency: uses MapReduce's parallel processing capacity to transfer large-scale data efficiently
- Reliability: supports resumable transfers and failure recovery
- Ease of use: provides a simple command-line interface
- Extensibility: supports many databases and Hadoop storage systems
- Security: supports security mechanisms such as Kerberos authentication
1.2 Sqoop Application Scenarios
- Data import: moving data from relational databases into Hadoop storage systems
- Data export: moving data from Hadoop storage systems into relational databases
- ETL pipelines: serving as the extract and load stages of an ETL process
- Data migration: moving data between different database systems
- Data integration: consolidating data from different sources into a unified storage system
2. Sqoop Core Concepts
2.1 Connectors
Connectors are Sqoop's core components for interacting with different database systems. Sqoop ships with several built-in connectors, including:
- JDBC Connector: supports many databases through a generic JDBC connection
- MySQL Connector: dedicated to MySQL
- PostgreSQL Connector: dedicated to PostgreSQL
- Oracle Connector: dedicated to Oracle
- SQL Server Connector: dedicated to SQL Server
- Netezza Connector: dedicated to IBM Netezza
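For a database without a dedicated connector, Sqoop falls back to the generic JDBC connector when a driver class is named explicitly with --driver. A minimal sketch, where the driver class and connection URL are placeholders (the command is only printed, since executing it needs a live database):

```shell
# Fall back to the generic JDBC connector by naming the driver class.
# com.example.jdbc.Driver and jdbc:example://... are placeholders for a
# database without a bundled connector; -P prompts for the password.
cmd='sqoop list-tables \
  --driver com.example.jdbc.Driver \
  --connect jdbc:example://dbhost:9999/mydb \
  --username sqoop_user -P'
echo "$cmd"
```

The corresponding JDBC driver jar must also be copied into $SQOOP_HOME/lib/, as shown in the installation steps below.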
2.2 Import
An import transfers data from a relational database into a Hadoop storage system. Sqoop supports the following import modes:
- Table import: imports an entire table into Hadoop
- Query import: imports the result set of a SQL query into Hadoop
- Incremental import: imports only newly added data
- Parallel import: uses MapReduce's parallelism to speed up the import
2.3 Export
An export transfers data from a Hadoop storage system into a relational database. Sqoop supports the following export modes:
- Table export: exports data from Hadoop into a relational database table
- Batch export: processes data in batches to improve export throughput
- Parallel export: uses MapReduce's parallelism to speed up the export
2.4 MapReduce Jobs
Under the hood, Sqoop uses MapReduce jobs to transfer data in parallel. For both import and export operations, Sqoop generates a MapReduce job that contains only map tasks; no reduce phase is needed.
3. Sqoop Architecture
3.1 Components
The Sqoop architecture consists of the following main components:
- Command-line interface: how users interact with Sqoop
- Sqoop core: parses user requests and generates MapReduce jobs
- Connector manager: manages the connectors for different databases
- MapReduce jobs: carry out the actual data transfer
- Hadoop storage systems: HDFS, Hive, HBase, and so on
- Relational databases: MySQL, Oracle, SQL Server, and so on
3.2 Workflow
A Sqoop run proceeds as follows:
- The user submits a Sqoop command through the command-line interface
- The Sqoop core parses the command and extracts the database connection information and operation parameters
- The Sqoop core uses a connector to connect to the relational database
- The Sqoop core retrieves the table's metadata from the database
- The Sqoop core generates the MapReduce job configuration
- The Sqoop core submits the MapReduce job to the Hadoop cluster
- The MapReduce job performs the data transfer
- When the job completes, the Sqoop core returns the result to the user
4. Sqoop Installation and Configuration
4.1 Prerequisites
Before installing Sqoop, install the following software:
- Java JDK 1.8+
- Hadoop 2.0+
- A relational database (e.g. MySQL or Oracle)
4.2 Installation Steps
# 1. Download Sqoop
$ wget https://dlcdn.apache.org/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

# 2. Extract Sqoop
$ tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
$ mv sqoop-1.4.7.bin__hadoop-2.6.0 /opt/sqoop

# 3. Set environment variables
$ export SQOOP_HOME=/opt/sqoop
$ export PATH=$PATH:$SQOOP_HOME/bin

# 4. Configure Sqoop
$ cp $SQOOP_HOME/conf/sqoop-env-template.sh $SQOOP_HOME/conf/sqoop-env.sh
$ echo "export HADOOP_COMMON_HOME=/opt/hadoop" >> $SQOOP_HOME/conf/sqoop-env.sh
$ echo "export HADOOP_MAPRED_HOME=/opt/hadoop" >> $SQOOP_HOME/conf/sqoop-env.sh

# 5. Add the database JDBC driver
$ cp mysql-connector-java-8.0.28.jar $SQOOP_HOME/lib/

# 6. Verify the installation
$ sqoop version
4.3 Configuration Files
Sqoop's main configuration files are:
- sqoop-env.sh: environment variables for Hadoop and other dependencies
- sqoop-site.xml: Sqoop's core parameters
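As a sketch of what sqoop-site.xml can hold, the snippet below sets one real Sqoop property, sqoop.metastore.client.record.password, which controls whether saved jobs store database passwords in the metastore. The file is written to a scratch directory here for illustration; in a real installation it lives under $SQOOP_HOME/conf/:

```shell
# Write a minimal sqoop-site.xml to a scratch directory for illustration.
conf_dir=$(mktemp -d)
cat > "$conf_dir/sqoop-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Do not store database passwords in the saved-job metastore -->
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>false</value>
  </property>
</configuration>
EOF
echo "wrote $conf_dir/sqoop-site.xml"
```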
5. Basic Sqoop Operations
5.1 Listing Databases
# List all databases on the MySQL server
$ sqoop list-databases \
    --connect jdbc:mysql://localhost:3306/ \
    --username root \
    --password password
5.2 Listing Tables
# List the tables in a MySQL database
$ sqoop list-tables \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password
5.3 Importing a Table into HDFS
# Import a MySQL table into HDFS
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --delete-target-dir \
    --num-mappers 4 \
    --fields-terminated-by ','
5.4 Importing Query Results into HDFS
# Import the result of a SQL query into HDFS; parallel query imports
# require --split-by so Sqoop can partition the result set
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --query "SELECT * FROM employees WHERE department_id=101 AND \$CONDITIONS" \
    --split-by employee_id \
    --target-dir /user/hadoop/employees_dept_101 \
    --delete-target-dir \
    --num-mappers 4 \
    --fields-terminated-by ','
5.5 Importing a Table into Hive
# Import a MySQL table into Hive
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --hive-import \
    --hive-table mydb.employees \
    --create-hive-table \
    --delete-target-dir \
    --num-mappers 4
5.6 Importing a Table into HBase
# Import a MySQL table into HBase
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --hbase-table employees \
    --column-family info \
    --hbase-row-key employee_id \
    --num-mappers 4
5.7 Exporting HDFS Data to MySQL
# Export data from HDFS into MySQL
$ sqoop export \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees_export \
    --export-dir /user/hadoop/employees \
    --input-fields-terminated-by ',' \
    --num-mappers 4
6. Advanced Sqoop Features
6.1 Incremental Import
Incremental import brings in only newly added or modified data, avoiding duplicate imports. Sqoop supports two incremental modes:
6.1.1 Append-Mode Incremental Import
# Append mode: import rows whose check column value exceeds --last-value
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --incremental append \
    --check-column employee_id \
    --last-value 1000 \
    --num-mappers 4
6.1.2 Timestamp-Based Incremental Import
# lastmodified mode: import rows whose timestamp is newer than --last-value
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --incremental lastmodified \
    --check-column last_update_time \
    --last-value "2023-01-01 00:00:00" \
    --append \
    --num-mappers 4
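Tracking --last-value by hand is error-prone. Sqoop's saved-job facility (sqoop job --create / --exec) records the last imported value in its metastore and advances it automatically after each successful run. A minimal sketch, where the job name and initial value are illustrative; the commands are only printed, since executing them needs a live database and Hadoop cluster:

```shell
# A saved job remembers --last-value between runs, so each --exec only
# imports rows added since the previous run. Printed, not executed:
# running these requires a live MySQL instance and Hadoop cluster.
create_cmd='sqoop job --create employees_incr -- import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password password \
  --table employees \
  --target-dir /user/hadoop/employees \
  --incremental append \
  --check-column employee_id \
  --last-value 0'
exec_cmd='sqoop job --exec employees_incr'
echo "$create_cmd"
echo "$exec_cmd"
```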
6.2 Data Compression
Sqoop can compress data as it is imported, reducing storage space and network transfer. Supported compression formats include:
- gzip
- bzip2
- lz4
- snappy
# Import data compressed with gzip
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --compress \
    --compression-codec org.apache.hadoop.io.compress.GzipCodec \
    --num-mappers 4
6.3 Data Transformation
Sqoop can transform data during import, including:
6.3.1 Column Selection
# Import only the listed columns
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --columns "employee_id,first_name,last_name,salary" \
    --num-mappers 4
6.3.2 Field Delimiters
# Set the field and line delimiters
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --target-dir /user/hadoop/employees \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n' \
    --num-mappers 4
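Another transformation that often matters in practice is NULL handling: by default Sqoop writes SQL NULLs as the string "null", while Hive expects \N as its null token. The --null-string and --null-non-string import options control this. A sketch (printed only, since running it needs a live cluster):

```shell
# Map SQL NULLs to Hive's \N null token during import.
# Printed rather than executed: requires a live MySQL and Hadoop cluster.
cmd='sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password password \
  --table employees \
  --target-dir /user/hadoop/employees \
  --null-string "\\N" \
  --null-non-string "\\N" \
  --num-mappers 4'
echo "$cmd"
```

The matching export options are --input-null-string and --input-null-non-string, so round trips through Hive preserve NULLs.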
7. Sqoop Integration Examples
7.1 Integrating with Hive
Importing MySQL data into Hive with Sqoop:
# 1. Create the Hive database
$ hive -e "CREATE DATABASE IF NOT EXISTS mydb;"

# 2. Import data into Hive
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --hive-import \
    --hive-table mydb.employees \
    --create-hive-table \
    --hive-overwrite \
    --num-mappers 4 \
    --fields-terminated-by ',' \
    --lines-terminated-by '\n'
7.2 Integrating with HBase
Importing MySQL data into HBase with Sqoop:
# 1. Make sure HBase is running
$ start-hbase.sh

# 2. Import data into HBase
$ sqoop import \
    --connect jdbc:mysql://localhost:3306/mydb \
    --username root \
    --password password \
    --table employees \
    --hbase-table employees \
    --column-family info \
    --hbase-row-key employee_id \
    --hbase-create-table \
    --num-mappers 4
7.3 Integrating with Oozie
Scheduling a Sqoop job with Oozie:
# Create the Oozie workflow XML file
$ vim sqoop-workflow.xml

<workflow-app xmlns="uri:oozie:workflow:0.4" name="sqoop-workflow">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect jdbc:mysql://localhost:3306/mydb --username root --password password --table employees --target-dir /user/hadoop/employees --num-mappers 4</command>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Hands-On Exercises
Exercise 1: Sqoop Installation and Configuration
Install and configure Sqoop, making sure it can connect to a MySQL database.
Exercise 2: Basic Import
Import a MySQL table into HDFS and verify that the data was imported correctly.
Exercise 3: Importing into Hive
Configure Sqoop and import a MySQL table into Hive.
Exercise 4: Importing into HBase
Configure Sqoop and import a MySQL table into HBase.
Exercise 5: Incremental Import
Use Sqoop's incremental import feature to import only newly added data.
Exercise 6: Data Export
Export data from HDFS into a MySQL database.
8. Sqoop Best Practices
8.1 Performance Tuning
- Tune parallelism: adjust the --num-mappers parameter to match the database's capacity and the Hadoop cluster's resources
- Use batch operations: enable batching on exports to group statements and increase throughput
- Use compression: compress imported data to reduce storage space and network transfer
- Optimize queries: for query imports, optimize the SQL and add appropriate indexes
- Choose the split column carefully: pick an evenly distributed column as the split column to balance work across mappers
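Several of these knobs can be combined in one command. The sketch below uses the table and column names from the earlier examples: an import with an explicit split column, a higher mapper count, and MySQL's direct (mysqldump-based) path, plus an export with --batch for batched JDBC statements. Commands are printed rather than executed, since they need a live cluster:

```shell
# Tuned import: explicit split column, more mappers, MySQL direct mode.
import_cmd='sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password password \
  --table employees \
  --target-dir /user/hadoop/employees \
  --split-by employee_id \
  --num-mappers 8 \
  --direct'
# Tuned export: --batch groups rows into batched JDBC statements.
export_cmd='sqoop export \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password password \
  --table employees_export \
  --export-dir /user/hadoop/employees \
  --batch \
  --num-mappers 4'
echo "$import_cmd"
echo "$export_cmd"
```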
8.2 Reliability
- Use resumable transfers: for large-scale imports, enable resume-on-failure support
- Back up regularly: back up imported data to guard against data loss
- Monitor job status: use YARN or Ambari to monitor Sqoop job execution
- Set sensible timeouts: choose timeouts appropriate to the data volume and network conditions
8.3 Security Considerations
- Protect passwords: avoid typing passwords directly on the command line; use the --password-file parameter to read the password from a file
- Use Kerberos authentication: enable Kerberos in environments with strict security requirements
- Restrict database permissions: grant the database user that Sqoop uses only the minimum necessary privileges
- Encrypt in transit: enable SSL/TLS transport encryption for sensitive data
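The password-file approach can be sketched as follows: the password lives in a file readable only by its owner, and the Sqoop command references the file instead of embedding the secret. The file path and password below are placeholders, and the sqoop command is printed rather than run, since it needs a live cluster:

```shell
# Store the password in an owner-read-only file (placeholder values).
pass_file=$(mktemp)
printf '%s' 'password' > "$pass_file"
chmod 400 "$pass_file"
# The import then references the file instead of taking --password:
echo "sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root \
  --password-file file://$pass_file \
  --table employees \
  --target-dir /user/hadoop/employees"
```

This keeps the password out of shell history and out of the process list visible to other users.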
8.4 Troubleshooting Common Issues
- Connection timeouts: check network connectivity and database status, and adjust timeout settings
- Out-of-memory errors: adjust the MapReduce memory configuration, e.g. increase mapreduce.map.java.opts
- Incompatible data types: check the type mapping between the database table and the Hadoop storage system, and adjust it with the --map-column-hive or --map-column-java parameters
- Split column problems: choose an evenly distributed split column to avoid data skew
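For the type-mapping case, an override can be sketched like this: forcing a column that would otherwise land with an unwanted inferred type into Hive's DECIMAL type. The column name is taken from the earlier examples; the command is printed only, since it needs a live cluster:

```shell
# Override the inferred Hive type for one column via --map-column-hive.
# Printed rather than executed: requires a live MySQL and Hadoop cluster.
cmd='sqoop import \
  --connect jdbc:mysql://localhost:3306/mydb \
  --username root --password password \
  --table employees \
  --hive-import \
  --hive-table mydb.employees \
  --map-column-hive salary=DECIMAL \
  --num-mappers 4'
echo "$cmd"
```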
9. Summary
This tutorial covered the basic concepts, architecture, and usage of the Sqoop data migration tool. As an important component of the Hadoop ecosystem, Sqoop provides efficient, reliable migration of large-scale data between relational databases and Hadoop storage systems.
Sqoop's core features are data import and export, with support for many database connectors and Hadoop storage systems. With sensible configuration and tuning, Sqoop delivers high-performance, highly reliable data migration; in practice it is widely used for ETL pipelines, data migration, and data integration.
When using Sqoop, follow the best practices around performance tuning, reliability, and security to keep the system stable and efficient.