Oozie Workflow Scheduling System

A deep dive into the Oozie workflow scheduling system: its basic concepts, architecture, working principles, and usage.

1. Introduction to Oozie

Oozie is a workflow scheduling system for managing Hadoop jobs. It executes a series of Hadoop jobs in a predefined order and under predefined conditions. Oozie supports many job types, including MapReduce, Pig, Hive, and Sqoop, and is an important component of the Hadoop ecosystem.

1.1 Oozie Design Goals

  • Reliability: ensures workflows run reliably and recover from failures
  • Extensibility: supports many Hadoop job types as well as custom job types
  • Flexible scheduling: supports both time-based and event-based scheduling
  • Visual management: provides a web interface for monitoring and managing workflows
  • Hadoop ecosystem integration: integrates seamlessly with Hadoop, YARN, HDFS, and other components

1.2 Typical Oozie Use Cases

  • ETL pipelines: manage complex ETL job flows
  • Data analysis pipelines: schedule a series of data analysis jobs
  • Scheduled data processing: run data processing tasks at predefined times
  • Dependent job execution: manage job sequences that have dependencies
  • Batch job scheduling: schedule large-scale batch jobs

2. Core Oozie Concepts

2.1 Workflow

The workflow is Oozie's core concept. It consists of a set of job nodes arranged in a directed acyclic graph (DAG): each node represents a job or a control-flow operation, and the edges between nodes express execution order and dependencies.

2.2 Coordinator

A coordinator schedules workflows based on time and data availability. It supports both periodic scheduling and data-triggered scheduling, starting workflow instances at predefined intervals or when expected data arrives.

2.3 Bundle

A bundle groups multiple coordinator applications into a larger job collection, letting users manage and monitor several coordinator applications together.
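As a sketch, a bundle definition simply lists the coordinators it groups; the names and HDFS paths below are invented for illustration:

```xml
<bundle-app name="etl-bundle" xmlns="uri:oozie:bundle:0.2">
    <!-- each <coordinator> points at a deployed coordinator application -->
    <coordinator name="ingest-coord">
        <app-path>${nameNode}/user/${hdfsUser}/apps/ingest-coordinator</app-path>
    </coordinator>
    <coordinator name="report-coord">
        <app-path>${nameNode}/user/${hdfsUser}/apps/report-coordinator</app-path>
    </coordinator>
</bundle-app>
```

Starting or killing the bundle job then applies to all the coordinators it contains.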

2.4 Action

An action is the basic execution unit in a workflow and represents a concrete job or operation. Oozie supports many action types, including:

  • MapReduce action: runs a MapReduce job
  • Pig action: runs a Pig script
  • Hive action: runs a Hive query or script
  • Sqoop action: runs a Sqoop import or export job
  • Shell action: runs a shell command or script
  • Java action: runs a Java class
  • Decision action: executes different branches depending on a condition
  • Fork/Join action: enables parallel execution and synchronization
  • Email action: sends an email notification
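For instance, a shell action inside a workflow definition might look like the sketch below; the script name, file path, and transition targets are assumptions for illustration:

```xml
<action name="cleanup">
    <shell xmlns="uri:oozie:shell-action:0.3">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>cleanup.sh</exec>
        <argument>${inputDir}</argument>
        <!-- ship the script from HDFS into the action's working directory -->
        <file>${nameNode}/user/${hdfsUser}/scripts/cleanup.sh#cleanup.sh</file>
        <!-- make the script's stdout available to later EL expressions -->
        <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Every action follows this pattern: the job-specific element (`shell`, `map-reduce`, `hive`, ...) plus `ok` and `error` transitions.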

2.5 Control Nodes

Control nodes direct a workflow's execution flow, including:

  • Start node: the workflow's entry point
  • End node: the workflow's exit point
  • Kill node: terminates the workflow
  • Decision node: chooses an execution path based on a condition
  • Fork node: runs multiple branches in parallel
  • Join node: waits for all parallel branches to complete
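A minimal sketch of how fork/join and a decision node combine (the action names and the EL predicate are invented for the example):

```xml
<fork name="parallel-steps">
    <path start="load-users"/>
    <path start="load-orders"/>
</fork>
<!-- both forked actions transition ok to the join -->
<join name="wait-for-loads" to="check-mode"/>
<decision name="check-mode">
    <switch>
        <!-- take the full branch when the fullRefresh property is "true" -->
        <case to="full-refresh">${fullRefresh eq "true"}</case>
        <default to="incremental-refresh"/>
    </switch>
</decision>
```

Every path started by a fork must eventually reach the matching join, and a decision's cases are evaluated top to bottom with `default` as the fallback.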

3. Oozie Architecture

3.1 Components

The Oozie architecture consists mainly of the following components:

  • Oozie Server: handles client requests and manages workflow execution
  • Oozie Client: provides a command-line interface and a REST API for submitting and managing workflows
  • Web Console: provides a web interface for monitoring and managing workflows
  • Database: stores workflow metadata and state information
  • Hadoop integration: integrates with Hadoop, YARN, HDFS, and other ecosystem components

3.2 Workflow Execution Flow

Oozie executes a workflow as follows:

  1. A user submits a workflow definition through the Oozie client or web console
  2. The Oozie server stores the workflow definition in the database
  3. The Oozie server creates a workflow instance
  4. The server executes the workflow's job nodes in the defined order
  5. For each job node, the server submits the corresponding job to the Hadoop cluster
  6. The Hadoop cluster runs the job and returns the result
  7. The server updates the state of the workflow instance
  8. Steps 4-7 repeat until the workflow completes or fails
  9. The server notifies the user of the workflow's result

4. Installing and Configuring Oozie

4.1 Prerequisites

Before installing Oozie, install the following software:

  • Java JDK 1.8+
  • Hadoop 2.0+
  • A relational database (e.g. MySQL or PostgreSQL)
  • Tomcat 7.0+ (optional, for deploying the Oozie web console)

4.2 Installation Steps

# 1. Download Oozie
$ wget https://dlcdn.apache.org/oozie/5.2.1/oozie-5.2.1.tar.gz

# 2. Extract Oozie
$ tar -xzvf oozie-5.2.1.tar.gz
$ mv oozie-5.2.1 /opt/oozie

# 3. Configure environment variables
$ export OOZIE_HOME=/opt/oozie
$ export PATH=$PATH:$OOZIE_HOME/bin

# 4. Configure Oozie
$ cp $OOZIE_HOME/conf/oozie-site.xml.template $OOZIE_HOME/conf/oozie-site.xml

# 5. Edit oozie-site.xml and set the database connection properties
#    (values shown assume a local MySQL database and Connector/J 8.x):
#      oozie.service.JPAService.jdbc.driver   = com.mysql.cj.jdbc.Driver
#      oozie.service.JPAService.jdbc.url      = jdbc:mysql://localhost:3306/oozie?createDatabaseIfNotExist=true
#      oozie.service.JPAService.jdbc.username = root
#      oozie.service.JPAService.jdbc.password = password

# 6. Add the JDBC driver jar to libext/ so prepare-war bundles it
$ mkdir -p $OOZIE_HOME/libext
$ cp mysql-connector-java-8.0.28.jar $OOZIE_HOME/libext/

# 7. Build the Oozie WAR package
$ $OOZIE_HOME/bin/oozie-setup.sh prepare-war

# 8. Initialize the Oozie database
$ $OOZIE_HOME/bin/ooziedb.sh create -sqlfile oozie.sql -run

# 9. Start the Oozie service
$ $OOZIE_HOME/bin/oozied.sh start

# 10. Verify the Oozie service
$ oozie admin -oozie http://localhost:11000/oozie -status

4.3 Configuration Files

Oozie's main configuration files include:

  • oozie-site.xml: core Oozie parameters, such as the database connection and port numbers
  • oozie-env.sh: Oozie environment variables, such as the Java and Hadoop paths
  • log4j.properties: Oozie logging configuration
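As a sketch, an oozie-env.sh might set the following; the ports shown are Oozie's usual defaults, and the heap size is an illustrative assumption:

```shell
# Minimal oozie-env.sh sketch -- adjust values for your deployment.
export OOZIE_HTTP_PORT=11000    # HTTP port the Oozie server listens on
export OOZIE_ADMIN_PORT=11001   # admin port
export CATALINA_OPTS="${CATALINA_OPTS} -Xmx1024m"  # server JVM heap (Tomcat deployments)
```

Restart the Oozie server after changing any of these values.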

5. Basic Oozie Operations

5.1 Submitting a Workflow

# Submit a workflow
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run

5.2 Checking Workflow Status

# List all workflow instances
$ oozie jobs -oozie http://localhost:11000/oozie

# Show the status of a specific workflow instance
$ oozie job -oozie http://localhost:11000/oozie -info job_id

5.3 Controlling a Workflow

# Suspend a workflow
$ oozie job -oozie http://localhost:11000/oozie -suspend job_id

# Resume a workflow
$ oozie job -oozie http://localhost:11000/oozie -resume job_id

# Kill a workflow
$ oozie job -oozie http://localhost:11000/oozie -kill job_id

5.4 Viewing Workflow Logs

# View a workflow's logs
$ oozie job -oozie http://localhost:11000/oozie -log job_id

6. Defining Oozie Workflows

6.1 Workflow XML Structure

Oozie workflows are defined in XML. The main parts are shown in this WordCount example (the new-API switches are needed for the mapreduce.job.* class properties to take effect):

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>default</value>
                </property>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapreduce.job.map.class</name>
                    <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
                </property>
                <property>
                    <name>mapreduce.job.reduce.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapreduce.input.fileinputformat.inputdir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapreduce.output.fileoutputformat.outputdir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
6.2 Job Properties File

The job properties file defines the variables a workflow uses, such as the Hadoop cluster addresses and the input and output paths:

# job.properties
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
hdfsUser=hadoop
oozie.wf.application.path=${nameNode}/user/${hdfsUser}/examples/workflow
inputDir=/user/hadoop/input
outputDir=/user/hadoop/output
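When actions such as Hive or Sqoop need jars from the Oozie share-lib on their classpath, it is common to also set the following property in job.properties:

```properties
# Put the Oozie share-lib on the action classpath (needed by Hive, Sqoop, etc.)
oozie.use.system.libpath=true
```

Without it, such actions typically fail with ClassNotFoundException for their client libraries.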

6.3 Example: A Hive Workflow

<workflow-app name="hive-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapreduce.job.queuename</name>
                    <value>default</value>
                </property>
            </configuration>
            <script>script.hql</script>
            <param>INPUT_DIR=${inputDir}</param>
            <param>OUTPUT_DIR=${outputDir}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

7. Oozie Coordinators

7.1 Coordinator Definition

A coordinator schedules workflows based on time and data availability. Its XML structure looks like the following:

<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <controls>
        <timeout>3</timeout>
        <concurrency>1</concurrency>
        <throttle>1</throttle>
    </controls>
    <datasets>
        <dataset name="input" frequency="${coord:days(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <uri-template>${nameNode}/user/${hdfsUser}/input/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${workflowPath}</app-path>
            <configuration>
                <property>
                    <name>jobTracker</name>
                    <value>${jobTracker}</value>
                </property>
                <property>
                    <name>nameNode</name>
                    <value>${nameNode}</value>
                </property>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
                <property>
                    <name>outputDir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>

7.2 Submitting a Coordinator

# Submit a coordinator job
$ oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run
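The coordinator is submitted with its own properties file. A sketch of what coordinator.properties might contain; every value below is a placeholder, and oozie.coord.application.path must point at the HDFS directory holding coordinator.xml:

```properties
nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
hdfsUser=hadoop
# HDFS path of the coordinator application
oozie.coord.application.path=${nameNode}/user/${hdfsUser}/examples/coordinator
# HDFS path of the workflow the coordinator launches
workflowPath=${nameNode}/user/${hdfsUser}/examples/workflow
outputDir=/user/hadoop/output
startTime=2024-01-01T00:00Z
endTime=2024-12-31T00:00Z
```

Any property defined here can be referenced inside the coordinator XML with the `${...}` EL syntax.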

8. Oozie Best Practices

8.1 Performance Tuning

  • Tune the thread pool size: size the Oozie thread pools to match cluster scale and load
  • Optimize database connections: tune the connection pool size and timeouts
  • Set concurrency sensibly: set coordinator concurrency parameters according to available cluster resources
  • Optimize workflow design: avoid overly complex workflows and decompose large ones sensibly
  • Use incremental processing: for large data volumes, consider an incremental approach

8.2 Reliability

  • Enable checkpoints: for long-running workflows, enable checkpointing
  • Set reasonable timeouts: give each action a sensible timeout
  • Implement error handling: add error-handling nodes to workflows to cover exceptional cases
  • Back up the database regularly: back up Oozie's metadata database to prevent data loss
  • Monitor workflow status: use the Oozie web console or third-party monitoring tools
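A common error-handling pattern routes an action's error transition through an email action before killing the workflow. A sketch, with the address and node names invented for the example:

```xml
<action name="notify-failure">
    <email xmlns="uri:oozie:email-action:0.2">
        <to>oncall@example.com</to>
        <subject>Workflow ${wf:id()} failed</subject>
        <body>Failed node: ${wf:lastErrorNode()}, error: ${wf:errorMessage(wf:lastErrorNode())}</body>
    </email>
    <!-- whether the mail succeeds or not, end in the kill node -->
    <ok to="fail"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
```

The email action only works if SMTP settings such as oozie.email.smtp.host and oozie.email.smtp.port are configured in oozie-site.xml.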

8.3 Security Considerations

  • Enable Kerberos authentication: for environments with high security requirements
  • Restrict access: use Oozie access control lists (ACLs) to limit users' access to workflows
  • Encrypt transport: enable SSL/TLS to encrypt communication between the Oozie client and server
  • Audit logging: enable audit logs to record user operations

8.4 Troubleshooting Common Issues

  • Workflow stuck: check job status and logs; look for resource shortages or job errors
  • Database connection failures: check the database's status and the connection configuration; confirm the database is running
  • Permission problems: check the Oozie and Hadoop user permission configuration; confirm the user may run the job
  • Version incompatibility: make sure the Oozie and Hadoop versions are compatible
  • Out of memory: increase the Oozie server's heap size

Hands-on Exercises

Exercise 1: Installing and Configuring Oozie

Install and configure Oozie, and verify that the Oozie service starts and runs correctly.

Exercise 2: Creating a Simple Workflow

Create a simple Oozie workflow that runs a WordCount job.

Exercise 3: A Hive Workflow

Create an Oozie workflow that runs a Hive query job.

Exercise 4: Coordinator Scheduling

Create an Oozie coordinator that runs a workflow on a schedule.

Exercise 5: Parallel Job Execution

Use Fork/Join nodes to create a workflow that runs several jobs in parallel.

Exercise 6: Error Handling

Add error-handling nodes to a workflow to handle job failures.

9. Summary

This tutorial covered the basic concepts, architecture, and usage of the Oozie workflow scheduling system. As an important component of the Hadoop ecosystem, Oozie provides reliable, flexible workflow scheduling and can manage complex Hadoop job pipelines.

Oozie's core features include workflow management, coordinator scheduling, and bundle management, with support for many Hadoop job types as well as custom job types. With well-designed and well-configured workflows, Oozie can implement complex ETL pipelines, data analysis pipelines, and scheduled data processing tasks.

When using Oozie, pay attention to best practices around performance, reliability, and security to keep the system running stably and efficiently.