Introduction to Hadoop and Its Ecosystem

1. Origins and Development of Hadoop

Hadoop is an open-source distributed computing framework for processing large-scale datasets. Its origins trace back to three papers that Google published between 2003 and 2006, which laid Hadoop's technical foundation.

1.1 Technical Background

2003: Google published "The Google File System", which introduced the design of a distributed file system
2004: Google published "MapReduce: Simplified Data Processing on Large Clusters", which proposed a distributed computing model
2005: Doug Cutting and Mike Cafarella implemented the Hadoop prototype based on Google's papers
2006: Google published "Bigtable: A Distributed Storage System for Structured Data", which described a distributed database system
2006: Hadoop became a separate Apache Software Foundation subproject; it was promoted to a top-level Apache project in 2008

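1.2 Where the Name Hadoop Comes From

The name Hadoop comes from a yellow toy elephant that belonged to Doug Cutting's son. The name is simple, easy to remember, and has no special meaning, which fits the naming habits of open-source projects.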

2. Hadoop Core Components

The Hadoop ecosystem contains many components. The three most fundamental ones are:

2.1 HDFS (Hadoop Distributed File System)

HDFS is Hadoop's distributed file system, used to store large-scale datasets. It has the following characteristics:

High reliability: data is replicated across multiple nodes, so the failure of individual nodes does not cause data loss
High scalability: storage capacity can be expanded easily by adding new nodes
High throughput: suited to batch processing of large-scale data rather than low-latency interactive access
Suited to large datasets: supports storing terabytes or even petabytes of data
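
To make the replication idea concrete, the following minimal sketch splits a file into blocks, places copies of each block on several nodes, and checks whether the file is still readable after some nodes fail. The 128 MB block size and replication factor of 3 are assumed values (common HDFS defaults, not stated in this tutorial), the node names are made up, and the round-robin placement is a simplification: real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE_MB = 128   # assumed block size (a common HDFS default)
REPLICATION = 3       # assumed replication factor (a common HDFS default)


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This is an illustrative simplification, not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block has at least one surviving replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
num_blocks = split_into_blocks(1000)           # a 1000 MB file
placement = place_replicas(num_blocks, nodes)

print(num_blocks)                                                      # 8
print(readable_after_failure(placement, {"node2"}))                    # True
print(readable_after_failure(placement, {"node1", "node2", "node3"}))  # False
```

With three replicas, losing one node never makes a block unavailable; the file only becomes unreadable when every node holding some block is down at once.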

2.2 MapReduce

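MapReduce is Hadoop's distributed computing framework, used to process large-scale datasets. Its core idea is to break a computation into two phases:

Map phase: the input data is split into many pieces, which are processed in parallel by many Map tasks to produce intermediate results
Reduce phase: the intermediate results produced by the Map phase are merged to produce the final output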
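
The two phases can be sketched with a word count in plain Python. This is only an in-memory illustration of the model, not the Hadoop API: the function names are hypothetical, the Map tasks run sequentially here rather than in parallel, and the `shuffle` step (grouping intermediate values by key between the two phases) is something the framework would normally do for you.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key, between the Map and Reduce phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce task: merge the values collected for each key into one total."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for one piece of the split input data.
chunks = ["hadoop stores data", "hadoop processes data", "yarn schedules jobs"]

mapped = [map_phase(chunk) for chunk in chunks]   # parallel on a real cluster
word_counts = reduce_phase(shuffle(mapped))

print(word_counts["hadoop"], word_counts["data"], word_counts["yarn"])  # 2 2 1
```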

2.3 YARN (Yet Another Resource Negotiator)

YARN is Hadoop's resource management system, responsible for managing cluster resources and scheduling jobs. It has the following characteristics:

Resource management: centrally manages cluster resources such as CPU and memory
Job scheduling: allocates resources to jobs according to their resource requirements and priority
Extensibility: supports multiple computing frameworks, such as MapReduce, Spark, and Flink
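
The scheduling idea above (grant resources according to each job's requirements and priority) can be illustrated with a toy allocator. This is a deliberately simplified sketch and not how YARN's actual schedulers work: the `Job` class, the `schedule` function, the job names, and the rule that a higher number means higher priority are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int    # assumption: a higher number is scheduled first
    vcores: int      # CPU cores requested
    memory_mb: int   # memory requested


def schedule(jobs, total_vcores, total_memory_mb):
    """Grant resources to jobs in priority order while capacity remains."""
    free_vcores, free_memory = total_vcores, total_memory_mb
    granted, waiting = [], []
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.vcores <= free_vcores and job.memory_mb <= free_memory:
            free_vcores -= job.vcores
            free_memory -= job.memory_mb
            granted.append(job.name)
        else:
            waiting.append(job.name)   # must wait until resources are freed
    return granted, waiting


jobs = [
    Job("mapreduce-etl", priority=1, vcores=4, memory_mb=8192),
    Job("spark-report", priority=3, vcores=8, memory_mb=16384),
    Job("flink-stream", priority=2, vcores=6, memory_mb=12288),
]
granted, waiting = schedule(jobs, total_vcores=16, total_memory_mb=32768)

print(granted)  # ['spark-report', 'flink-stream']
print(waiting)  # ['mapreduce-etl']
```

Note that the three jobs come from different frameworks but compete for the same pool of CPU and memory, which is the point of a shared resource manager.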

3. The Hadoop Ecosystem

The Hadoop ecosystem contains many tools and frameworks that handle different aspects of big data:

3.1 Data Storage Layer

HDFS: distributed file system
HBase: distributed column-oriented database, suited to real-time random access
Hive Metastore: metadata store that manages the schema information of Hive tables

3.2 Data Processing Layer

MapReduce: batch processing framework
Hive: data warehouse tool, originally built on MapReduce, that supports SQL queries
Pig: high-level data flow language that simplifies MapReduce programming
Spark: fast, general-purpose big data processing engine that supports batch processing, stream processing, and machine learning
Flink: stream processing framework that supports low-latency, real-time data processing

3.3 Data Management and Scheduling Layer

YARN: resource management and job scheduling
ZooKeeper: distributed coordination service, used for managing configuration information, naming services, and similar tasks
Oozie: workflow scheduling tool that coordinates multiple Hadoop jobs
Sqoop: transfers data between Hadoop and relational databases
Flume: distributed log collection system that collects, aggregates, and moves large volumes of log data

3.4 Data Analysis and Machine Learning Layer

Mahout: distributed machine learning library
Spark MLlib: Spark's machine learning library
Impala: real-time SQL query engine for interactive data analysis
Phoenix: SQL query engine built on HBase

4. Hadoop Use Cases

Hadoop is widely used across many fields, including:

4.1 Log Analysis

Enterprises can use Hadoop to store and analyze large volumes of log data and extract valuable information from it, for example for user behavior analysis and system performance monitoring.
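
As a small illustration of user behavior analysis on logs, the sketch below counts events per user in the style of a Hadoop Streaming job, where the mapper emits tab-separated key/value lines and the reducer receives them sorted by key. The log format (timestamp, user, path), the sample lines, and the function names are hypothetical; on a real cluster the two functions would be separate scripts reading standard input, and the framework would perform the sort.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'user<TAB>1' for every well-formed log line; skip malformed ones."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:          # assumed format: timestamp user path
            yield f"{fields[1]}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for user, group in groupby(pairs, key=lambda kv: kv[0]):
        yield user, sum(int(count) for _, count in group)


log_lines = [
    "2024-01-01T10:00:00 alice /home",
    "2024-01-01T10:00:05 bob /search",
    "2024-01-01T10:00:09 alice /cart",
    "corrupted-line",
]

mapped = sorted(mapper(log_lines))   # stands in for the framework's sort step
events_per_user = dict(reducer(mapped))

print(events_per_user)  # {'alice': 2, 'bob': 1}
```

Because the reducer only ever looks at one run of identical keys at a time, it can process inputs far larger than memory, which is what makes this pattern suitable for large log volumes.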

4.2 Data Warehousing

Hadoop can serve as the underlying storage layer of a data warehouse, holding large-scale historical data and supporting complex analytical queries.

4.3 Machine Learning

Hadoop can be used to train large-scale machine learning models and to process massive amounts of training data.

4.4 Recommendation Systems

A Hadoop-based recommendation system can analyze users' historical behavior data to provide personalized recommendations.

4.5 Scientific Computing

In astronomy, biology, meteorology, and other fields, Hadoop can be used to process and analyze large-scale scientific data.

5. Hadoop's Strengths and Challenges

5.1 Strengths

Free and open source: released under the Apache License, so there are no expensive software license fees
High reliability: data replication and fault tolerance mechanisms keep data safe and available
High scalability: the cluster can be expanded easily to handle ever-growing data volumes
Flexible data model: supports structured, semi-structured, and unstructured data
Mature ecosystem: a rich set of tools and frameworks covers every aspect of big data processing

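5.2 Challenges

Steep learning curve: many components and tools have to be mastered, so the cost of learning is high
Complex performance tuning: many parameters have to be adjusted for each specific use case to get the best performance
Limited real-time processing: traditional Hadoop is suited to batch processing, and its real-time processing capability is comparatively weak
High resource consumption: requires a large amount of hardware, including servers, storage, and network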

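Practice Exercises

Exercise 1: Learn about Hadoop versions

Visit the official Apache Hadoop website and find out what the latest Hadoop version is and what its main features are.

Exercise 2: Draw a Hadoop ecosystem diagram

Based on the content of this tutorial, draw a diagram of the Hadoop ecosystem that includes the core components and the commonly used tools.

Exercise 3: Analyze Hadoop use cases

Pick an industry you are familiar with and analyze the potential use cases for Hadoop in that industry.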

6. Summary

This tutorial introduced Hadoop's basic concepts, historical background, core components, and ecosystem. As a foundational framework for big data processing, Hadoop offers high reliability, high scalability, and high throughput, and it is widely used across many fields.

Hadoop's core components are HDFS, MapReduce, and YARN, which are responsible for distributed storage, distributed computing, and resource management respectively. In addition, the Hadoop ecosystem contains many tools and frameworks that handle different aspects of big data.

Although Hadoop faces challenges such as a steep learning curve and complex performance tuning, it remains an important technology in big data processing and provides a reliable solution for handling large-scale data.
