HDFS Core Concepts and Architecture
1. HDFS Design Goals
HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop ecosystem, designed to store and process large-scale data sets. Its design goals include:
- High reliability: data replication ensures that data is not lost when a node fails
- High throughput: high-throughput data access takes priority over low latency
- Support for large data sets: able to store and process data at the terabyte and even petabyte scale
- Simple consistency model: a write-once, read-many (WORM) consistency model
- Cross-platform compatibility: runs on a wide range of hardware and operating systems
2. HDFS Core Concepts
2.1 Data Block (Block)
The data block is the basic unit of storage in HDFS. Files are split into fixed-size blocks for storage. By default, the block size is 128 MB (adjustable through configuration).
Why are HDFS blocks so large?
HDFS uses large blocks mainly to reduce seek overhead. With a large block, the time spent transferring the data is much longer than the time spent seeking to the start of the block, which improves system throughput.
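The seek-overhead argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a 10 ms seek time and a 100 MB/s sustained transfer rate; both figures are illustrative assumptions, not measurements from any cluster, and the function name is made up for this example.

```shell
#!/bin/sh
# Illustrative figures only (assumptions, not measured values).
SEEK_MS=10       # assumed average seek time, in milliseconds
RATE_MB_S=100    # assumed sustained transfer rate, in MB/s

# seek_overhead_pct BLOCK_MB
# Prints seek time as a percentage of the total time to read one block.
seek_overhead_pct() {
  awk -v b="$1" -v s="$SEEK_MS" -v r="$RATE_MB_S" \
    'BEGIN { t = b / r * 1000; printf "%.2f\n", s / (s + t) * 100 }'
}

# Compare a few block sizes, from small file-system-style blocks to the HDFS default.
for block_mb in 1 4 64 128; do
  echo "block=${block_mb}MB seek overhead=$(seek_overhead_pct "$block_mb")%"
done
```

Under these assumptions, a 1 MB block spends half of its read time seeking, while a 128 MB block spends under 1%.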
2.2 Replication Factor
The replication factor is the number of replicas kept for each data block in HDFS. By default it is 3, meaning each block is stored on 3 different nodes.
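Replication multiplies the raw capacity a file consumes. The sketch below shows that arithmetic with a small helper (the function name is made up for this example), followed by a commented `hdfs dfs -setrep` call that changes the factor for an existing path on a running cluster.

```shell
#!/bin/sh
# raw_storage_mb FILE_MB REPLICATION
# Prints the raw cluster capacity, in MB, consumed by one file.
raw_storage_mb() {
  echo $(( $1 * $2 ))
}

raw_storage_mb 1024 3   # a 1 GB file at the default factor of 3 occupies 3072 MB
raw_storage_mb 1024 2   # the same file at factor 2 occupies 2048 MB

# Changing the factor of an existing file (requires a running cluster);
# -w waits until the new replication level has been reached:
#   hdfs dfs -setrep -w 2 /user/hadoop/test/file.txt
```

The cluster-wide default comes from the `dfs.replication` property in `hdfs-site.xml`; `-setrep` only affects the paths it is given.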
2.3 Rack Awareness
HDFS is rack-aware: it distributes the replicas of a block across different racks. The benefits of this design are:
- Better data reliability: data remains available even if an entire rack fails
- Faster data access: data can be read from a node in the same rack, reducing network transfer
- Balanced network load: cross-rack transfers consume more network bandwidth, and rack awareness reduces how often they are needed
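HDFS does not discover racks by itself; one common way to supply the mapping is a topology script named by the `net.topology.script.file.name` property. Hadoop invokes the script with one or more addresses and reads one rack path per address from standard output. The following is a minimal sketch; the subnets and rack names are hypothetical and would need to match a real network layout.

```shell
#!/bin/sh
# rack_for ADDRESS
# Hypothetical mapping: the subnets and rack names are made up for illustration.
rack_for() {
  case "$1" in
    10.0.1.*) echo "/dc1/rack1" ;;
    10.0.2.*) echo "/dc1/rack2" ;;
    *)        echo "/default-rack" ;;   # fallback for unknown addresses
  esac
}

# Hadoop may pass several addresses in one call; print one rack per argument.
for host in "$@"; do
  rack_for "$host"
done
```

Saved as an executable file and called with `10.0.1.5 10.0.2.17`, the script prints `/dc1/rack1` and `/dc1/rack2` on separate lines.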
3. HDFS Architecture Design
HDFS uses a master-slave architecture made up of the following components:
3.1 NameNode
The NameNode is the master node of HDFS. It manages the file system namespace and client access to files. Its main functions include:
- Namespace management: maintains the directory tree of the file system
- Metadata management: stores file metadata, including file name, permissions, size, and creation time
- Block mapping: records how the blocks of each file are distributed
- Client request handling: handles file access requests from clients
- DataNode management: monitors DataNode status and handles DataNode registration, heartbeats, and block reports
3.2 DataNode
A DataNode is a slave (worker) node of HDFS. It stores the actual data blocks and serves client data access requests. Its main functions include:
- Block storage: stores the data blocks that make up files
- Block management: creates, deletes, and replicates data blocks
- Heartbeats: periodically sends heartbeat messages to the NameNode to report node status
- Block reports: periodically sends the NameNode a block report listing the blocks stored on the node
- Data transfer: handles client read and write requests
3.3 Secondary NameNode
The Secondary NameNode is a helper node for the NameNode. It assists the NameNode with metadata management and improves system reliability. Its main functions include:
- Merging the edit log: periodically fetches the edit log (EditLog) from the NameNode and merges it into the namespace image (FSImage)
- Creating checkpoints: periodically creates checkpoints of the namespace image, reducing NameNode startup time
- Assisting recovery: when the NameNode fails, the Secondary NameNode's checkpoint data can be used to help restore it
Note
The Secondary NameNode is not a hot standby for the NameNode and cannot take over serving requests directly. When the NameNode fails, the Secondary NameNode's checkpoint data must be copied manually to a new NameNode.
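4. How HDFS Works
4.1 File Write Flow
A client writes a file to HDFS as follows:
- The client sends a create-file request to the NameNode
- The NameNode checks whether the file already exists and whether the client has permission to create it
- If the checks pass, the NameNode returns a success response and allows the client to create the file
- The client splits the file into data blocks and asks the NameNode where to store each block
- Following the rack-aware placement policy, the NameNode assigns each block one DataNode per replica (3 by default)
- The client writes the block directly to the first DataNode
- The first DataNode forwards the block to the second DataNode
- The second DataNode forwards the block to the third DataNode
- Acknowledgements travel back along the pipeline in reverse order, and the first DataNode confirms the write to the client
- The client sends a completion request to the NameNode, and the NameNode updates its metadata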
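The write pipeline is easier to follow when the direction of each message is spelled out: data flows from the client down the chain of DataNodes, and acknowledgements flow back up the same chain in reverse. The sketch below is a toy trace of that ordering, not HDFS code; the function name and the node names `dn1` to `dn3` are made up for illustration.

```shell
#!/bin/sh
# write_pipeline DATANODE...
# Prints the order in which data is forwarded and acknowledged along the pipeline.
write_pipeline() {
  prev="client"
  acks=""
  for node in "$@"; do
    echo "data: $prev -> $node"
    # Prepend each acknowledgement so the last node in the chain acknowledges first.
    acks="ack:  $node -> $prev
$acks"
    prev="$node"
  done
  printf '%s' "$acks"
}

write_pipeline dn1 dn2 dn3
```

The trace shows three `data:` lines ending at `dn3`, followed by three `ack:` lines that start at `dn3` and end with `dn1 -> client`.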
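4.2 File Read Flow
A client reads a file from HDFS as follows:
- The client sends a read request for the file to the NameNode
- The NameNode checks whether the file exists and whether the client has permission to read it
- If the checks pass, the NameNode returns the block locations of the file
- Using the block locations, the client reads each block directly from the nearest DataNode
- The client assembles the blocks it has read into the complete file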
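"Nearest" in the read flow refers to network distance. A simplified model, assuming a two-level topology in which every location is written as `/rack/node`, scores the same node as 0, the same rack as 2, and a different rack as 4. The sketch below applies that model to pick a replica; the function names and locations are hypothetical.

```shell
#!/bin/sh
# distance A B
# Simplified network distance between two locations written as /rack/node.
distance() {
  if [ "$1" = "$2" ]; then
    echo 0                               # same node
  elif [ "${1%/*}" = "${2%/*}" ]; then
    echo 2                               # same rack, different node
  else
    echo 4                               # different rack
  fi
}

# nearest_replica READER REPLICA...
# Prints the replica location with the smallest distance to the reader.
nearest_replica() {
  reader=$1; shift
  best=""; best_d=""
  for replica in "$@"; do
    d=$(distance "$reader" "$replica")
    if [ -z "$best_d" ] || [ "$d" -lt "$best_d" ]; then
      best=$replica; best_d=$d
    fi
  done
  echo "$best"
}

# A reader on /rack1/n1 prefers the replica in its own rack.
nearest_replica /rack1/n1 /rack2/n5 /rack1/n3 /rack2/n6
```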
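4.3 Block Replica Placement Policy
HDFS places block replicas as follows:
- First replica: stored on the node where the client is running (if the client is outside the cluster, a node is chosen at random)
- Second replica: stored on a node in a different rack from the first replica
- Third replica: stored on a different node in the same rack as the second replica
- Additional replicas: distributed at random across the nodes of the cluster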
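The placement rules for the first three replicas can be expressed as a small check. The sketch below assumes locations are written as `/rack/node` and tests whether three replica locations are consistent with the policy described above; it is an illustration of the rules, not the NameNode's placement code, and the function names are made up.

```shell
#!/bin/sh
# rack_of LOCATION
# Prints the rack part of a location written as /rack/node.
rack_of() {
  echo "${1%/*}"
}

# follows_default_policy FIRST SECOND THIRD
# Succeeds when the second replica is on a different rack from the first,
# and the third is on the same rack as the second but on a different node.
follows_default_policy() {
  r1=$(rack_of "$1"); r2=$(rack_of "$2"); r3=$(rack_of "$3")
  [ "$r1" != "$r2" ] && [ "$r2" = "$r3" ] && [ "$2" != "$3" ]
}

if follows_default_policy /rack1/n1 /rack2/n4 /rack2/n5; then
  echo "placement matches the policy"
fi
```

With this layout the block survives the loss of either rack, while only one of the two replica transfers has to cross racks.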
5. HDFS Command-Line Operations
HDFS provides a rich set of command-line tools for managing and operating on the file system. Some commonly used HDFS commands follow:
5.1 Basic File System Operations
# Create a directory
$ hdfs dfs -mkdir /user/hadoop/test
# List directory contents
$ hdfs dfs -ls /user/hadoop
# Upload a file
$ hdfs dfs -put localfile /user/hadoop/test/
# Download a file
$ hdfs dfs -get /user/hadoop/test/remotefile localfile
# Show file contents
$ hdfs dfs -cat /user/hadoop/test/file.txt
# Delete a file
$ hdfs dfs -rm /user/hadoop/test/file.txt
# Delete a directory (recursively)
$ hdfs dfs -rm -r /user/hadoop/test
# Copy a file
$ hdfs dfs -cp /user/hadoop/source /user/hadoop/destination
# Move a file
$ hdfs dfs -mv /user/hadoop/source /user/hadoop/destination
# Show file size
$ hdfs dfs -du -h /user/hadoop/test/file.txt
5.2 Cluster Management Operations
# Check HDFS health status
$ hdfs dfsadmin -report
# Enter safe mode
$ hdfs dfsadmin -safemode enter
# Leave safe mode
$ hdfs dfsadmin -safemode leave
# Check safe mode status
$ hdfs dfsadmin -safemode get
# Balance block distribution
$ hdfs balancer
# View the NameNode log
$ tail -f $HADOOP_HOME/logs/hadoop-hadoop-namenode-*.log
6. HDFS Advanced Features
6.1 Snapshot
An HDFS snapshot is a read-only copy of the file system at a point in time, used for data backup and recovery. Snapshots can:
- Protect data from user mistakes
- Serve as a basis for data backup and recovery
- Isolate data for test and development environments
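On the command line, snapshots are enabled per directory with `hdfs dfsadmin -allowSnapshot`, created with `hdfs dfs -createSnapshot`, and exposed read-only under a `.snapshot` subdirectory. The sketch below wraps those steps in a helper; the function names and the snapshot name in the example call are hypothetical, and the helper needs a running cluster and administrator rights.

```shell
#!/bin/sh
# snapshot_path DIR NAME
# Prints the path under which HDFS exposes a snapshot's read-only contents.
snapshot_path() {
  echo "$1/.snapshot/$2"
}

# protect_dir DIR NAME
# Enables snapshots on DIR, takes a snapshot called NAME, and lists its contents.
protect_dir() {
  hdfs dfsadmin -allowSnapshot "$1" &&
  hdfs dfs -createSnapshot "$1" "$2" &&
  hdfs dfs -ls "$(snapshot_path "$1" "$2")"
}

# Example call, with a hypothetical snapshot name:
#   protect_dir /user/hadoop/test before-cleanup
```

A file deleted by mistake can then be copied back out of the snapshot path with `hdfs dfs -cp`.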
6.2 Federation
HDFS Federation allows multiple NameNodes to run in one cluster, with each NameNode managing a portion of the file system namespace. Its main advantages include:
- Better namespace scalability
- Higher cluster throughput
- Namespace isolation between applications
6.3 High Availability
HDFS high availability is implemented by running two NameNodes (Active and Standby). When the Active NameNode fails, the Standby NameNode can switch to the Active state automatically, provided automatic failover is configured, keeping the system available.
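In an HA deployment, `hdfs haadmin -getServiceState` reports whether a given NameNode is currently `active` or `standby`. The sketch below uses it to find the active NameNode; the function name and the IDs `nn1` and `nn2` are assumptions, since the real IDs are whatever the cluster lists under `dfs.ha.namenodes.<nameservice>`.

```shell
#!/bin/sh
# find_active NAMENODE_ID...
# Prints the first NameNode ID whose HA state is "active"; fails if none is.
find_active() {
  for id in "$@"; do
    state=$(hdfs haadmin -getServiceState "$id" 2>/dev/null)
    if [ "$state" = "active" ]; then
      echo "$id"
      return 0
    fi
  done
  return 1
}

# Example call on an HA cluster, assuming the IDs nn1 and nn2:
#   find_active nn1 nn2
```

A non-zero exit status means no listed NameNode reported itself as active, which is worth alerting on.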
Hands-On Exercises
Exercise 1: HDFS Command-Line Operations
Use the HDFS command-line tools to complete the following:
- Create a directory named "practice"
- Upload a local file to that directory
- View the contents of the file
- View the size and block distribution of the file
- Download the file to another local location
- Delete the file and the directory
Exercise 2: HDFS Cluster Management
Use the HDFS administration commands to complete the following:
- Check the health status of the HDFS cluster
- View the NameNode log file
- Check the HDFS safe mode status
Exercise 3: Analyzing HDFS Block Distribution
Upload a file larger than 128 MB to HDFS, then inspect its block distribution and analyze how HDFS splits a large file into multiple blocks.
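Before running the exercise, it helps to predict the result. The sketch below computes how many blocks a file should occupy at the 128 MB default and how large the final block should be; the function names and the file path in the commented `hdfs fsck` call are made up for illustration.

```shell
#!/bin/sh
BLOCK_MB=128   # default block size, as stated in section 2.1

# block_count FILE_MB
# Prints the number of blocks for a file of FILE_MB megabytes (ceiling division).
block_count() {
  echo $(( ($1 + BLOCK_MB - 1) / BLOCK_MB ))
}

# last_block_mb FILE_MB
# Prints the size of the final block; only a full multiple fills it completely.
last_block_mb() {
  rest=$(( $1 % BLOCK_MB ))
  if [ "$rest" -eq 0 ]; then echo "$BLOCK_MB"; else echo "$rest"; fi
}

block_count 300     # prints 3
last_block_mb 300   # prints 44 (300 = 128 + 128 + 44)

# Compare with what the cluster reports (hypothetical file name):
#   hdfs fsck /user/hadoop/practice/bigfile -files -blocks -locations
```

The final block occupies only as much disk space as it holds, so a 300 MB file does not consume 384 MB per replica.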
7. Summary
This tutorial covered the core concepts, architecture, and working principles of HDFS. HDFS uses a master-slave architecture composed of the NameNode, DataNodes, and the Secondary NameNode, and it offers high reliability, high throughput, and support for large data sets.
The design goal of HDFS is to provide reliable storage for large-scale data processing applications. It ensures data reliability through replication, and it improves data access speed and fault tolerance through its rack-aware policy.
Understanding the core concepts and working principles of HDFS is essential for learning the Hadoop ecosystem; it is the foundation for later study of MapReduce, YARN, and other Hadoop components.