Technical Overview
Apache Hive is a data warehouse tool built on Hadoop. It provides a SQL-like query language (HiveQL) that lets users query and analyze large-scale data stored in HDFS or other distributed storage systems using SQL syntax. Hive converts SQL queries into MapReduce, Tez, or Spark jobs that run on a distributed cluster.
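To make the query-to-job translation more concrete, here is a minimal Python sketch of how a query such as SELECT status, COUNT(*) FROM logs GROUP BY status could be broken into map, shuffle, and reduce phases. The table, its rows, and the function names are assumptions made up for illustration; this is not Hive's actual planner or generated code.

```python
from collections import defaultdict

# Hypothetical rows of a table logs(url, status); the data is invented for this sketch.
rows = [
    ("/index", 200),
    ("/login", 200),
    ("/missing", 404),
    ("/index", 200),
]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for _url, status in rows:
        yield status, 1

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, which implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)  # {200: 3, 404: 1}
```

On a real cluster the map and reduce steps run in parallel on different nodes; the sketch only shows the data flow between the phases.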
Hive was originally developed at Facebook, open-sourced in 2008, and became an Apache top-level project in 2010. Today it is one of the most important data warehouse tools in the big data ecosystem and is widely used in data analysis, data warehouse construction, business intelligence, and related fields.
Hive Architecture
Hive uses a layered architecture consisting mainly of the following components:
User Interface Layer
Includes the CLI (command-line interface), HiveServer2 (which supports JDBC/ODBC), the Web UI, and the Thrift service. Users interact with Hive through these.
Driver Layer
Includes the JDBC driver, ODBC driver, and Thrift client, which are used to connect to the Hive server.
Core Service Layer
Includes the query parser, query optimizer, and execution plan generator, which together convert HiveQL into executable jobs.
Metadata Store
Uses a relational database (such as MySQL or PostgreSQL) to store metadata such as table schemas, partition information, and column types.
Execution Engine
Supports several execution engines, including MapReduce, Tez, and Spark, which run the generated jobs.
Storage Layer
Uses HDFS or another distributed storage system (such as S3 or HBase) as the underlying data store.
Core Features
SQL-like Syntax
Provides HiveQL, a SQL-like language that lowers the learning curve so users who already know SQL can get started quickly.
Scalability
Built on the Hadoop ecosystem, Hive scales horizontally and can process petabyte-scale data.
Flexibility
Supports multiple storage formats (such as TextFile, Parquet, and ORC) and compression formats (such as Snappy and Gzip).
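The choice of storage format matters because TextFile is row-oriented while Parquet and ORC are columnar. The following Python sketch illustrates the general idea only: pivoting rows into columns lets an aggregate read a single column. The sample data and the to_columnar helper are hypothetical and do not reflect the actual on-disk layout of any format.

```python
# Hypothetical sample rows; a row-oriented file stores all fields of a row together.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 5.5},
]

def to_columnar(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columnar(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])
print(total_amount)  # 35.5
```

Storing similar values next to each other is also one reason columnar formats tend to compress well.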
Data Warehouse Functions
Supports data warehouse features such as data partitioning, bucketing, and indexing, which simplify data management and query optimization. (Index support was removed in Hive 3.0.)
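As a rough illustration of partitioning and bucketing, the sketch below builds a Hive-style partition directory (one key=value directory per partition column) and assigns a value to a bucket by hashing it modulo the bucket count. The warehouse path, the column names, and the use of CRC32 as the hash are assumptions for this example; Hive uses its own hash function.

```python
import zlib

def partition_path(table_dir, partition_values):
    """Build a partition directory such as <table>/dt=2024-01-01/country=US."""
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_dir.rstrip("/")] + parts)

def bucket_for(value, num_buckets):
    """Pick a bucket by hashing the value. CRC32 is a stand-in, not Hive's hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Hypothetical table location and partition columns.
path = partition_path("/warehouse/sales", {"dt": "2024-01-01", "country": "US"})
print(path)  # /warehouse/sales/dt=2024-01-01/country=US

# The same value always lands in the same bucket.
bucket = bucket_for("user42", 8)
```

A query that filters on dt only needs to read the matching directories, which is why partitioning helps with query optimization.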
Extensibility
Supports UDFs (user-defined functions), UDAFs (user-defined aggregate functions), and UDTFs (user-defined table-generating functions).
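The three kinds of function differ in how many rows go in and come out. The Python sketch below shows only those input/output shapes; real Hive functions are typically implemented as Java classes, and the names udf_upper, udaf_sum, and udtf_split are hypothetical.

```python
def udf_upper(value):
    """UDF shape: one value in, one value out."""
    return value.upper()

def udaf_sum(values):
    """UDAF shape: many row values in, one aggregated value out."""
    total = 0
    for v in values:
        total += v
    return total

def udtf_split(value, sep=","):
    """UDTF shape: one value in, zero or more output rows out."""
    for item in value.split(sep):
        yield (item,)

print(udf_upper("hive"))          # HIVE
print(udaf_sum([1, 2, 3]))        # 6
print(list(udtf_split("a,b,c")))  # [('a',), ('b',), ('c',)]
```

These shapes are comparable in spirit to built-in functions such as upper, sum, and explode.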
Integration
Integrates seamlessly with other components in the Hadoop ecosystem (such as Spark, Pig, and HBase).
Application Scenarios
1. Data Warehouse Construction
Hive is an ideal tool for building enterprise data warehouses and can process and analyze large-scale structured data.
2. Data Analysis
Use HiveQL for complex data analysis, including aggregation, sorting, filtering, and joins.
3. Log Analysis
Analyze large-scale log data such as web access logs, server logs, and application logs.
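As a small-scale illustration of this kind of analysis, the following Python sketch parses a few access-log lines and counts requests per HTTP status code, which is the sort of result a GROUP BY query would produce over a log table. The sample lines, the field names, and the regular expression are assumptions for this example.

```python
import re
from collections import Counter

# Pattern for lines in a common access-log layout (an assumption for this sketch).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Invented sample input, including one malformed line.
sample_lines = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2024:10:00:01 +0000] "GET /missing HTTP/1.1" 404 0',
    '10.0.0.1 - - [01/Jan/2024:10:00:02 +0000] "POST /login HTTP/1.1" 200 128',
    'not a log line',
]

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Skip lines that fail to parse, then count requests per status code.
parsed = [rec for rec in map(parse_line, sample_lines) if rec is not None]
status_counts = Counter(rec["status"] for rec in parsed)
print(dict(status_counts))  # {'200': 2, '404': 1}
```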
4. Business Intelligence
Integrate with business intelligence tools (such as Tableau and Power BI) to generate visual reports and dashboards.
5. Data Migration
Migrate and transform data between different storage systems.
6. Machine Learning Data Preparation
Prepare and process training data for machine learning models.
Learning Path
Learning Hive requires mastering a set of related techniques and concepts. The recommended learning path is as follows:
1. Prerequisites
Learn the basics of SQL and the Hadoop ecosystem, and master the basic concepts of HDFS and MapReduce.
- Master SQL syntax and basic operations
- Understand basic Hadoop concepts (HDFS, MapReduce)
- Become familiar with Linux commands and basic operations
2. Hive Installation and Configuration
Learn how to install, configure, and use Hive at a basic level.
- Hive installation and configuration
- Using the Hive command-line tools
- HiveServer2 and JDBC/ODBC configuration
3. HiveQL Basics
Learn basic HiveQL syntax and operations, including creating tables, loading data, querying, and deleting.
- Creating and managing databases and tables
- Loading and exporting data
- Basic query operations (SELECT, WHERE, GROUP BY, ORDER BY)
- Join operations (JOIN)
4. Hive Advanced Features
Learn Hive's advanced features, including partitioning, bucketing, indexes, and views.
- Data partitioning and bucketing
- Indexes and views
- Custom functions (UDF, UDAF, UDTF)
- Storage formats and compression
5. Hive Optimization
Learn Hive query optimization techniques to improve query performance.
- How the query optimizer works
- Partitioning and bucketing optimization
- Execution plan optimization
- Handling data skew
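Data skew arises when one key carries far more rows than the others, so the task handling that key becomes a bottleneck. One common mitigation is salting: aggregate first by (key, random salt), then merge the partial results per key. The Python sketch below illustrates the idea under invented data; salted_count and its parameters are hypothetical names. Hive also exposes configuration properties aimed at skew (for example hive.groupby.skewindata), which this sketch does not model.

```python
import random
from collections import Counter

def salted_count(keys, num_salts, seed=0):
    """Two-stage count: by (key, salt) first, then merged per original key."""
    rng = random.Random(seed)
    # Stage 1: a hot key is spread over up to num_salts groups.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and add up the partial counts per key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final

# Invented input: one hot key and a few rare keys.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, final = salted_count(keys, num_salts=4)

# The final result matches a plain count, but no single stage-1 group
# has to process all rows of the hot key.
largest_hot_group = max(c for (k, _s), c in partial.items() if k == "hot")
print(final["hot"], largest_hot_group < 1000)
```

The trade-off is an extra aggregation stage in exchange for a more even distribution of work in the first stage.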
6. Data Warehouse Design
Learn data warehouse design principles and best practices, and use Hive to build a data warehouse.
- Data warehouse design principles
- Star schema and snowflake schema
- ETL process design
- Data quality management
7. Practical Application Cases
Learn how Hive is applied to data analysis and data warehouse construction through practical cases.
- Log analysis case
- Data warehouse construction case
- Business intelligence case
Start Learning Hive
Now that you understand Hive's basic concepts and the learning path, you can begin working through the Hive tutorial systematically. Our tutorial covers everything from the basics to advanced topics, including Hive installation and configuration, HiveQL basics, advanced features, query optimization, and data warehouse design.