Hive Introduction, Installation, and Configuration
1. Hive Basic Concepts
Apache Hive is a data warehouse tool built on top of Hadoop. It provides an SQL-like query language, HiveQL, that lets users query and analyze large-scale data stored in HDFS or other distributed storage systems using familiar SQL syntax.
1.1 Core Features of Hive
- SQL-like syntax: HiveQL closely resembles SQL, lowering the learning curve
- Scalability: built on Hadoop, supports horizontal scaling
- Flexibility: supports many storage formats and compression codecs
- Data warehouse features: supports partitioning, bucketing, indexing, and more
- Integration: integrates seamlessly with the Hadoop ecosystem
1.2 Differences Between Hive and Traditional Databases
| Feature | Hive | Traditional database |
|---|---|---|
| Query language | HiveQL (SQL-like) | SQL |
| Execution engine | MapReduce/Tez/Spark | Built-in execution engine |
| Latency | High (batch processing) | Low (real-time processing) |
| Data scale | PB scale | GB/TB scale |
| Update operations | Limited support | Full support |
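Because HiveQL follows SQL so closely, everyday tasks look familiar. A minimal sketch of a partitioned table and an aggregation query (the `page_views` table and its columns are hypothetical examples, not part of any standard schema):

```sql
-- Hypothetical table over data stored in HDFS
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Standard SQL-style aggregation, executed by Hive as a batch job
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2024-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```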
2. Hive Architecture
Hive uses a layered architecture with the following main components:
2.1 User Interface Layer
- CLI: command-line interface for interacting with Hive directly
- HiveServer2: supports JDBC/ODBC connections, allowing remote access
- Web UI: provides a web interface for management and monitoring
2.2 Driver Layer
- JDBC driver: Java Database Connectivity driver
- ODBC driver: Open Database Connectivity driver
- Thrift client: client based on the Thrift protocol
2.3 Core Service Layer
- Query parser: converts HiveQL into an abstract syntax tree
- Query optimizer: optimizes the query plan to improve execution efficiency
- Execution plan generator: produces an executable job plan
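The output of this parse-optimize-plan pipeline can be inspected with `EXPLAIN`, which prints the execution plan Hive generated for a statement (the table name here is a hypothetical example):

```sql
EXPLAIN SELECT COUNT(*) FROM page_views WHERE dt = '2024-01-01';
```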
2.4 Metastore
Stores Hive's metadata, including:
- Database and table structures
- Partition and bucket information
- Column types and properties
- Table storage locations and formats
2.5 Execution Engines
- MapReduce: the traditional batch-processing engine
- Tez: a DAG-based, optimized execution engine
- Spark: a fast in-memory compute engine
3. Preparing to Install Hive
3.1 Environment Requirements
- Java: JDK 1.8 or later
- Hadoop: Hadoop 2.x or 3.x
- Operating system: Linux (CentOS or Ubuntu recommended)
3.2 Installing Hadoop (Brief)
If Hadoop is not installed yet, you can follow these steps:

```bash
# Download Hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
# Extract the archive
tar -zxvf hadoop-3.3.4.tar.gz
# Configure environment variables
echo 'export HADOOP_HOME=/path/to/hadoop-3.3.4' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
```
3.3 Configuring Hadoop
The core Hadoop configuration files that need to be edited:
- core-site.xml: core Hadoop parameters
- hdfs-site.xml: HDFS parameters
- yarn-site.xml: YARN parameters
- mapred-site.xml: MapReduce parameters
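As a minimal sketch, a single-node core-site.xml often only needs the default filesystem URI; the host, port, and temp directory below are assumptions for a local setup, not values from this tutorial:

```xml
<configuration>
  <!-- URI of the default filesystem; NameNode assumed on localhost:9000 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Base directory for Hadoop temporary files (hypothetical path) -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-data</value>
  </property>
</configuration>
```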
4. Hive Installation and Configuration
4.1 Downloading Hive

```bash
# Download Hive 3.1.3
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
# Extract the archive
tar -zxvf apache-hive-3.1.3-bin.tar.gz
```
4.2 Configuring Environment Variables

```bash
echo 'export HIVE_HOME=/path/to/apache-hive-3.1.3-bin' >> ~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc
source ~/.bashrc
```
4.3 Configuring Hive's Core Files
4.3.1 Creating hive-site.xml

```bash
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
```
4.3.2 Configuring the Metastore
By default, Hive stores its metadata in an embedded Derby database, but in production environments MySQL or PostgreSQL is normally used instead.
To use MySQL as the metastore:
1. Install MySQL:

```bash
sudo yum install mysql-server mysql-devel
```

2. Create the Hive database and user:

```bash
# Log in to MySQL as root, then run the statements below
mysql -u root -p
```

```sql
CREATE DATABASE hive_meta;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON hive_meta.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
```

3. Add the connection settings to hive-site.xml:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
```

4. Download the MySQL JDBC driver and copy it into Hive's lib directory:

```bash
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz
tar -zxvf mysql-connector-java-8.0.28.tar.gz
cp mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar $HIVE_HOME/lib/
```
4.4 Initializing the Metastore Schema

```bash
schematool -dbType mysql -initSchema
```
4.5 Starting Hive

```bash
# Start the Hive command-line client
hive

# Or start the Beeline client
beeline -u jdbc:hive2://localhost:10000
```
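Once a client starts, a few smoke-test statements confirm the installation end to end (the database and table names below are arbitrary examples):

```sql
SHOW DATABASES;
CREATE DATABASE demo;
USE demo;
CREATE TABLE t (id INT, name STRING);
SHOW TABLES;
```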
5. Hive Cluster Installation
In production environments, Hive is usually deployed on a cluster. An outline of the steps:
5.1 Deployment Planning
- Master node: install the Hive service and HiveServer2
- Worker nodes: no Hive installation required
- Metastore: a standalone MySQL or PostgreSQL database
5.2 Installation Steps
1. Install Hive on the master node (as in the single-node setup above)
2. Configure the metastore (as above)
3. Configure the Hive environment variables on all nodes
4. Start HiveServer2:

```bash
hiveserver2 &
```

5. Verify the connection:

```bash
beeline -u jdbc:hive2://master:10000 -n username
```
6. Common Problems and Solutions
6.1 Metastore Initialization Fails
Problem: the schematool command fails.
Solutions:
- Check that the MySQL service is running
- Check that the MySQL username and password are correct
- Check that the MySQL driver JAR has been placed in the $HIVE_HOME/lib directory
6.2 HiveServer2 Fails to Start
Problem: HiveServer2 does not start after running the hiveserver2 command.
Solutions:
- Check that the Hadoop cluster is running normally
- Check that hive-site.xml is configured correctly
- Check the log files ($HIVE_HOME/logs) for detailed error information
6.3 Client Connection Fails
Problem: Beeline fails to connect to HiveServer2.
Solutions:
- Check that the network connection is working
- Check that HiveServer2 is running
- Check the firewall settings and make sure port 10000 is open
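The last two checks can be scripted. A minimal sketch that probes a TCP port using bash's /dev/tcp pseudo-device (host and port are parameters; HiveServer2's default port 10000 is assumed):

```shell
# Report whether a TCP port is reachable; defaults target HiveServer2.
host="${1:-localhost}"
port="${2:-10000}"
if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
  echo "port ${port} on ${host}: open"
else
  echo "port ${port} on ${host}: closed or unreachable"
fi
```

If the port reports closed while HiveServer2 is running locally, the firewall is the usual suspect.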
Hands-on Exercises
Exercise 1: Single-Node Hive Installation
- Prepare a Linux server
- Install JDK 1.8
- Install Hadoop 3.x
- Install MySQL
- Install and configure Hive
- Initialize the metastore
- Start the Hive client and run a simple query
Exercise 2: Hive Cluster Installation
- Prepare 3 Linux servers
- Set up a Hadoop cluster
- Install and configure Hive on the master node
- Configure MySQL as the metastore
- Start HiveServer2
- Connect to Hive from another node using Beeline
Exercise 3: Configuration Tuning
- Change the log path in hive-site.xml
- Set the Hive execution engine to Tez
- Configure Hive's memory usage
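As a sketch for this exercise, the following hive-site.xml properties control the query log location, the execution engine, and a common Tez memory setting. The values are illustrative assumptions, and Tez must already be installed on the cluster for `hive.execution.engine=tez` to take effect:

```xml
<!-- Illustrative overrides for hive-site.xml -->
<property>
  <name>hive.querylog.location</name>
  <value>/var/log/hive/querylogs</value>
</property>
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
<!-- Tez container size in MB (tune to your cluster) -->
<property>
  <name>hive.tez.container.size</name>
  <value>2048</value>
</property>
```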
7. Summary
This tutorial introduced Hive's basic concepts, architecture, and installation and configuration procedures, including:
- Hive's core features and how it differs from traditional databases
- Hive's layered architecture design
- Detailed steps for a single-node Hive installation
- Using MySQL as the metastore
- The basic workflow for installing Hive on a cluster
- Solutions to common problems
After working through this tutorial, you should be able to install and configure a Hive environment and start using Hive to query and analyze data.