Hive Introduction, Installation, and Configuration

Understand Hive's basic concepts, architecture, and installation and configuration methods.

1. Hive Basic Concepts

Apache Hive is a data warehouse tool built on top of Hadoop. It provides a SQL-like query language called HiveQL, which allows users to query and analyze large-scale data stored in HDFS or other distributed storage systems using familiar SQL syntax.
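To make the SQL resemblance concrete, here is a small HiveQL query submitted through the hive command-line client. This is only an illustrative sketch: it assumes a working Hive installation, and the page_views table and its columns are hypothetical, not something created in this tutorial.

```shell
# Submit an ad-hoc HiveQL query with "hive -e".
# The page_views table is hypothetical, for illustration only.
hive -e "
SELECT country, COUNT(*) AS views
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY country
ORDER BY views DESC
LIMIT 10;
"
```

Apart from running as a distributed batch job instead of against local storage, the query itself reads exactly like standard SQL.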

1.1 Core Features of Hive

  • SQL-like syntax: HiveQL closely resembles SQL, lowering the learning curve
  • Scalability: built on Hadoop, supports horizontal scaling
  • Flexibility: supports multiple storage formats and compression codecs
  • Data warehouse features: supports partitioning, bucketing, indexing, etc.
  • Integration: works seamlessly with the Hadoop ecosystem

1.2 Differences Between Hive and Traditional Databases

Feature          | Hive                    | Traditional Database
Query language   | HiveQL (SQL-like)       | SQL
Execution engine | MapReduce/Tez/Spark     | Built-in engine
Latency          | High (batch processing) | Low (real-time)
Data scale       | PB scale                | GB/TB scale
Updates          | Limited support         | Full support

2. Hive Architecture

Hive uses a layered architecture consisting mainly of the following components:

2.1 User Interface Layer

  • CLI: command-line interface for interacting with Hive directly
  • HiveServer2: supports JDBC/ODBC connections, allowing remote access
  • Web UI: provides a web interface for management and monitoring

2.2 Driver Layer

  • JDBC driver: Java database connectivity driver
  • ODBC driver: Open Database Connectivity driver
  • Thrift client: a client based on the Thrift protocol

2.3 Core Service Layer

  • Query parser: converts HiveQL into an abstract syntax tree
  • Query optimizer: optimizes the query plan to improve execution efficiency
  • Execution plan generator: produces an executable job plan
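One way to see these three components at work is Hive's EXPLAIN statement, which prints the plan that the parser, optimizer, and plan generator produce for a query without executing it. A sketch, assuming a running Hive installation and the same hypothetical page_views table:

```shell
# EXPLAIN prints the optimized execution plan (stages and operators)
# without running the query; page_views is a hypothetical table.
hive -e "EXPLAIN SELECT country, COUNT(*) FROM page_views GROUP BY country;"
```

The output lists the job stages (for example, map-side aggregation followed by a reduce) that the chosen execution engine will run.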

2.4 Metastore

Stores Hive's metadata, including:

  • Database and table structures
  • Partition and bucket information
  • Column types and properties
  • Table storage locations and formats
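Most of this metadata can be inspected from HiveQL itself, which answers these statements from the metastore rather than by scanning data files. A sketch, assuming a running installation; web_logs is a hypothetical partitioned table:

```shell
# All three statements are answered from the metastore alone.
hive -e "SHOW DATABASES;"                # registered databases
hive -e "DESCRIBE FORMATTED web_logs;"   # columns, storage location, file format
hive -e "SHOW PARTITIONS web_logs;"      # partition list (partitioned tables only)
```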

2.5 Execution Engines

  • MapReduce: the traditional batch-processing engine
  • Tez: an optimized DAG execution engine
  • Spark: a fast in-memory compute engine

3. Preparation Before Installing Hive

3.1 Environment Requirements

  • Java: JDK 1.8 or later
  • Hadoop: Hadoop 2.x or 3.x
  • Operating system: Linux (CentOS or Ubuntu recommended)

3.2 Installing Hadoop (Brief)

If Hadoop is not installed yet, you can follow these steps:

# Download Hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

# Extract the archive
tar -zxvf hadoop-3.3.4.tar.gz

# Configure environment variables
echo 'export HADOOP_HOME=/path/to/hadoop-3.3.4' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
source ~/.bashrc

3.3 Configuring Hadoop

The following core Hadoop configuration files need to be set up:

  • core-site.xml: core Hadoop parameters
  • hdfs-site.xml: HDFS parameters
  • yarn-site.xml: YARN parameters
  • mapred-site.xml: MapReduce parameters
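As a minimal illustration of the first file, a single-node core-site.xml often only needs the default filesystem address. The host and port below are common single-node defaults and should be adjusted to your environment:

```xml
<configuration>
  <!-- Where HDFS clients (including Hive) find the NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```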

4. Hive Installation and Configuration

4.1 Downloading Hive

# Download Hive 3.1.3
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

# Extract the archive
tar -zxvf apache-hive-3.1.3-bin.tar.gz

4.2 Configuring Environment Variables

echo 'export HIVE_HOME=/path/to/apache-hive-3.1.3-bin' >> ~/.bashrc
echo 'export PATH=$PATH:$HIVE_HOME/bin' >> ~/.bashrc
source ~/.bashrc

4.3 Configuring Hive Core Files

4.3.1 Create hive-site.xml
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
4.3.2 Configure the Metastore

By default, Hive uses an embedded Derby database to store its metadata. In production environments, however, MySQL or PostgreSQL is typically used instead.

Using MySQL as the metastore:

  1. Install MySQL:

     sudo yum install mysql-server mysql-devel

  2. Create the Hive database and user:

     mysql -u root -p
     CREATE DATABASE hive_meta;
     CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
     GRANT ALL PRIVILEGES ON hive_meta.* TO 'hive'@'localhost';
     FLUSH PRIVILEGES;

  3. Configure hive-site.xml:

     <property>
       <name>javax.jdo.option.ConnectionURL</name>
       <value>jdbc:mysql://localhost:3306/hive_meta?createDatabaseIfNotExist=true</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionDriverName</name>
       <value>com.mysql.cj.jdbc.Driver</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionUserName</name>
       <value>hive</value>
     </property>
     <property>
       <name>javax.jdo.option.ConnectionPassword</name>
       <value>password</value>
     </property>

  4. Download the MySQL JDBC driver (Connector/J 8.x uses the com.mysql.cj.jdbc.Driver class configured above):

     wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.28.tar.gz
     tar -zxvf mysql-connector-java-8.0.28.tar.gz
     cp mysql-connector-java-8.0.28/mysql-connector-java-8.0.28.jar $HIVE_HOME/lib/
    

4.4 Initialize the Metastore

schematool -dbType mysql -initSchema
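If initialization succeeds, you can confirm what was written by asking schematool for the recorded schema information (this requires the MySQL metastore configured above to be reachable):

```shell
# Prints the metastore connection URL and the schema version stored
# in MySQL; a version mismatch here is a common cause of startup errors.
schematool -dbType mysql -info
```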

4.5 Starting Hive

# Start the Hive command-line client
hive

# Or start the Beeline client
beeline -u jdbc:hive2://localhost:10000

5. Hive Cluster Installation

In production environments, Hive usually needs to be deployed as a cluster. The brief steps are as follows:

5.1 Deployment Planning

  • Master node: install the Hive service and HiveServer2
  • Worker nodes: no Hive installation required
  • Metastore: a standalone MySQL or PostgreSQL database

5.2 Installation Steps

  1. Install Hive on the master node (same as above)
  2. Configure the metastore (same as above)
  3. Configure the Hive environment variables on all nodes
  4. Start HiveServer2:

     hiveserver2 &

  5. Verify the connection:

     beeline -u jdbc:hive2://master:10000 -n username
    

6. Common Issues and Solutions

6.1 Metastore Initialization Fails

Problem: running the schematool command fails.

Solutions:

  • Check that the MySQL service is running
  • Check that the MySQL username and password are correct
  • Check that the MySQL driver JAR is placed in the $HIVE_HOME/lib directory

6.2 HiveServer2 Fails to Start

Problem: HiveServer2 does not start after running the hiveserver2 command.

Solutions:

  • Check that the Hadoop cluster is running normally
  • Check that the hive-site.xml configuration is correct
  • Inspect the log files ($HIVE_HOME/logs) for detailed error messages

6.3 Client Connection Fails

Problem: connecting to HiveServer2 with Beeline fails.

Solutions:

  • Check that the network connection is working
  • Check that HiveServer2 is running
  • Check the firewall settings and make sure port 10000 is open
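The last two checks can be done from the shell. The commands below are a diagnostic sketch: "master" is a placeholder for your HiveServer2 host, and 10000 is HiveServer2's default Thrift port.

```shell
# On the HiveServer2 host: is anything listening on port 10000?
# (use "netstat -tlnp" on systems without iproute2's ss)
ss -tlnp | grep 10000

# From the client machine: can we reach that host and port at all?
nc -zv master 10000
```

If the port is open but Beeline still fails, authentication settings in hive-site.xml are the next thing to check.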

Practical Exercises

Exercise 1: Single-Node Hive Installation

  1. Prepare a Linux server
  2. Install JDK 1.8
  3. Install Hadoop 3.x
  4. Install MySQL
  5. Install and configure Hive
  6. Initialize the metastore
  7. Start the Hive client and run a simple query

Exercise 2: Hive Cluster Installation

  1. Prepare three Linux servers
  2. Set up a Hadoop cluster
  3. Install and configure Hive on the master node
  4. Configure MySQL as the metastore
  5. Start HiveServer2
  6. Connect to Hive from another node using Beeline

Exercise 3: Configuration Tuning

  1. Change the log path in hive-site.xml
  2. Set Hive's execution engine to Tez
  3. Configure Hive's memory usage
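For step 2 of this exercise, switching the execution engine is a single property in hive-site.xml (Tez itself must already be installed and configured on the cluster):

```xml
<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>
```

The recognized values are mr (MapReduce), tez, and spark, matching the three engines described in section 2.5.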

7. Summary

This tutorial covered Hive's basic concepts, architecture, and installation and configuration, including:

  • Hive's core features and how it differs from traditional databases
  • Hive's layered architecture
  • Detailed steps for a single-node Hive installation
  • Using MySQL as the metastore
  • The basic workflow for a cluster installation
  • Solutions to common problems

After working through this tutorial, you should be able to install and configure a Hive environment and begin using Hive to query and analyze data.