Hive Data Warehouse Basics

A deep dive into the basic concepts, architecture, working principles, and usage of the Hive data warehouse.

1. Introduction to Hive

Hive is a data warehouse tool built on top of Hadoop for processing large-scale structured and semi-structured data. It provides an SQL-like query language, HQL (Hive Query Language), that lets users query and analyze data in HDFS using familiar SQL syntax.

1.1 Hive Design Goals

  • Simple and easy to use: an SQL-like query language lets users get started quickly
  • High scalability: leverages Hadoop's distributed computing power to support large-scale data processing
  • Flexible data model: supports structured and semi-structured data
  • Good compatibility: integrates with the existing Hadoop ecosystem

1.2 Differences Between Hive and Traditional Databases

Although Hive provides an SQL-like query language, it differs from traditional databases in the following ways:

  • Data storage: Hive stores data in HDFS; traditional databases store data in the local file system or on dedicated storage devices
  • Compute engine: Hive uses MapReduce by default and also supports Tez, Spark, etc.; traditional databases use their own compute engines
  • Query latency: Hive has high latency and suits batch processing; traditional databases have low latency and suit interactive queries
  • Transaction support: Hive 0.14+ supports transactions with limited functionality; traditional databases fully support ACID transactions
  • Data updates: Hive mainly supports appends, with limited update and delete support; traditional databases fully support insert, update, delete, and select
  • Scalability: Hive scales easily by growing the cluster; traditional databases have limited scalability and usually scale vertically

2. Hive Core Concepts

2.1 Data Model

Hive's data model consists of the following levels:

  • Database: organizes tables, similar to a database in a traditional DBMS
  • Table: made up of rows and columns, similar to a table in a traditional database
  • Partition: divides a table's data by a chosen rule to improve query efficiency
  • Bucket: further divides a table's data into smaller buckets, used for data sampling and efficient queries
  • View: a virtual table defined by a query; it stores no actual data
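
To make the hierarchy concrete, the sketch below builds the HDFS paths these levels conventionally map to. This is a simplified illustration: /user/hive/warehouse is Hive's default warehouse root (configurable via hive.metastore.warehouse.dir), and the exact layout shown here is an assumption.

```python
# Illustrative mapping from Hive's data-model levels to HDFS paths.
# /user/hive/warehouse is the default warehouse root; layout simplified.

WAREHOUSE = "/user/hive/warehouse"

def database_path(db):
    # Each database is a <db>.db directory under the warehouse root.
    return f"{WAREHOUSE}/{db}.db"

def table_path(db, table):
    # Each table is a subdirectory of its database.
    return f"{database_path(db)}/{table}"

def partition_path(db, table, **keys):
    # Each partition is a key=value subdirectory of the table.
    suffix = "/".join(f"{k}={v}" for k, v in keys.items())
    return f"{table_path(db, table)}/{suffix}"

print(partition_path("mydb", "employees", department="HR"))
# /user/hive/warehouse/mydb.db/employees/department=HR
```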

2.2 Metadata

Metadata is data that describes data. Hive's metadata includes table schemas, partition information, bucket information, and so on. Hive stores metadata in the Metastore, which can be backed by the embedded Derby database or by an external database such as MySQL.
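
As a rough picture of what the Metastore records for a single table, the sketch below models that metadata as a plain dictionary. The field names are illustrative only; the real Metastore keeps this information in a relational schema.

```python
# Illustrative model of per-table metadata kept by the Metastore.
# Field names are simplified, not the Metastore's actual schema.

employees_metadata = {
    "database": "mydb",
    "table": "employees",
    "columns": [("id", "int"), ("name", "string"),
                ("age", "int"), ("department", "string")],
    "location": "/user/hive/warehouse/mydb.db/employees",
    "storage_format": "TEXTFILE",
    "partition_keys": [],
}

# Query planning consults this record before reading any data in HDFS.
print(employees_metadata["location"])
```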

2.3 HQL (Hive Query Language)

HQL is Hive's query language. It resembles SQL but is optimized for Hadoop's distributed computing environment: HQL statements are translated into MapReduce, Tez, or Spark jobs for execution.
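
As an illustration of this translation, a query such as SELECT department, COUNT(*) FROM employees GROUP BY department conceptually becomes a map phase that emits (department, 1) pairs and a reduce phase that sums them per key. A minimal Python sketch of that idea (not Hive's actual planner):

```python
# Conceptual sketch: GROUP BY ... COUNT(*) expressed in the MapReduce
# model. Illustration only; Hive's planner is far more elaborate.

rows = [
    {"name": "Alice", "department": "HR"},
    {"name": "Bob", "department": "IT"},
    {"name": "Charlie", "department": "IT"},
]

# Map phase: emit a (department, 1) pair per row.
mapped = [(row["department"], 1) for row in rows]

# Shuffle + reduce phase: group pairs by key and sum the counts.
counts = {}
for dept, one in mapped:
    counts[dept] = counts.get(dept, 0) + one

print(counts)  # {'HR': 1, 'IT': 2}
```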

3. Hive Architecture

Hive's architecture consists of the following components:

3.1 Client

The client is the interface through which users interact with Hive, including:

  • CLI (Command Line Interface): a command-line interface for entering HQL commands directly
  • JDBC/ODBC: lets external applications connect to Hive via JDBC or ODBC
  • Web UI: a web-based user interface, such as Hue

3.2 Driver

The driver manages the execution of HQL statements, including:

  • Compiler: compiles HQL into an abstract syntax tree (AST), then converts it into a logical execution plan
  • Optimizer: optimizes the logical execution plan to produce the best plan
  • Executor: turns the optimized plan into concrete MapReduce, Tez, or Spark jobs and submits them to the Hadoop cluster for execution

3.3 Metastore

The Metastore stores Hive's metadata, including:

  • Database and table definitions
  • Table partition and bucket information
  • Table storage locations and formats
  • Table statistics

3.4 Storage and Compute

Hive stores data in HDFS and uses MapReduce, Tez, or Spark for computation.

4. Installing and Configuring Hive

4.1 Prerequisites

Before installing Hive, you need:

  • Java JDK 1.8+
  • Hadoop 2.x or 3.x
  • (Optional) MySQL or another relational database to back the Metastore

4.2 Installation Steps

# 1. Download Hive
$ wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

# 2. Extract Hive
$ tar -xzvf apache-hive-3.1.3-bin.tar.gz
$ mv apache-hive-3.1.3-bin /opt/hive

# 3. Set environment variables
$ export HIVE_HOME=/opt/hive
$ export PATH=$PATH:$HIVE_HOME/bin

# 4. Configure Hive
$ cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh
$ echo "export HADOOP_HOME=/opt/hadoop" >> $HIVE_HOME/conf/hive-env.sh

# 5. Initialize the Metastore (using embedded Derby)
$ schematool -dbType derby -initSchema

4.3 Using MySQL for the Metastore

# 1. Create the MySQL database and user
$ mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> exit;

# 2. Configure Hive to use MySQL (the distribution ships no
#    hive-site.xml.template, so create the file directly)
$ vim $HIVE_HOME/conf/hive-site.xml

# Add the following configuration

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>


# 3. Download the MySQL JDBC driver
$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.49.tar.gz
$ tar -xzvf mysql-connector-java-5.1.49.tar.gz
$ cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49-bin.jar $HIVE_HOME/lib/

# 4. Initialize the Metastore (using MySQL)
$ schematool -dbType mysql -initSchema

5. Basic Hive Operations

5.1 Database Operations

# Start the Hive CLI
$ hive

# Show all databases
hive> SHOW DATABASES;

# Create a database
hive> CREATE DATABASE mydb;

# Switch to a database
hive> USE mydb;

# Show database information
hive> DESCRIBE DATABASE mydb;

# Drop a database
hive> DROP DATABASE IF EXISTS mydb;

# Drop a database that contains tables
hive> DROP DATABASE IF EXISTS mydb CASCADE;

5.2 Table Operations

# Create a table
hive> CREATE TABLE employees (
    > id INT,
    > name STRING,
    > age INT,
    > department STRING
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

# Show all tables
hive> SHOW TABLES;

# Describe the table schema
hive> DESCRIBE employees;

# Show detailed table information
hive> DESCRIBE EXTENDED employees;

# Rename a table
hive> ALTER TABLE employees RENAME TO staff;

# Add a column
hive> ALTER TABLE staff ADD COLUMNS (salary FLOAT);

# Change a column's name and type
hive> ALTER TABLE staff CHANGE COLUMN age employee_age INT;

# Drop a table
hive> DROP TABLE IF EXISTS staff;

5.3 Loading and Exporting Data

# Create a sample data file
$ echo -e "1,Alice,30,HR\n2,Bob,28,IT\n3,Charlie,35,Finance" > employees.txt

# Load local data into a table
hive> LOAD DATA LOCAL INPATH 'employees.txt' INTO TABLE employees;

# Load data from HDFS
hive> LOAD DATA INPATH '/user/hadoop/employees.txt' INTO TABLE employees;

# Export data to the local file system
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/employees'
    > SELECT * FROM employees;

# Export data to HDFS
hive> INSERT OVERWRITE DIRECTORY '/user/hadoop/employees'
    > SELECT * FROM employees;

5.4 Query Operations

# Simple query
hive> SELECT * FROM employees;

# Conditional query
hive> SELECT name, age FROM employees WHERE department = 'IT';

# Aggregate query
hive> SELECT department, COUNT(*) as count FROM employees GROUP BY department;

# Sorted query
hive> SELECT * FROM employees ORDER BY age DESC;

# Join query
hive> CREATE TABLE departments (
    > id INT,
    > name STRING,
    > location STRING
    > )
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

hive> LOAD DATA LOCAL INPATH 'departments.txt' INTO TABLE departments;

hive> SELECT e.name, d.name, d.location
    > FROM employees e
    > JOIN departments d
    > ON e.department = d.name;

6. Hive Partitions and Buckets

6.1 Partitioned Tables

A partitioned table divides its data by a chosen rule; each partition corresponds to a directory in HDFS. Partitions improve query efficiency because a query only needs to scan the data of the relevant partitions.

# Create a partitioned table
hive> CREATE TABLE employees_partitioned (
    > id INT,
    > name STRING,
    > age INT
    > )
    > PARTITIONED BY (department STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

# Load data into partitions
hive> LOAD DATA LOCAL INPATH 'hr_employees.txt' INTO TABLE employees_partitioned PARTITION (department='HR');
hive> LOAD DATA LOCAL INPATH 'it_employees.txt' INTO TABLE employees_partitioned PARTITION (department='IT');

# Query partition data
hive> SELECT * FROM employees_partitioned WHERE department = 'HR';

# Show all partitions
hive> SHOW PARTITIONS employees_partitioned;
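
The efficiency gain above comes from partition pruning: because each partition is a directory, the predicate WHERE department = 'HR' lets Hive skip every other directory entirely. A simplified sketch of that pruning step (directory names are illustrative):

```python
# Sketch of partition pruning: only the partition directories that
# match the query predicate are scanned. Layout is illustrative.

partition_dirs = ["department=HR", "department=IT", "department=Finance"]

def prune(dirs, key, value):
    """Keep only the partition directories matching key=value."""
    return [d for d in dirs if d == f"{key}={value}"]

# WHERE department = 'HR': one directory scanned instead of three.
print(prune(partition_dirs, "department", "HR"))  # ['department=HR']
```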

6.2 Bucketed Tables

A bucketed table further divides its data into smaller buckets; each bucket corresponds to a file in HDFS. Buckets can be used for data sampling and efficient queries.

# Create a bucketed table
hive> CREATE TABLE employees_bucketed (
    > id INT,
    > name STRING,
    > age INT,
    > department STRING
    > )
    > CLUSTERED BY (id) INTO 4 BUCKETS
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;

# Enforce bucketing on insert (always on in Hive 2.0+)
hive> SET hive.enforce.bucketing = true;

# Insert data into the bucketed table
hive> INSERT INTO TABLE employees_bucketed SELECT * FROM employees;

# Data sampling
hive> SELECT * FROM employees_bucketed TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
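
Bucketing assigns each row to a bucket by hashing the clustering column modulo the bucket count, so rows with the same id always land in the same bucket file, and TABLESAMPLE(BUCKET 1 OUT OF 4 ON id) reads roughly a quarter of the data. A simplified Python sketch of the assignment (for integer keys Hive's hash is effectively the value itself; other types use different hash functions):

```python
# Sketch of bucket assignment for CLUSTERED BY (id) INTO 4 BUCKETS.
# For integer keys the hash is effectively the value itself; this is
# a simplification of Hive's general hashing.

NUM_BUCKETS = 4

def bucket_for(id_value):
    # bucket = hash(clustering column) % number of buckets
    return id_value % NUM_BUCKETS

ids = [1, 2, 3, 4, 5, 6, 7, 8]
buckets = {b: [i for i in ids if bucket_for(i) == b]
           for b in range(NUM_BUCKETS)}
print(buckets)  # {0: [4, 8], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```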

Hands-on Exercises

Exercise 1: Installing and Configuring Hive

Install and configure Hive, using MySQL to back the Metastore.

Exercise 2: Database and Table Operations

Use the Hive CLI to complete the following:

  1. Create a database named "company"
  2. In that database, create a table named "employees" with the fields id, name, age, department, and salary
  3. Insert some sample data into the table
  4. Query the employees whose salary is greater than 5000
  5. Group by department and compute each department's average salary

Exercise 3: Partitioned Tables

Create a table partitioned by department and year, and insert data into it.

Exercise 4: Bucketed Tables

Create a bucketed table and use data sampling to query its data.

7. Summary

This tutorial covered the basic concepts, architecture, and usage of the Hive data warehouse. As a data warehouse tool built on top of Hadoop, Hive provides the SQL-like query language HQL, which lets users query and analyze data in HDFS using familiar SQL syntax.

Hive's core components are the client, the driver, the Metastore, and the storage and compute layer. The client is the interface through which users interact with Hive; the driver manages the execution of HQL statements; the Metastore stores metadata; and the storage and compute layer keeps data in HDFS and computes with MapReduce, Tez, or Spark.

Hive supports databases, tables, partitions, and buckets as its data model, and data can be created, read, updated, and deleted through HQL. Partitioning and bucketing improve query efficiency, making Hive well suited to large-scale structured and semi-structured data.

Although Hive differs from traditional databases in areas such as higher query latency and limited transaction support, its high scalability, flexible data model, and good compatibility make it an important tool in big data processing.