HiveQL Basic Syntax
1. HiveQL Introduction
HiveQL (Hive Query Language) is the SQL-like query language provided by Hive. It lets users apply familiar SQL syntax to query and analyze large-scale data stored in Hadoop's distributed storage system.
1.1 Differences Between HiveQL and Standard SQL
- HiveQL supports features specific to the Hadoop ecosystem, such as partitioning and bucketing.
- HiveQL does not support every standard SQL feature: early versions have no transactions, and UPDATE and DELETE operations are restricted.
- HiveQL queries are converted into MapReduce, Tez, or Spark jobs for execution.
- HiveQL supports user-defined functions (UDFs) and other extensions.
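Because each query is compiled into jobs for an execution engine, it can help to look at the plan Hive generates. The following is a minimal sketch, assuming an employees table like the one created later in this tutorial and a cluster on which the chosen engine is installed:

```sql
-- Show the execution plan (stages and operators) without running the query
EXPLAIN
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department;

-- Choose the execution engine for the current session (mr, tez, or spark);
-- this only works if that engine is installed and configured on the cluster
SET hive.execution.engine=tez;
```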
1.2 Main Features of HiveQL
- Creating, altering, and dropping databases and tables
- Loading and exporting data
- Querying and analyzing data
- Aggregating data and computing statistics
- Joins and subqueries
- Managing partitions and buckets
2. Database Operations
2.1 Creating a Database
Basic syntax for creating a database:
CREATE DATABASE [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path] [WITH DBPROPERTIES (property_name=property_value, ...)];
Example:
-- Create a database named test_db
CREATE DATABASE test_db;
-- Create a database with a comment, a location, and properties (skipped if test_db already exists)
CREATE DATABASE IF NOT EXISTS test_db
COMMENT 'This is a test database'
LOCATION '/user/hive/warehouse/test_db.db'
WITH DBPROPERTIES ('creator'='admin', 'create_date'='2025-01-15');
2.2 Viewing Databases
-- List all databases
SHOW DATABASES;
-- List databases whose names match a pattern
SHOW DATABASES LIKE 'test*';
-- Show basic information about a database
DESCRIBE DATABASE test_db;
-- Show extended information, including DBPROPERTIES
DESCRIBE DATABASE EXTENDED test_db;
2.3 Using a Database
-- Switch to the specified database
USE test_db;
2.4 Altering a Database
-- Change database properties
ALTER DATABASE test_db SET DBPROPERTIES ('creator'='user1');
-- Change the database location (affects only tables created afterwards; existing data is not moved).
-- Hive may require a fully qualified URI here, such as hdfs://<namenode>/new/path
ALTER DATABASE test_db SET LOCATION '/new/path';
2.5 Dropping a Database
-- Drop an empty database
DROP DATABASE test_db;
-- Force-drop a database together with the tables it contains
DROP DATABASE IF EXISTS test_db CASCADE;
3. Table Operations
3.1 Creating a Table
Basic syntax for creating a table:
CREATE TABLE [IF NOT EXISTS] table_name
  (col_name data_type [COMMENT col_comment], ...)
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)];
Data types
- Primitive types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, BOOLEAN, STRING, VARCHAR, CHAR, TIMESTAMP, DATE
- Complex types: ARRAY, MAP, STRUCT, UNIONTYPE
Storage formats
- TEXTFILE: plain-text file format (the default)
- ORC: Optimized Row Columnar format
- PARQUET: columnar storage format
- SEQUENCEFILE: binary key-value sequence file format
- AVRO: row-oriented binary serialization format whose schema is defined in JSON
Example:
-- Create a basic table
CREATE TABLE IF NOT EXISTS employees (
id INT,
name STRING,
age INT,
department STRING,
salary DOUBLE
) COMMENT 'Employee information'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/employees';
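-- Create a partitioned table
-- (date and timestamp are reserved words in Hive 1.2 and later, so they are quoted with backticks;
-- TEXTFILE matches the delimited row format and lets LOAD DATA ingest plain-text log files)
CREATE TABLE IF NOT EXISTS logs (
  id INT,
  user_id STRING,
  action STRING,
  `timestamp` BIGINT
) COMMENT 'User logs'
PARTITIONED BY (`date` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;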
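The CLUSTERED BY clause from the syntax above creates a bucketed table. The following is a minimal sketch; the table user_actions and the bucket count of 8 are hypothetical choices made for illustration:

```sql
-- Hypothetical table: rows are hashed on user_id into 8 bucket files
CREATE TABLE IF NOT EXISTS user_actions (
  user_id STRING,
  action STRING
)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 8 BUCKETS
STORED AS ORC;

-- Read a sample corresponding to the first of the 8 buckets
SELECT * FROM user_actions TABLESAMPLE (BUCKET 1 OUT OF 8 ON user_id);
```

Bucketing on a column that is frequently used as a join key or for sampling is a common choice, but the right bucket count depends on data volume.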
-- Create a table with complex-type columns
CREATE TABLE IF NOT EXISTS students (
id INT,
name STRING,
courses ARRAY<STRING>,
scores MAP<STRING, INT>,
address STRUCT<city:STRING, street:STRING, zip:INT>
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'
LINES TERMINATED BY '\n';
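Once a table has complex-type columns, their elements can be read with index, key, and dot notation. The following sketch runs against the students table above; the map key 'math' is a hypothetical value that may not exist in your data:

```sql
SELECT
  name,
  courses[0]     AS first_course,  -- ARRAY elements are indexed from 0
  size(courses)  AS course_count,  -- number of elements in the array
  scores['math'] AS math_score,    -- hypothetical key; NULL when the key is absent
  address.city   AS city           -- STRUCT fields use dot notation
FROM students;

-- Produce one output row per (student, course) pair
SELECT name, course
FROM students
LATERAL VIEW explode(courses) c AS course;
```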
3.2 Viewing Tables
-- List all tables in the current database
SHOW TABLES;
-- List tables whose names match a pattern
SHOW TABLES LIKE 'emp*';
-- Show the table structure
DESCRIBE employees;
DESC employees;
-- Show detailed table information
DESCRIBE EXTENDED employees;
DESC FORMATTED employees;
3.3 Altering a Table
-- Rename a table (each statement in this section is shown on its own; later examples still use the name employees)
ALTER TABLE employees RENAME TO new_employees;
-- Add columns
ALTER TABLE employees ADD COLUMNS (email STRING COMMENT 'Employee email');
-- Rename a column or change its type and comment
ALTER TABLE employees CHANGE COLUMN age employee_age INT COMMENT 'Employee age';
-- Replace the entire column list
ALTER TABLE employees REPLACE COLUMNS (
  id INT,
  name STRING,
  age INT,
  email STRING
);
-- Set table properties
ALTER TABLE employees SET TBLPROPERTIES ('creator'='admin');
3.4 Dropping a Table
-- Remove all rows but keep the table definition
TRUNCATE TABLE employees;
-- Drop the table
DROP TABLE IF EXISTS employees;
4. Loading and Exporting Data
4.1 Loading Data into a Table
Loading data from the local file system:
LOAD DATA LOCAL INPATH '/path/to/local/file' [OVERWRITE] INTO TABLE table_name [PARTITION (partition_column=partition_value, ...)];
Loading data from HDFS:
LOAD DATA INPATH '/path/to/hdfs/file' [OVERWRITE] INTO TABLE table_name [PARTITION (partition_column=partition_value, ...)];
Example:
-- Load a local file into the employees table
LOAD DATA LOCAL INPATH '/home/user/employees.csv' INTO TABLE employees;
-- Load a file from HDFS, overwriting the existing data (the source file is moved, not copied)
LOAD DATA INPATH '/user/hive/input/employees.csv' OVERWRITE INTO TABLE employees;
-- Load data into a specific partition (date is a reserved word, so it is quoted with backticks)
LOAD DATA LOCAL INPATH '/home/user/logs_2025-01-15.txt' INTO TABLE logs PARTITION (`date`='2025-01-15');
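After data has been loaded into partitions, the partitions themselves can be listed, added, filtered on, and dropped. The following is a minimal sketch against the logs table; the partition value '2025-01-16' is a hypothetical example, and date is quoted with backticks because it is a reserved word in recent Hive versions:

```sql
-- List the partitions that currently exist
SHOW PARTITIONS logs;

-- Register an empty partition ahead of loading data into it
ALTER TABLE logs ADD IF NOT EXISTS PARTITION (`date`='2025-01-16');

-- Filtering on the partition column lets Hive read only the matching partition
SELECT action, COUNT(*) AS action_count
FROM logs
WHERE `date` = '2025-01-15'
GROUP BY action;

-- Drop a partition (for a managed table, its data is removed as well)
ALTER TABLE logs DROP IF EXISTS PARTITION (`date`='2025-01-16');
```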
4.2 Inserting Data from Query Results
-- Insert a single row
INSERT INTO TABLE employees VALUES (1, 'John', 30, 'IT', 50000.0);
-- Insert the result of a query, replacing the target table's contents
INSERT OVERWRITE TABLE employees_backup
SELECT * FROM employees WHERE department='IT';
-- Multi-table insert: scan employees once and write to several tables
FROM employees
INSERT OVERWRITE TABLE it_employees SELECT * WHERE department='IT'
INSERT OVERWRITE TABLE hr_employees SELECT * WHERE department='HR';
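Query results can also be written into a partitioned table, with Hive deriving each row's partition from the data (a dynamic-partition insert). The following is a sketch under assumptions: raw_logs is a hypothetical staging table whose columns (id, user_id, action, ts, log_date) line up by position with the logs table and its partition column.

```sql
-- Allow all partition columns to be determined from the data
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The last column in the SELECT list supplies the partition value
INSERT OVERWRITE TABLE logs PARTITION (`date`)
SELECT id, user_id, action, ts, log_date
FROM raw_logs;
```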
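4.3 Exporting Data
-- Write query results to an HDFS directory
INSERT OVERWRITE DIRECTORY '/user/hive/output/employees'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM employees;
-- Write query results to the local file system
INSERT OVERWRITE LOCAL DIRECTORY '/path/to/local/export'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM employees;
-- Export a table's data and metadata to an HDFS path (EXPORT does not write to the local file system)
EXPORT TABLE employees TO '/user/hive/export/employees';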
5. Basic Query Operations
5.1 The SELECT Statement
Basic syntax:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [ORDER BY col_list [ASC|DESC]] [LIMIT [offset,] row_count];
Example:
-- Select all columns
SELECT * FROM employees;
-- Select specific columns
SELECT id, name, department FROM employees;
-- Select distinct values
SELECT DISTINCT department FROM employees;
-- Use column aliases
SELECT id AS employee_id, name AS employee_name FROM employees;
-- Filter rows with a condition
SELECT * FROM employees WHERE department='IT' AND salary > 50000;
-- Sort the results
SELECT * FROM employees ORDER BY salary DESC, name ASC;
-- Limit the number of rows returned
SELECT * FROM employees LIMIT 10;
-- Skip the first 10 rows and return the next 5 (LIMIT offset, row_count)
SELECT * FROM employees ORDER BY id LIMIT 10, 5;
5.2 Aggregate Functions
Common aggregate functions:
- COUNT(): counts rows
- SUM(): computes the sum
- AVG(): computes the average
- MIN(): returns the minimum value
- MAX(): returns the maximum value
Example:
-- Count all employees
SELECT COUNT(*) AS total_employees FROM employees;
-- Compute the average salary
SELECT AVG(salary) AS avg_salary FROM employees;
-- Count employees in each department
SELECT department, COUNT(*) AS emp_count FROM employees GROUP BY department;
-- Compute the average salary per department and keep only departments above 50000
SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department HAVING AVG(salary) > 50000;
5.3 Join Queries
Supported join types:
- INNER JOIN: inner join
- LEFT JOIN: left outer join
- RIGHT JOIN: right outer join
- FULL OUTER JOIN: full outer join
- CROSS JOIN: cross join (Cartesian product)
Example:
-- Create a departments table
CREATE TABLE departments (
dept_id INT,
dept_name STRING,
location STRING
);
-- Inner join
SELECT e.id, e.name, e.department, d.location
FROM employees e
INNER JOIN departments d ON e.department = d.dept_name;
-- Left outer join
SELECT e.id, e.name, d.dept_name, d.location
FROM employees e
LEFT JOIN departments d ON e.department = d.dept_name;
-- Right outer join
SELECT e.id, e.name, d.dept_name, d.location
FROM employees e
RIGHT JOIN departments d ON e.department = d.dept_name;
-- Full outer join
SELECT e.id, e.name, d.dept_name, d.location
FROM employees e
FULL OUTER JOIN departments d ON e.department = d.dept_name;
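CROSS JOIN appears in the list of join types but not in the examples above. A minimal sketch using the same two tables:

```sql
-- Pair every employee with every department:
-- the result has (rows in employees) x (rows in departments) rows
SELECT e.name, d.dept_name
FROM employees e
CROSS JOIN departments d;
```

Because the result grows multiplicatively, cross joins are best kept to small tables. Depending on its configuration, Hive may reject Cartesian products when strict checks are enabled.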
5.4 Subqueries
Example:
-- Subquery used as a filter condition
SELECT * FROM employees
WHERE department IN (SELECT dept_name FROM departments WHERE location='Beijing');
-- Subquery used as a derived table
SELECT dept_name, emp_count
FROM (
SELECT department AS dept_name, COUNT(*) AS emp_count
FROM employees
GROUP BY department
) AS dept_stats
WHERE emp_count > 10;
-- EXISTS subquery
SELECT * FROM employees e
WHERE EXISTS (SELECT 1 FROM departments d WHERE e.department = d.dept_name AND d.location='Shanghai');
6. Common Functions
6.1 String Functions
-- String length
SELECT LENGTH(name) FROM employees;
-- String concatenation
SELECT CONCAT(name, ' - ', department) FROM employees;
-- Split a string into an array
SELECT SPLIT('a,b,c', ',');
-- String replacement
SELECT REPLACE('Hello World', 'World', 'Hive');
-- Substring (positions start at 1)
SELECT SUBSTRING(name, 1, 3) FROM employees;
-- Convert to upper case / lower case
SELECT UPPER(name), LOWER(name) FROM employees;
6.2 Numeric Functions
-- Absolute value
SELECT ABS(-100);
-- Round up / round down
SELECT CEIL(3.14), FLOOR(3.14);
-- Round to a given number of decimal places
SELECT ROUND(3.14159, 2);
-- Random number
SELECT RAND();
-- Exponentiation
SELECT POW(2, 3);
-- Modulo
SELECT MOD(10, 3);
6.3 Date Functions
-- Current date and timestamp
SELECT CURRENT_DATE(), CURRENT_TIMESTAMP();
-- Format a date
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(), 'yyyy-MM-dd HH:mm:ss');
-- Difference between two dates, in days
SELECT DATEDIFF('2025-01-15', '2025-01-01');
-- Add days to a date
SELECT DATE_ADD('2025-01-01', 10);
-- Extract parts of a date
SELECT YEAR('2025-01-15'), MONTH('2025-01-15'), DAY('2025-01-15');
6.4 Conditional Functions
-- CASE expression
SELECT name, salary,
CASE
WHEN salary > 80000 THEN 'High'
WHEN salary > 50000 THEN 'Medium'
ELSE 'Low'
END AS salary_level
FROM employees;
-- IF function
SELECT name, IF(salary > 50000, 'High', 'Low') AS salary_level FROM employees;
-- COALESCE function
SELECT COALESCE(email, 'no-email@example.com') FROM employees;
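Hands-on Exercises
Exercise 1: Database Operations
- Create a database named company_db with an appropriate comment and properties
- List all databases and confirm that company_db was created
- Switch to the company_db database
- View the detailed information for company_db
- Modify the properties of company_db
Exercise 2: Table Operations
- In company_db, create an employees table with the columns id, name, age, department, and salary
- Create a partitioned orders table, partitioned by date
- View the structure and detailed information of the employees table
- Add an email column to the employees table
- Rename the employees table to staff
Exercise 3: Loading and Querying Data
- Create a local CSV file containing employee data
- Load the local CSV file into the staff table
- Query all employee records
- Query the employees in the IT department, sorted by salary in descending order
- Compute the number of employees and the average salary for each department
- Query the employees whose salary is above the average
Exercise 4: Using Functions
- Use string functions to process employee names
- Use numeric functions to compute various statistics on employee salaries
- Use date functions to process date values
- Use conditional functions to classify employees into salary levels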
7. Summary
This tutorial introduced the basic syntax and common operations of HiveQL, including:
- Basic HiveQL concepts and how HiveQL differs from standard SQL
- Creating, viewing, using, and dropping databases
- Creating, altering, viewing, and dropping tables
- Methods for loading and exporting data
- Basic query operations, including SELECT, WHERE, ORDER BY, and LIMIT
- Aggregate functions and grouped queries
- Joins and subqueries
- Using common built-in functions
After working through this tutorial, you should have a grasp of basic HiveQL syntax and be able to use HiveQL for basic data querying and analysis.