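Advanced HiveQL Queries
1. Advanced Joins
1.1 Self-Join
A self-join joins a table to itself. It is typically used when rows in the same table are related to one another, for example through a hierarchy.
Example: list each employee together with their manager
-- Create an employee table with a manager ID column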
CREATE TABLE IF NOT EXISTS emp_with_manager (
id INT,
name STRING,
department STRING,
salary DOUBLE,
manager_id INT
);
-- Insert sample data
INSERT INTO emp_with_manager VALUES
(1, 'Alice', 'IT', 80000, NULL),
(2, 'Bob', 'IT', 60000, 1),
(3, 'Charlie', 'HR', 50000, NULL),
(4, 'David', 'HR', 45000, 3),
(5, 'Eve', 'IT', 70000, 1);
-- Self-join: list each employee with their manager's name (NULL for employees without a manager)
SELECT e.name AS employee_name, e.department, e.salary,
m.name AS manager_name
FROM emp_with_manager e
LEFT JOIN emp_with_manager m ON e.manager_id = m.id;
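The same self-join can be turned around to look at the relationship from the manager's side. The following is a minimal sketch, assuming the emp_with_manager table and sample rows above; the direct_reports alias is only illustrative.

```sql
-- Count direct reports per manager.
-- An inner join is used, so employees who manage nobody do not appear.
SELECT m.id AS manager_id,
       m.name AS manager_name,
       COUNT(e.id) AS direct_reports
FROM emp_with_manager m
JOIN emp_with_manager e ON e.manager_id = m.id
GROUP BY m.id, m.name;
```

With the sample data, Alice should be listed with 2 direct reports and Charlie with 1.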
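1.2 Cross Join (Cartesian Product)
A cross join returns the Cartesian product of two tables: every row of the first table is paired with every row of the second, so the result size is the product of the two row counts.
-- Create two simple tables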
CREATE TABLE IF NOT EXISTS colors (
color STRING
);
CREATE TABLE IF NOT EXISTS sizes (
size STRING
);
-- Insert data
INSERT INTO colors VALUES ('Red'), ('Green'), ('Blue');
INSERT INTO sizes VALUES ('Small'), ('Medium'), ('Large');
-- Cross join
SELECT c.color, s.size
FROM colors c
CROSS JOIN sizes s;
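Because the result grows multiplicatively, it can help to check its size before running a cross join on larger tables. A small sketch using the colors and sizes tables above:

```sql
-- Each table holds 3 rows, so the cross join produces 3 * 3 = 9 combinations.
SELECT COUNT(*) AS combination_count
FROM colors c
CROSS JOIN sizes s;
```

Some Hive deployments are configured to reject Cartesian products in strict mode, so whether a cross join runs unchanged depends on the cluster settings.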
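1.3 Semi-Join and Anti-Semi-Join
A semi-join (written with IN) returns the rows of one table that have at least one match in a subquery result; an anti-semi-join (written with NOT IN) returns the rows that have no match. In both cases only columns of the outer table are returned. Note that NOT IN returns no rows at all if the subquery result contains a NULL. The examples assume employees and departments tables that share a department column.
-- Semi-join: employees whose department is located in Beijing
SELECT *
FROM employees
WHERE department IN (
SELECT department FROM departments WHERE location = 'Beijing'
);
-- Anti-semi-join: employees whose department is not located in Beijing
SELECT *
FROM employees
WHERE department NOT IN (
SELECT department FROM departments WHERE location = 'Beijing'
);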
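Hive also has a dedicated LEFT SEMI JOIN operator, and an anti-join can be written as a LEFT JOIN that keeps only unmatched rows. The sketch below assumes employees and departments tables (neither is defined in this tutorial) that both have a department column.

```sql
-- Semi-join: the right-hand table may only be referenced in the ON clause,
-- and each employee appears at most once even if several departments match.
SELECT e.*
FROM employees e
LEFT SEMI JOIN departments d
  ON e.department = d.department AND d.location = 'Beijing';

-- Anti-join: keep employees for which no Beijing department matched.
SELECT e.*
FROM employees e
LEFT JOIN departments d
  ON e.department = d.department AND d.location = 'Beijing'
WHERE d.department IS NULL;
```

Unlike NOT IN, the LEFT JOIN form still returns rows when departments.department contains NULL values, and it also keeps employees whose own department is NULL, so the two forms are not interchangeable when NULLs are present.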
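2. Advanced Subqueries
2.1 Correlated Subqueries
A correlated subquery references a column of the outer query, so it is logically evaluated once for each outer row (the optimizer is free to rewrite it as a join).
-- Employees whose salary is above the average salary of their own department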
SELECT e1.*
FROM employees e1
WHERE e1.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department = e1.department
);
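On Hive versions or settings where a correlated scalar subquery is not accepted, the same result can usually be obtained by joining against a per-department aggregate. A sketch, assuming the same employees table with department and salary columns:

```sql
-- Compute each department's average once, then compare every employee to it.
SELECT e.*
FROM employees e
JOIN (
  SELECT department, AVG(salary) AS avg_salary
  FROM employees
  GROUP BY department
) d ON e.department = d.department
WHERE e.salary > d.avg_salary;
```

This form also makes it explicit that the average is computed once per department rather than once per employee.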
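2.2 EXISTS and NOT EXISTS
EXISTS and NOT EXISTS test only whether a subquery returns any rows. They are often preferred over IN and NOT IN: NOT EXISTS is not affected by NULL values in the subquery, and the optimizer can usually execute both as semi-joins or anti-joins. Any performance difference depends on the Hive version and the execution plan.
-- EXISTS: employees who have at least one subordinate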
SELECT * FROM emp_with_manager m
WHERE EXISTS (
SELECT 1 FROM emp_with_manager e
WHERE e.manager_id = m.id
);
-- NOT EXISTS: employees who have no subordinates
SELECT * FROM emp_with_manager m
WHERE NOT EXISTS (
SELECT 1 FROM emp_with_manager e
WHERE e.manager_id = m.id
);
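The NULL behaviour is the main practical difference between NOT EXISTS and NOT IN. A sketch using the emp_with_manager sample data, where two rows have a NULL manager_id:

```sql
-- Under standard SQL semantics this returns no rows: the subquery result
-- contains NULL, so "id NOT IN (...)" is never true.
SELECT *
FROM emp_with_manager m
WHERE m.id NOT IN (SELECT manager_id FROM emp_with_manager);

-- Filtering out NULLs makes NOT IN agree with the NOT EXISTS query.
SELECT *
FROM emp_with_manager m
WHERE m.id NOT IN (
  SELECT manager_id FROM emp_with_manager WHERE manager_id IS NOT NULL
);
```

With the sample rows, the second query should return Bob, David, and Eve, the employees who manage nobody.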
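2.3 Scalar Subqueries
A scalar subquery returns exactly one row and one column, so it can be used wherever a single value is expected: in the SELECT list, the WHERE clause, or the HAVING clause. Support for scalar subqueries varies across Hive versions, so check the documentation for the release you run.
-- Employee details alongside the company-wide average salary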
SELECT e.*, (
SELECT AVG(salary) FROM employees
) AS avg_company_salary
FROM employees e;
-- Employees who earn more than the company-wide average salary
SELECT * FROM employees
WHERE salary > (
SELECT AVG(salary) FROM employees
);
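A scalar subquery over the whole table can also be expressed as a join against a one-row aggregate, which may be useful on Hive versions that restrict where subqueries can appear. A sketch, assuming the same employees table:

```sql
-- The derived table "a" has exactly one row, so the cross join
-- simply attaches the company-wide average to every employee.
SELECT e.*, a.avg_company_salary
FROM employees e
CROSS JOIN (
  SELECT AVG(salary) AS avg_company_salary
  FROM employees
) a
WHERE e.salary > a.avg_company_salary;
```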
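3. Window Functions
Window functions have been available since Hive 0.11. They compute a value for each row over a window of related rows in the query result, without reducing the number of rows returned.
3.1 Basic Syntax of Window Functions
function_name() OVER (
[PARTITION BY partition_expression, ...]
[ORDER BY sort_expression [ASC|DESC], ...]
[(ROWS | RANGE) BETWEEN frame_start AND frame_end]
)
3.2 Common Window Functions
3.2.1 Ranking Functions
- ROW_NUMBER(): assigns a unique sequential number to each row in the partition
- RANK(): assigns a rank; tied rows share a rank and the ranks that follow are skipped (1, 1, 3)
- DENSE_RANK(): assigns a rank; tied rows share a rank and the next rank is consecutive (1, 1, 2)
- NTILE(n): divides the partition's rows into n buckets of nearly equal size and returns each row's bucket number
-- Rank employees by salary within each department using the different ranking functions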
SELECT
department, name, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_num,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank_num,
NTILE(2) OVER (PARTITION BY department ORDER BY salary DESC) AS ntile_num
FROM employees;
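A common use of ranking functions is selecting the top N rows per group. Window functions cannot be referenced in WHERE directly, so the ranking is computed in a subquery first. A sketch assuming the employees table used above; the choice of N = 2 is arbitrary.

```sql
-- Two highest-paid employees in each department.
SELECT department, name, salary
FROM (
  SELECT department, name, salary,
         ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
  FROM employees
) ranked
WHERE rn <= 2;
```

ROW_NUMBER returns exactly two rows per department even when salaries tie; swapping in RANK or DENSE_RANK would keep all tied rows instead.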
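3.2.2 Aggregate Window Functions
- SUM(): sum of the values in the window
- AVG(): average of the values in the window
- MIN(): minimum value in the window
- MAX(): maximum value in the window
- COUNT(): number of rows in the window
-- Cumulative salary and moving-average salary within each department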
SELECT
department, name, salary,
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS cumulative_salary,
AVG(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg_salary
FROM employees;
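When the OVER clause has PARTITION BY but no ORDER BY, the aggregate covers the whole partition, which is handy for comparing a row to its group. A sketch on the same assumed employees table; the output aliases are illustrative.

```sql
SELECT department, name, salary,
       SUM(salary) OVER (PARTITION BY department) AS department_total,
       salary / SUM(salary) OVER (PARTITION BY department) AS share_of_department,
       salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_department_avg
FROM employees;
```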
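3.2.3 Analytic Functions
- LAG(col, n, default): returns the value from the row n rows before the current row (n defaults to 1), or default if there is no such row
- LEAD(col, n, default): returns the value from the row n rows after the current row (n defaults to 1), or default if there is no such row
- FIRST_VALUE(): returns the value from the first row of the window frame
- LAST_VALUE(): returns the value from the last row of the window frame
-- Use analytic functions to look at neighbouring and boundary salaries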
SELECT
department, name, salary,
LAG(salary, 1, 0) OVER (PARTITION BY department ORDER BY salary) AS prev_salary,
LEAD(salary, 1, 0) OVER (PARTITION BY department ORDER BY salary) AS next_salary,
FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary) AS first_salary,
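-- An explicit full-partition frame is needed here: with the default frame, LAST_VALUE stops at the current row (or its last peer)
LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_salary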
FROM employees;
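LAG and LEAD are often used to compute row-to-row differences. The sketch below, on the same assumed employees table, omits the default argument so that the first row of each department yields NULL instead of a misleading difference against 0.

```sql
SELECT department, name, salary,
       salary - LAG(salary) OVER (PARTITION BY department ORDER BY salary) AS gap_to_previous
FROM employees;
```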
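3.3 Window Frame
The window frame defines which rows of the partition a window function is computed over. When ORDER BY is specified without an explicit frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which runs from the start of the partition to the current row and includes rows tied with it; without ORDER BY, the frame is the whole partition.
-- Examples of different window frames
SELECT
department, name, salary,
-- Default frame: from the start of the partition to the current row
SUM(salary) OVER (PARTITION BY department ORDER BY salary) AS frame_default,
-- From 3 rows before the current row to the current row
SUM(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS frame_3_prev_current,
-- From 1 row before the current row to 1 row after it
SUM(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS frame_1_prev_1_following,
-- From the start of the partition to its end
SUM(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS frame_full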
FROM employees;
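ROWS frames count physical rows, while RANGE frames are defined by the ORDER BY value, so the two differ when several rows share the same salary. A sketch for comparing them on the assumed employees table:

```sql
SELECT department, name, salary,
       -- Running total row by row; tied salaries get different totals.
       SUM(salary) OVER (PARTITION BY department ORDER BY salary
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_rows,
       -- Running total by value; rows with the same salary get the same total.
       SUM(salary) OVER (PARTITION BY department ORDER BY salary
                         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_range
FROM employees;
```

For example, if two employees in a department both earn 60000 and nobody earns less, running_rows gives 60000 and 120000 (in an unspecified order), while running_range gives 120000 for both.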
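4. Advanced Aggregation
4.1 GROUPING SETS
GROUPING SETS lets a single query aggregate at several grouping levels at once. The result is equivalent to the UNION ALL of one GROUP BY query per grouping set, with NULL in the columns that a given set does not group by.
-- Sales totals at several grouping levels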
department, region, SUM(sales) AS total_sales
FROM sales_data
GROUP BY department, region GROUPING SETS (
(department, region),
(department),
(region),
()
);
-- Equivalent to the following UNION ALL query
SELECT department, region, SUM(sales) FROM sales_data GROUP BY department, region
UNION ALL
SELECT department, NULL, SUM(sales) FROM sales_data GROUP BY department
UNION ALL
SELECT NULL, region, SUM(sales) FROM sales_data GROUP BY region
UNION ALL
SELECT NULL, NULL, SUM(sales) FROM sales_data;
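In the combined result, a NULL in department or region can mean either "all values" (a subtotal row) or a genuine NULL in the data. Hive exposes the GROUPING__ID virtual column (two underscores) to tell the grouping sets apart. A sketch on the same sales_data table; the numeric values assigned to each set differ between Hive versions, so inspect them on your installation rather than relying on specific numbers.

```sql
SELECT department, region,
       SUM(sales) AS total_sales,
       GROUPING__ID AS grouping_id  -- identifies which grouping set produced the row
FROM sales_data
GROUP BY department, region
GROUPING SETS ((department, region), (department), (region), ());
```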
4.2 CUBE
CUBE generates every possible grouping combination: it groups by each subset of the listed columns, so n columns produce 2^n grouping sets.
-- Sales totals with CUBE
SELECT
department, region, SUM(sales) AS total_sales
FROM sales_data
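-- WITH CUBE is the form given in the Hive documentation; newer Hive releases may also accept CUBE(department, region)
GROUP BY department, region WITH CUBE;
-- Equivalent to GROUPING SETS ((department, region), (department), (region), ())
4.3 ROLLUP
ROLLUP generates grouping sets from the most detailed level up to the grand total by dropping columns from right to left, which suits hierarchical data.
-- Sales totals with ROLLUP (department, then region, as a hierarchy)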
SELECT
department, region, SUM(sales) AS total_sales
FROM sales_data
GROUP BY department, region WITH ROLLUP;
-- Equivalent to GROUPING SETS ((department, region), (department), ())
-- Hierarchy of year, quarter, and month
SELECT
year, quarter, month, SUM(sales) AS total_sales
FROM sales_data
GROUP BY year, quarter, month WITH ROLLUP;
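ROLLUP over n columns expands to n + 1 grouping sets, dropping one column at a time from the right. Written out with GROUPING SETS, the year, quarter, month example looks like this (same assumed sales_data table):

```sql
SELECT year, quarter, month, SUM(sales) AS total_sales
FROM sales_data
GROUP BY year, quarter, month
GROUPING SETS (
  (year, quarter, month),  -- monthly totals
  (year, quarter),         -- quarterly subtotals
  (year),                  -- yearly subtotals
  ()                       -- grand total
);
```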
5. Querying Complex Data Types
5.1 Querying ARRAY Types
-- Create a table with an ARRAY column
CREATE TABLE IF NOT EXISTS students (
id INT,
name STRING,
courses ARRAY<STRING>
);
-- Insert data (INSERT ... SELECT is used because, per the Hive documentation, INSERT ... VALUES does not accept complex-type values)
INSERT INTO TABLE students
SELECT 1, 'Alice', ARRAY('Math', 'English', 'Science')
UNION ALL
SELECT 2, 'Bob', ARRAY('Math', 'History')
UNION ALL
SELECT 3, 'Charlie', ARRAY('English', 'Science', 'Art');
-- Array length
SELECT id, name, courses, SIZE(courses) AS course_count FROM students;
-- Students enrolled in a specific course
SELECT id, name FROM students WHERE array_contains(courses, 'Math');
-- Flatten the array into one row per element (LATERAL VIEW + explode)
SELECT s.id, s.name, c.course
FROM students s
LATERAL VIEW explode(courses) c AS course;
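Once an array is flattened, its elements can be aggregated like ordinary rows. A sketch using the students table above:

```sql
-- Number of students enrolled in each course.
SELECT c.course, COUNT(*) AS student_count
FROM students s
LATERAL VIEW explode(s.courses) c AS course
GROUP BY c.course;
```

With the sample data, Math, English, and Science should each have 2 students, and History and Art 1 each. Rows whose array is empty or NULL produce no output with a plain LATERAL VIEW; LATERAL VIEW OUTER keeps them with a NULL course.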
5.2 Querying MAP Types
-- Create a table with a MAP column
CREATE TABLE IF NOT EXISTS scores (
id INT,
name STRING,
subject_scores MAP<STRING, INT>
);
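-- Insert data (INSERT ... SELECT is used because, per the Hive documentation, INSERT ... VALUES does not accept complex-type values)
INSERT INTO TABLE scores
SELECT 1, 'Alice', MAP('Math', 90, 'English', 85, 'Science', 95)
UNION ALL
SELECT 2, 'Bob', MAP('Math', 80, 'History', 75)
UNION ALL
SELECT 3, 'Charlie', MAP('English', 88, 'Science', 92, 'Art', 90);
-- Score for a specific subject (NULL if the key is absent)
SELECT id, name, subject_scores['Math'] AS math_score FROM scores;
-- Students who have a score for a specific subject (map_contains_key is not a Hive built-in, so the key list is tested instead)
SELECT id, name FROM scores WHERE array_contains(map_keys(subject_scores), 'Math');
-- Flatten the MAP into one row per key/value pair (LATERAL VIEW + explode)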
SELECT s.id, s.name, ss.subject, ss.score
FROM scores s
LATERAL VIEW explode(subject_scores) ss AS subject, score;
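The exploded key/value pairs can be aggregated like ordinary rows, and map_keys, map_values, and size inspect a map without flattening it. A sketch using the scores table above; the output aliases are illustrative.

```sql
-- Average score per subject across all students.
SELECT ss.subject, AVG(ss.score) AS avg_score
FROM scores s
LATERAL VIEW explode(s.subject_scores) ss AS subject, score
GROUP BY ss.subject;

-- Keys, values, and number of entries of each student's map.
SELECT id, name,
       map_keys(subject_scores)   AS subjects,
       map_values(subject_scores) AS marks,
       size(subject_scores)       AS subject_count
FROM scores;
```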
5.3 Querying STRUCT Types
-- Create a table with a STRUCT column
CREATE TABLE IF NOT EXISTS addresses (
id INT,
name STRING,
address STRUCT<city:STRING, street:STRING, zip:INT>
);
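-- Insert data (INSERT ... SELECT is used because, per the Hive documentation, INSERT ... VALUES does not accept complex-type values)
INSERT INTO TABLE addresses
SELECT 1, 'Alice', NAMED_STRUCT('city', 'Beijing', 'street', 'Main St', 'zip', 100000)
UNION ALL
SELECT 2, 'Bob', NAMED_STRUCT('city', 'Shanghai', 'street', 'Second Ave', 'zip', 200000);
-- Access STRUCT fields with dot notation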
SELECT id, name, address.city, address.street, address.zip FROM addresses;
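STRUCT fields can also be used in WHERE and GROUP BY. A sketch using the addresses table above; the zip-code threshold is arbitrary and only there to show a filter.

```sql
-- Number of people per city, restricted by a zip-code threshold.
SELECT address.city AS city, COUNT(*) AS resident_count
FROM addresses
WHERE address.zip >= 100000
GROUP BY address.city;
```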
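6. Other Advanced Query Features
6.1 Conditional Aggregation
Conditional aggregation places a CASE WHEN expression inside an aggregate function, so that each aggregate only considers the rows that satisfy its condition.
-- Number of employees per salary band in each department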
SELECT
department,
COUNT(CASE WHEN salary < 50000 THEN 1 END) AS low_salary_count,
COUNT(CASE WHEN salary >= 50000 AND salary < 80000 THEN 1 END) AS medium_salary_count,
COUNT(CASE WHEN salary >= 80000 THEN 1 END) AS high_salary_count
FROM employees
GROUP BY department;
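The same pattern works with other aggregates. A sketch on the assumed employees table, reusing the 80000 threshold from the query above; the output aliases are illustrative.

```sql
SELECT department,
       -- Total payroll of the high-salary band (other rows contribute 0).
       SUM(CASE WHEN salary >= 80000 THEN salary ELSE 0 END) AS high_salary_total,
       -- Fraction of employees in the high-salary band.
       AVG(CASE WHEN salary >= 80000 THEN 1.0 ELSE 0.0 END) AS high_salary_share
FROM employees
GROUP BY department;
```

COUNT ignores the NULL produced when no WHEN branch matches, which is why the counting version needs no ELSE. AVG needs an explicit ELSE so that non-matching rows count as zero rather than being ignored.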
6.2 Rows to Columns and Columns to Rows
CASE WHEN combined with an aggregate function pivots rows into columns; LATERAL VIEW with explode unpivots columns into rows.
-- Rows to columns (pivot)
SELECT
name,
MAX(CASE WHEN subject = 'Math' THEN score END) AS math,
MAX(CASE WHEN subject = 'English' THEN score END) AS english,
MAX(CASE WHEN subject = 'Science' THEN score END) AS science
FROM student_scores
GROUP BY name;
-- Columns to rows (unpivot): exploding a MAP keeps each subject paired with its own score
-- (two separate explode calls would pair every subject with every score)
SELECT p.name, kv.subject, kv.score
FROM student_scores_pivot p
LATERAL VIEW explode(MAP('Math', p.math, 'English', p.english, 'Science', p.science)) kv AS subject, score;
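Unpivoting only works if each subject label stays paired with its own score column. One way to sketch this is with the built-in stack() table function, assuming student_scores_pivot has the columns name, math, english, and science (the table is not defined in this tutorial) and that the three score columns share one type.

```sql
-- stack(3, ...) emits 3 rows per input row, each made of one (label, value) pair.
SELECT p.name, t.subject, t.score
FROM student_scores_pivot p
LATERAL VIEW stack(3,
  'Math',    p.math,
  'English', p.english,
  'Science', p.science
) t AS subject, score;
```

Using two independent explode calls instead would pair every subject with every score, giving 9 rows per student rather than 3.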
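6.3 Recursive Queries
Hive does not support recursive common table expressions: its WITH clause has no RECURSIVE form. The WITH RECURSIVE query in this section is standard SQL, accepted by engines such as PostgreSQL, and is shown to illustrate the technique. On Hive, a hierarchy with a known maximum depth is usually traversed with a fixed number of self-joins, one per level.
-- Create the department hierarchy table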
CREATE TABLE IF NOT EXISTS department_hierarchy (
dept_id INT,
dept_name STRING,
parent_dept_id INT
);
-- Insert data
INSERT INTO department_hierarchy VALUES
(1, 'Company', NULL),
(2, 'IT', 1),
(3, 'HR', 1),
(4, 'Engineering', 2),
(5, 'Operations', 2),
(6, 'Recruiting', 3),
(7, 'Training', 3);
-- Recursive query over the department hierarchy (standard SQL; Hive does not run WITH RECURSIVE)
WITH RECURSIVE dept_tree AS (
-- Anchor member: the root node
SELECT dept_id, dept_name, parent_dept_id, 1 AS level
FROM department_hierarchy
WHERE parent_dept_id IS NULL
UNION ALL
-- Recursive member: child nodes
SELECT d.dept_id, d.dept_name, d.parent_dept_id, dt.level + 1 AS level
FROM department_hierarchy d
JOIN dept_tree dt ON d.parent_dept_id = dt.dept_id
)
SELECT * FROM dept_tree ORDER BY level, dept_id;
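If the Hive version in use does not accept WITH RECURSIVE (the Hive documentation states that recursive queries are not supported in its WITH clause), a hierarchy of known maximum depth can be walked with one self-join per level. The sketch below assumes at most three levels, as in the sample data, and names the level column depth; both choices are illustrative.

```sql
SELECT t.dept_id, t.dept_name, t.parent_dept_id, t.depth
FROM (
  -- Level 1: root departments
  SELECT l1.dept_id, l1.dept_name, l1.parent_dept_id, 1 AS depth
  FROM department_hierarchy l1
  WHERE l1.parent_dept_id IS NULL
  UNION ALL
  -- Level 2: children of a root
  SELECT l2.dept_id, l2.dept_name, l2.parent_dept_id, 2 AS depth
  FROM department_hierarchy l2
  JOIN department_hierarchy l1 ON l2.parent_dept_id = l1.dept_id
  WHERE l1.parent_dept_id IS NULL
  UNION ALL
  -- Level 3: grandchildren of a root
  SELECT l3.dept_id, l3.dept_name, l3.parent_dept_id, 3 AS depth
  FROM department_hierarchy l3
  JOIN department_hierarchy l2 ON l3.parent_dept_id = l2.dept_id
  JOIN department_hierarchy l1 ON l2.parent_dept_id = l1.dept_id
  WHERE l1.parent_dept_id IS NULL
) t
ORDER BY t.depth, t.dept_id;
```

With the sample rows, Company should come out at depth 1, IT and HR at depth 2, and the remaining four departments at depth 3. Deeper hierarchies need one more UNION ALL branch per level, or an iterative job that repeats the join until no new rows appear.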
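Practice Exercises
Exercise 1: Window Functions
- Use ROW_NUMBER(), RANK(), and DENSE_RANK() to rank employees by salary within each department
- Use the SUM() window function to compute each employee's cumulative salary
- Use LAG() and LEAD() to retrieve the previous and next salary for each employee
- Use NTILE() to split employees into 5 groups by salary
Exercise 2: Advanced Aggregation
- Use GROUPING SETS to query sales data at several grouping levels
- Use CUBE to query sales data for every combination of department and region
- Use ROLLUP to query sales data by year, quarter, and month
- Use conditional aggregation to count the employees in each salary band per department
Exercise 3: Querying Complex Data Types
- Create a table with an ARRAY column, then query the array length and the rows that contain a specific element
- Use LATERAL VIEW and explode to flatten an array and a MAP
- Create a table with a STRUCT column and query its fields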
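Exercise 4: Recursive Queries
- Create the department hierarchy table and insert sample data
- Query the department hierarchy level by level (with self-joins on Hive, or with WITH RECURSIVE on an engine that supports it)
- Compute the depth of each department in the hierarchy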
7. Summary
This tutorial covered the advanced query features of HiveQL, including:
- Advanced joins: self-join, cross join, semi-join, and anti-semi-join
- Advanced subqueries: correlated subqueries, EXISTS/NOT EXISTS, and scalar subqueries
- Window functions: ranking functions, aggregate window functions, analytic functions, and window frames
- Advanced aggregation: GROUPING SETS, CUBE, and ROLLUP
- Complex data types: querying and flattening ARRAY, MAP, and STRUCT values
- Other advanced features: conditional aggregation, pivoting rows to columns and columns to rows, and recursive queries
After working through this tutorial, you should be able to use these features to solve complex data query and analysis problems.