Hive Views and Indexes

Understand what Hive views and indexes do, and learn how to create and use views and what distinguishes the different index types.

1. Overview of Hive Views

A view is an important Hive tool for simplifying complex queries and controlling data access. A view is essentially a predefined query whose result is presented to users as a virtual table.

1.1 View features

  • Virtual table: a view stores no data itself, only the query definition
  • Simpler queries: a complex query can be encapsulated in a simple view, so users query the view directly instead of writing complex SQL
  • Data security: users can be restricted to the specific columns a view exposes, which implements data access control
  • Logical abstraction: a view hides changes to the underlying table structure and provides a stable data access interface
  • Read-only: Hive views are read-only and do not support data modification operations
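
The virtual-table and read-only properties above can be checked directly. The following is a minimal sketch, assuming a hypothetical table demo_users and view demo_users_public that are not part of this tutorial:

```sql
-- Hypothetical base table and view, used only for illustration
CREATE TABLE IF NOT EXISTS demo_users (
    user_id INT,
    username STRING,
    phone STRING
);

-- The view exposes two of the three columns
CREATE VIEW IF NOT EXISTS demo_users_public AS
SELECT user_id, username
FROM demo_users;

-- The output should report the table type as VIRTUAL_VIEW
-- and include the stored query text rather than any data location
DESCRIBE FORMATTED demo_users_public;

-- A write against the view is expected to be rejected, because views are read-only:
-- INSERT INTO demo_users_public VALUES (1, 'alice');
```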

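1.2 Advantages of views

  • Simplify complex queries and make queries more reusable
  • Implement data access control and protect sensitive data
  • Provide data abstraction that isolates users from changes to the underlying table structure
  • Reduce duplicated code and improve development efficiency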

2. Creating and Using Views

2.1 Creating a view

Use the CREATE VIEW statement to create a view:

-- Create a basic view
CREATE VIEW user_info AS
SELECT user_id, username, email, gender
FROM users
WHERE status = 'active';

-- Create a view with a comment
CREATE VIEW sales_summary (
    product_category,
    total_sales,
    avg_price
) COMMENT 'Product sales summary view'
AS
SELECT 
    category,
    SUM(amount) AS total_sales,
    AVG(price) AS avg_price
FROM sales
GROUP BY category;

-- Create a materialized view (supported in Hive 3.0+)
CREATE MATERIALIZED VIEW mv_sales_daily
AS
SELECT 
    order_date,
    SUM(amount) AS daily_sales,
    COUNT(*) AS order_count
FROM orders
GROUP BY order_date;

-- Create a view that filters on a partition column (assumes logs is partitioned by dt)
CREATE VIEW logs_today AS
SELECT id, message, ip
FROM logs
WHERE dt = current_date();
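
One assumption worth stating for the materialized view example: in Hive 3, a materialized view that takes part in automatic query rewriting is generally expected to read from transactional (ACID) tables. The sketch below shows what such a setup might look like. The names orders_acid and mv_orders_acid_daily and the column types are hypothetical, and the cluster is assumed to have ACID support configured:

```sql
-- Hypothetical transactional source table (name and columns are assumptions)
CREATE TABLE IF NOT EXISTS orders_acid (
    order_id BIGINT,
    order_date DATE,
    amount DECIMAL(12,2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Materialized view over the transactional table
CREATE MATERIALIZED VIEW IF NOT EXISTS mv_orders_acid_daily
AS
SELECT
    order_date,
    SUM(amount) AS daily_sales,
    COUNT(*) AS order_count
FROM orders_acid
GROUP BY order_date;
```

Check the documentation for your Hive version before relying on this requirement, since the details differ between releases and distributions.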

2.2 Querying a view

A view is queried in the same way as a regular table:

-- Query a basic view
SELECT * FROM user_info WHERE gender = 'female';

-- Query an aggregate view
SELECT product_category, total_sales
FROM sales_summary
ORDER BY total_sales DESC LIMIT 10;

-- Query a materialized view
SELECT order_date, daily_sales
FROM mv_sales_daily
WHERE order_date BETWEEN '2025-01-01' AND '2025-01-31';
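
Because a regular view stores only a query, Hive expands it at query time. One way to see this, sketched here with the user_info view from section 2.1, is to look at the query plan:

```sql
-- The plan should show a scan of the underlying users table,
-- with the view's status filter and the query's gender filter both applied
EXPLAIN
SELECT * FROM user_info WHERE gender = 'female';
```

The exact plan text varies with the Hive version and execution engine.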

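2.3 Viewing a view definition

Use DESCRIBE or SHOW CREATE TABLE to inspect a view definition:

-- Show the view structure
DESCRIBE user_info;
DESCRIBE EXTENDED user_info;
DESCRIBE FORMATTED user_info;

-- Show the statement that created the view (Hive uses SHOW CREATE TABLE for views as well)
SHOW CREATE TABLE user_info;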
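
To find views before inspecting them, Hive 2.2.0 and later provide SHOW VIEWS. A brief sketch follows; the pattern 'user*' is only an example:

```sql
-- List all views in the current database
SHOW VIEWS;

-- List views whose names match a wildcard pattern
SHOW VIEWS LIKE 'user*';
```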

2.4 Modifying and dropping a view

Use ALTER VIEW to modify a view and DROP VIEW to drop it:

-- Change the view definition
ALTER VIEW user_info AS
SELECT user_id, username, email, gender, age
FROM users
WHERE status = 'active';

-- Change view properties
ALTER VIEW user_info SET TBLPROPERTIES ('comment' = 'View of active user information');

-- Drop a view
DROP VIEW IF EXISTS user_info;

-- Drop several views (DROP VIEW takes one view name per statement)
DROP VIEW IF EXISTS sales_summary;
DROP VIEW IF EXISTS logs_today;

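2.5 Refreshing a materialized view

A materialized view must be rebuilt manually to pick up new data:

-- Rebuild the materialized view
ALTER MATERIALIZED VIEW mv_sales_daily REBUILD;

-- There is no separate INCREMENTAL keyword: Hive attempts an incremental
-- rebuild with the same statement and falls back to a full rebuild
-- when an incremental one is not possible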
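
To see which materialized views exist and whether one is actually being used, statements like the following may help. This is a sketch that assumes Hive 3.0+ and the mv_sales_daily view from section 2.1:

```sql
-- List materialized views in the current database
SHOW MATERIALIZED VIEWS;

-- Show the definition and metadata of one materialized view
DESCRIBE FORMATTED mv_sales_daily;

-- With rewriting enabled, a matching aggregate on the base table
-- may be answered from the materialized view; the plan shows which table is scanned
SET hive.materializedview.rewriting=true;
EXPLAIN
SELECT order_date, SUM(amount) AS daily_sales
FROM orders
GROUP BY order_date;
```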

3. Overview of Hive Indexes

An index is a Hive technique for speeding up queries. Creating an index reduces the amount of data that must be scanned at query time, which improves query performance.

3.1 How indexes work

An index works as follows:

  1. An index is created on specific columns of a table
  2. The index stores the column values together with the corresponding file locations
  3. When a query filters on that column, Hive consults the index first to find where the matching data is located
  4. Hive then scans only the files that contain matching data, which reduces I/O
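
The steps above can be made concrete by looking at what a COMPACT index stores. The sketch below assumes a Hive version before 3.0 and the index table idx_users_name_table that section 4.1 creates; the _bucketname and _offsets columns are the ones a COMPACT index table usually contains:

```sql
-- Each row maps an indexed value to a data file and the offsets within that file
SELECT username, `_bucketname`, `_offsets`
FROM idx_users_name_table
WHERE username = 'zhangsan';
```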

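3.2 Index types

Hive supports several kinds of index:

  • COMPACT index: stores the indexed column values and the corresponding file offsets
  • BITMAP index: suited to low-cardinality columns; stores the row positions for each value
  • Bloom filters: use a Bloom filter to test quickly whether a value may be present; in Hive they are built into the ORC and Parquet file formats rather than created with CREATE INDEX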

4. Creating and Using Indexes

4.1 Creating an index

Use the CREATE INDEX statement to create an index (available in Hive versions before 3.0):

-- Create a COMPACT index
CREATE INDEX idx_users_name
ON TABLE users (username)
AS 'COMPACT'
WITH DEFERRED REBUILD
IDXPROPERTIES ('creator'='admin', 'comment'='Index on user name')
IN TABLE idx_users_name_table
COMMENT 'Index table for user names';

-- Create a BITMAP index (CREATE INDEX has no Bloom filter handler; see the ORC example below)
CREATE INDEX idx_products_category
ON TABLE products (category)
AS 'BITMAP'
WITH DEFERRED REBUILD
IN TABLE idx_products_category_table;

-- Enable the built-in Bloom filter in an ORC table
CREATE TABLE orc_table (
    id INT,
    name STRING,
    category STRING
) STORED AS ORC
TBLPROPERTIES (
    'orc.bloom.filter.columns'='category',
    'orc.bloom.filter.fpp'='0.05'
);

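4.2 Building an index

An index created WITH DEFERRED REBUILD is empty until it is built. Use the ALTER INDEX statement to build it:

-- Build the indexes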
ALTER INDEX idx_users_name ON users REBUILD;
ALTER INDEX idx_products_category ON products REBUILD;

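4.3 Querying with an index

No special syntax is needed to use an index. When index-based optimization is enabled (hive.optimize.index.filter=true), Hive picks a suitable index automatically:

-- Queries that can use the indexes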
SELECT * FROM users WHERE username = 'zhangsan';
SELECT * FROM products WHERE category = 'electronics';
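
On Hive versions before 3.0, whether the optimizer uses an index depends on session configuration. The following sketch shows settings that are commonly needed; the value 0 for the minimum size is an assumption chosen so that small test tables qualify:

```sql
-- Let the optimizer use indexes for filter predicates
SET hive.optimize.index.filter=true;

-- Minimum input size in bytes before a COMPACT index is considered automatically
SET hive.optimize.index.filter.compact.minsize=0;

-- Compare the plan with and without the settings above
EXPLAIN
SELECT * FROM users WHERE username = 'zhangsan';
```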

4.4 Viewing indexes

Use SHOW INDEXES or DESCRIBE to view index information:

-- List all indexes on a table
SHOW INDEXES ON users;

-- Show index details
SHOW FORMATTED INDEXES ON users;

-- Show the structure of the index table
DESCRIBE idx_users_name_table;

4.5 Dropping an index

Use the DROP INDEX statement to drop an index:

-- Drop the indexes
DROP INDEX IF EXISTS idx_users_name ON users;
DROP INDEX IF EXISTS idx_products_category ON products;

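5. Views vs. Indexes

Feature            | View                                             | Index
Storage            | Stores only the query definition, no data        | Stores indexed column values and a map to data locations
Primary use        | Simplifying queries, data security, abstraction  | Speeding up queries, reducing I/O
Typical scenarios  | Encapsulating complex queries, access control    | Columns frequently used in filter conditions
Maintenance cost   | Low: only the view definition is maintained      | High: the index must be rebuilt regularly
Performance impact | The view definition is re-executed on each query | Faster reads, slower writes
Data consistency   | Always consistent with the underlying tables     | Needs a manual or automatic rebuild to stay consistent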

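6. Best Practices

6.1 View best practices

  • Use views judiciously: create views only for complex, frequently used queries
  • Keep views simple: avoid overly complex definitions and deeply nested views
  • Add comments: give each view a clear comment that describes its purpose and logic
  • Control permissions: use views to implement data access control and restrict users to specific columns
  • Maintain regularly: update dependent views promptly when the underlying table structure changes
  • Consider materialized views: for frequently queried aggregate views, a materialized view can improve performance

6.2 Index best practices

  • Choose the right columns: index columns that appear frequently in WHERE clauses
  • Consider cardinality: COMPACT indexes suit high-cardinality columns, BITMAP indexes suit low-cardinality columns
  • Limit the number of indexes: avoid indexing too many columns, since every index adds write overhead
  • Rebuild regularly: keep indexes consistent with the data and avoid using stale indexes
  • Use the file format: for ORC or Parquet files, the built-in Bloom filters are more efficient than traditional indexes
  • Measure the effect: test query performance before and after creating an index to confirm that it actually helps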

6.3 Notes

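  • Views are read-only and do not support INSERT, UPDATE, or DELETE
  • Indexes add overhead to writes, so query performance must be weighed against write performance
  • Materialized views must be rebuilt manually, otherwise their data may become stale
  • Traditional indexes were removed in Hive 3.0; on 3.0+ use materialized views or the indexes built into the file format instead
  • For small tables, an index may bring no performance gain and only add maintenance cost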

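7. Practical Use Cases for Views and Indexes

7.1 View use cases

Case 1: Data access control

Scenario

A company has a users table that contains sensitive information and needs to restrict each department to the user data of its own department.

Solution

-- Create the view for the sales department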
CREATE VIEW sales_users AS
SELECT user_id, username, email
FROM users
WHERE department = 'sales';

-- Create the view for the tech department
CREATE VIEW tech_users AS
SELECT user_id, username, email
FROM users
WHERE department = 'tech';

-- Grant each department's role access to its own view
GRANT SELECT ON sales_users TO ROLE sales_role;
GRANT SELECT ON tech_users TO ROLE tech_role;
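
The GRANT statements above presuppose that the roles already exist and have members, so the setup below would run before them. It is a minimal sketch, assuming SQL standards based authorization is enabled, that the current user may switch to the admin role, and that alice and bob are hypothetical user names:

```sql
-- Role management requires the admin role
SET ROLE ADMIN;

-- Create one role per department
CREATE ROLE sales_role;
CREATE ROLE tech_role;

-- Add users to the roles (alice and bob are placeholders)
GRANT sales_role TO USER alice;
GRANT tech_role TO USER bob;

-- After the view grants have been issued, verify what a role can access
SHOW GRANT ROLE sales_role ON TABLE sales_users;
```

For the restriction to be effective, these users should not also hold SELECT on the underlying users table.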

Case 2: Encapsulating a complex query

Scenario

The product sales ranking is queried frequently and involves joins across several tables plus complex aggregation.

Solution

-- Create the product sales ranking view
CREATE VIEW product_sales_rank AS
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    SUM(s.amount) AS total_sales,
    RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.amount) DESC) AS rank_in_category
FROM products p
JOIN sales s ON p.product_id = s.product_id
JOIN orders o ON s.order_id = o.order_id
WHERE o.order_date BETWEEN date_sub(current_date(), 30) AND current_date()
GROUP BY p.product_id, p.product_name, p.category;

With the view in place, querying the product sales ranking becomes very simple:

-- Query the sales ranking of all products
SELECT * FROM product_sales_rank ORDER BY total_sales DESC;

-- Query the sales ranking of a specific category
SELECT * FROM product_sales_rank WHERE category = 'electronics';
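
The rank_in_category column also makes top-N queries straightforward. For example, a sketch that returns the three best-selling products in each category:

```sql
-- Top three products per category, based on the last 30 days of sales
SELECT category, product_name, total_sales, rank_in_category
FROM product_sales_rank
WHERE rank_in_category <= 3
ORDER BY category, rank_in_category;
```

Because the view uses RANK(), products with equal sales share a rank, so a category can return more than three rows when there are ties.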

7.2 Index use case

Case: Speeding up user lookups

Scenario

The users table holds millions of rows and user information is often looked up by username, but query performance is poor.

Solution

-- Create an index on the username column (Hive versions before 3.0)
CREATE INDEX idx_users_username
ON TABLE users (username)
AS 'COMPACT'
WITH DEFERRED REBUILD
IN TABLE idx_users_username_table;

-- Build the index
ALTER INDEX idx_users_username ON users REBUILD;

Once the index is built, lookups by username can run noticeably faster on Hive versions before 3.0:

-- Queries that can benefit from the index
SELECT * FROM users WHERE username = 'zhangsan';
SELECT user_id, email FROM users WHERE username LIKE 'zhang%';
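
On Hive 3.0 and later, CREATE INDEX is not available, so the same lookup is usually sped up through the file format instead. The following is a sketch of that alternative, assuming a hypothetical copy of the table named users_orc:

```sql
-- Copy users into an ORC table with a Bloom filter on username
CREATE TABLE users_orc
STORED AS ORC
TBLPROPERTIES (
    'orc.bloom.filter.columns' = 'username',
    'orc.bloom.filter.fpp' = '0.05'
)
AS
SELECT * FROM users;

-- Enable predicate pushdown into the ORC reader
SET hive.optimize.index.filter=true;

-- Equality lookups can skip the parts of a file whose Bloom filter rules out the value
SELECT user_id, email FROM users_orc WHERE username = 'zhangsan';
```

Bloom filters help equality predicates; a prefix match such as LIKE 'zhang%' does not benefit from them.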

8. Summary

Views and indexes are important data management and query optimization tools in Hive, each with its own purpose and strengths:

  • Views: mainly used to simplify complex queries, implement data access control, and provide data abstraction; they suit complex queries that are run frequently
  • Indexes: mainly used to speed up queries and reduce I/O; they suit columns that are frequently used in filter conditions

In practice, choose between views and indexes according to the specific business scenario and performance requirements. On Hive 3.0+, use materialized views and indexes built into the file format (such as ORC Bloom filters) rather than traditional Hive indexes, which were removed in 3.0.

Used appropriately, views and indexes can significantly improve Hive query performance and development efficiency while strengthening data security and maintainability.