1. Log Analysis Case Study
1.1 Scenario
An e-commerce site generates a large volume of access logs every day. Hive is used to analyze these logs to understand user behavior, traffic patterns, and business metrics.
1.2 Log Format
192.168.1.1 - - [10/Jan/2025:13:55:36 +0000] "GET /product/123 HTTP/1.1" 200 1234 "http://example.com/category/electronics" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
192.168.1.2 - - [10/Jan/2025:13:56:42 +0000] "POST /cart/add HTTP/1.1" 302 0 "http://example.com/product/123" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
192.168.1.3 - - [10/Jan/2025:13:57:15 +0000] "GET /checkout HTTP/1.1" 200 2345 "http://example.com/cart" "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Mobile/15E148 Safari/604.1"
1.3 Solution
1.3.1 Creating the Log Tables
-- Create the raw log table
CREATE TABLE ods_access_logs (
remote_addr STRING, -- client IP
remote_user STRING, -- remote user
time_local STRING, -- local timestamp
request STRING, -- request line (method, URL, protocol)
status STRING, -- HTTP response status code
body_bytes_sent STRING, -- response body size in bytes
http_referer STRING, -- referring URL
http_user_agent STRING -- user agent
) PARTITIONED BY (dt STRING) -- partitioned by date
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' -- parse with a regular expression
WITH SERDEPROPERTIES (
-- group 1 = IP, skip the identity field, group 2 = user, group 3 = bracketed timestamp, then request/status/bytes/referer/agent
"input.regex" = "([^ ]*) [^ ]* ([^ ]*) (\\[[^\\]]*\\]) \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\""
)
STORED AS TEXTFILE;
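Once the table exists, each day's raw log file can be loaded into its date partition. A minimal sketch, assuming the file has already been placed on HDFS (the path below is illustrative):

```sql
-- Load one day's raw logs into the matching partition (the HDFS path is a placeholder)
LOAD DATA INPATH '/data/logs/access_2025-01-10.log'
INTO TABLE ods_access_logs
PARTITION (dt = '2025-01-10');
```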
-- Create the parsed log table (DW layer)
CREATE TABLE dw_access_logs (
remote_addr STRING,
time_local TIMESTAMP,
url STRING,
method STRING,
status INT,
referer STRING,
user_agent STRING,
ip_country STRING,
ip_province STRING,
browser STRING,
os STRING
) PARTITIONED BY (year INT, month INT, day INT) -- partitioned by year/month/day
STORED AS ORC;
1.3.2 Parsing the Log Data
-- Parse the logs and insert into the DW-layer table
INSERT INTO dw_access_logs PARTITION(year=2025, month=1, day=10)
SELECT
remote_addr,
FROM_UNIXTIME(UNIX_TIMESTAMP(time_local, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS time_local,
regexp_extract(request, '^([^ ]+) ([^ ]+)', 2) AS url,
regexp_extract(request, '^([^ ]+) ([^ ]+)', 1) AS method,
CAST(status AS INT) AS status,
http_referer AS referer,
http_user_agent AS user_agent,
-- IP geolocation (parse_ip_country / parse_ip_province would be custom UDFs)
parse_ip_country(remote_addr) AS ip_country,
parse_ip_province(remote_addr) AS ip_province,
-- browser and OS parsing (also custom UDFs)
parse_browser(http_user_agent) AS browser,
parse_os(http_user_agent) AS os
FROM ods_access_logs
WHERE dt = '2025-01-10';
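The INSERT above targets a single static partition. When backfilling many days at once, Hive's dynamic partitioning can derive the partition values from the data itself. A sketch under the same assumptions as above (parse_* remain hypothetical UDFs), deriving year/month/day from the ODS partition key:

```sql
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict; -- allow all partition columns to be dynamic

-- The trailing three SELECT expressions populate the year/month/day partitions per row
INSERT INTO dw_access_logs PARTITION (year, month, day)
SELECT
remote_addr,
FROM_UNIXTIME(UNIX_TIMESTAMP(time_local, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS time_local,
regexp_extract(request, '^([^ ]+) ([^ ]+)', 2) AS url,
regexp_extract(request, '^([^ ]+) ([^ ]+)', 1) AS method,
CAST(status AS INT) AS status,
http_referer AS referer,
http_user_agent AS user_agent,
parse_ip_country(remote_addr) AS ip_country,
parse_ip_province(remote_addr) AS ip_province,
parse_browser(http_user_agent) AS browser,
parse_os(http_user_agent) AS os,
-- derive partition values from the ODS partition key (dt is 'yyyy-MM-dd')
CAST(SUBSTR(dt, 1, 4) AS INT) AS year,
CAST(SUBSTR(dt, 6, 2) AS INT) AS month,
CAST(SUBSTR(dt, 9, 2) AS INT) AS day
FROM ods_access_logs
WHERE dt >= '2025-01-01' AND dt <= '2025-01-31';
```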
1.3.3 Log Analysis Queries
-- 1. Count PV and UV
SELECT
COUNT(*) AS pv,
COUNT(DISTINCT remote_addr) AS uv
FROM dw_access_logs
WHERE year=2025 AND month=1 AND day=10;
-- 2. Page views by hour
SELECT
HOUR(time_local) AS hour,
COUNT(*) AS pv
FROM dw_access_logs
WHERE year=2025 AND month=1 AND day=10
GROUP BY HOUR(time_local)
ORDER BY hour;
-- 3. Top 10 most-visited pages
SELECT
url,
COUNT(*) AS pv
FROM dw_access_logs
WHERE year=2025 AND month=1 AND day=10
GROUP BY url
ORDER BY pv DESC
LIMIT 10;
-- 4. Browser distribution
SELECT
browser,
COUNT(*) AS pv,
COUNT(DISTINCT remote_addr) AS uv
FROM dw_access_logs
WHERE year=2025 AND month=1 AND day=10
GROUP BY browser
ORDER BY pv DESC;
-- 5. Top 10 referring sites
SELECT
regexp_extract(referer, 'https?://([^/]+)', 1) AS referer_domain,
COUNT(*) AS pv
FROM dw_access_logs
WHERE year=2025 AND month=1 AND day=10
AND referer != '-' AND referer NOT LIKE '%example.com%' -- exclude direct visits and internal navigation
GROUP BY regexp_extract(referer, 'https?://([^/]+)', 1) -- Hive does not allow SELECT aliases in GROUP BY
ORDER BY pv DESC
LIMIT 10;
2. Data Warehouse Construction Case Study
2.1 Scenario
A retailer needs to build a data warehouse that integrates online and offline sales data to support sales analysis, inventory management, customer profiling, and similar functions.
2.2 Data Warehouse Architecture
The classic layered architecture is adopted:
- ODS layer: stores the raw sales and inventory data
- DW layer: holds the sales fact table and the dimension tables
- DM layer: holds the sales summary tables and customer-profile tables
2.3 Solution
2.3.1 Creating the ODS-Layer Tables
-- Create the online sales table
CREATE TABLE ods_online_sales (
order_id STRING,
user_id STRING,
product_id STRING,
quantity INT,
amount DECIMAL(10,2),
order_time STRING,
payment_method STRING,
delivery_address STRING
) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Create the offline sales table
CREATE TABLE ods_offline_sales (
transaction_id STRING,
store_id STRING,
product_id STRING,
quantity INT,
amount DECIMAL(10,2),
transaction_time STRING,
customer_id STRING
) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Create the inventory table
CREATE TABLE ods_inventory (
product_id STRING,
store_id STRING,
quantity INT,
update_time STRING
) PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
2.3.2 Creating the DW-Layer Tables
-- Create the product dimension table
CREATE TABLE dw_product_dim (
product_id STRING,
product_name STRING,
category_id STRING,
category_name STRING,
brand STRING,
price DECIMAL(10,2),
status STRING,
create_time TIMESTAMP
) STORED AS ORC;
-- Create the time dimension table
CREATE TABLE dw_time_dim (
time_id INT,
`date` DATE, -- DATE is a reserved word in recent Hive versions, so the column name is quoted
year INT,
quarter INT,
month INT,
day INT,
day_of_week INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN
) STORED AS ORC;
-- Create the sales fact table
CREATE TABLE dw_sales_fact (
order_id STRING,
time_id INT,
product_id STRING,
store_id STRING,
customer_id STRING,
sales_channel STRING, -- sales channel: online/offline
quantity INT,
amount DECIMAL(10,2),
payment_method STRING
) PARTITIONED BY (year INT, month INT) -- partitioned by year and month
CLUSTERED BY (product_id) INTO 8 BUCKETS -- bucketed by product ID
STORED AS ORC;
2.3.3 ETL Process
-- 1. Load the product dimension data
INSERT INTO dw_product_dim
SELECT
product_id,
product_name,
category_id,
category_name,
brand,
price,
status,
FROM_UNIXTIME(UNIX_TIMESTAMP(create_time, 'yyyy-MM-dd HH:mm:ss')) AS create_time
FROM ods_products
WHERE dt = '2025-01-01';
-- 2. Merge online and offline sales data into the fact table
INSERT INTO dw_sales_fact PARTITION(year=2025, month=1)
-- online sales data
SELECT
order_id,
t.time_id,
product_id,
'online' AS store_id, -- online orders have no physical store
user_id AS customer_id,
'online' AS sales_channel,
quantity,
amount,
payment_method
FROM ods_online_sales o
JOIN dw_time_dim t ON t.`date` = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(o.order_time, 'yyyy-MM-dd HH:mm:ss')))
WHERE o.dt = '2025-01-01'
UNION ALL
-- offline sales data
SELECT
transaction_id AS order_id,
t.time_id,
product_id,
store_id,
customer_id,
'offline' AS sales_channel,
quantity,
amount,
'cash' AS payment_method -- payment method is not captured offline; default to 'cash'
FROM ods_offline_sales o
JOIN dw_time_dim t ON t.`date` = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(o.transaction_time, 'yyyy-MM-dd HH:mm:ss')))
WHERE o.dt = '2025-01-01';
2.3.4 Creating the DM-Layer Tables
-- Create the sales summary table
CREATE TABLE dm_sales_summary (
product_id STRING,
product_name STRING,
category_name STRING,
year INT,
month INT,
total_sales DECIMAL(10,2),
total_quantity INT,
avg_price DECIMAL(10,2),
online_sales DECIMAL(10,2),
offline_sales DECIMAL(10,2)
) PARTITIONED BY (dt STRING)
STORED AS ORC;
-- Insert the sales summary data
INSERT INTO dm_sales_summary PARTITION(dt='2025-01-02')
SELECT
p.product_id,
p.product_name,
p.category_name,
s.year,
s.month,
SUM(s.amount) AS total_sales,
SUM(s.quantity) AS total_quantity,
AVG(p.price) AS avg_price,
SUM(CASE WHEN s.sales_channel = 'online' THEN s.amount ELSE 0 END) AS online_sales,
SUM(CASE WHEN s.sales_channel = 'offline' THEN s.amount ELSE 0 END) AS offline_sales
FROM dw_sales_fact s
JOIN dw_product_dim p ON s.product_id = p.product_id
WHERE s.year = 2025 AND s.month = 1
GROUP BY p.product_id, p.product_name, p.category_name, s.year, s.month;
3. Hive Best Practices Summary
3.1 Data Storage Best Practices
- Choose an appropriate file format:
  - Raw data: TextFile
  - Intermediate data: ORC/Parquet (columnar storage, high compression ratio)
  - Frequently queried data: ORC (better query performance)
- Design partitions sensibly:
  - Partition by time (year/month/day)
  - Partition by business dimension (e.g., region or product category)
  - Keep the number of partitions under control to avoid partition explosion
- Bucket where it helps:
  - Bucket on columns that are frequently joined
  - A common recommendation is 2-4 buckets per cluster node
- Enable compression:
  - Intermediate results: Snappy (fast compression and decompression)
  - Final results: ORC's default compression or Zstandard (higher compression ratio)
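The compression recommendations above map to a handful of settings and table properties. A sketch (dw_example_orc is an illustrative table; ZSTD support requires a recent ORC/Hive release):

```sql
-- Compress intermediate map output with Snappy
SET hive.exec.compress.intermediate = true;
SET mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- Compress final job output
SET hive.exec.compress.output = true;

-- ORC table with an explicit codec (ZLIB is ORC's default; ZSTD needs a recent release)
CREATE TABLE dw_example_orc (
id STRING,
amount DECIMAL(10,2)
) STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZSTD');
```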
3.2 SQL Writing Best Practices
- Avoid full table scans: filter with WHERE clauses and take advantage of partition pruning
- Put the small table first when joining a small table to a big one, so Hive can apply the MapJoin optimization
- Use subqueries judiciously: avoid deep nesting and break complex queries into multiple steps
- Use CTEs (Common Table Expressions) to improve query readability and reusability
- Avoid wrapping filter columns in functions inside the WHERE clause; apply the function in the SELECT list or JOIN instead, so partition pruning still works
- Use LIMIT: when only a few rows are needed, restrict the number of rows returned
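Several of these points combine naturally in one query. A sketch against the warehouse tables from section 2, using a CTE, partition pruning, and a MapJoin hint (the hint is illustrative; recent Hive versions auto-convert joins against small tables):

```sql
WITH monthly_sales AS (
-- partition pruning: only the year=2025/month=1 partition is scanned
SELECT product_id, SUM(amount) AS total_amount
FROM dw_sales_fact
WHERE year = 2025 AND month = 1
GROUP BY product_id
)
-- the small dimension table is map-joined against the aggregated CTE
SELECT /*+ MAPJOIN(p) */ p.product_name, m.total_amount
FROM monthly_sales m
JOIN dw_product_dim p ON m.product_id = p.product_id
ORDER BY m.total_amount DESC
LIMIT 10;
```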
3.3 Performance Optimization Best Practices
- Enable vectorized query execution to improve throughput
- Enable parallel execution so independent stages can run concurrently
- Use local mode for small datasets to avoid cluster scheduling overhead
- Choose the right JOIN strategy:
  - MapJoin: a small table joined to a big table
  - Bucket Map Join: tables bucketed on the same join column
  - Sort Merge Bucket Join: tables that are both bucketed and sorted
- Use the CBO (Cost-Based Optimizer) so Hive chooses the best execution plan automatically
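The first three points and the CBO are enabled with session settings; the CBO also needs column statistics to make good decisions. A sketch:

```sql
-- Vectorized execution (ORC tables benefit most)
SET hive.vectorized.execution.enabled = true;

-- Run independent stages in parallel
SET hive.exec.parallel = true;

-- Let small jobs run locally instead of on the cluster
SET hive.exec.mode.local.auto = true;

-- Cost-based optimizer; gather column statistics so it has something to work with
SET hive.cbo.enable = true;
ANALYZE TABLE dw_sales_fact PARTITION (year=2025, month=1)
COMPUTE STATISTICS FOR COLUMNS;
```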
3.4 Data Quality Best Practices
- Data cleansing: remove dirty records, duplicates, and outliers
- Data validation: verify completeness and accuracy
- Metadata management: maintain a data dictionary and data lineage
- Data monitoring: check data quality on a schedule and set up alerting
- Data auditing: log data modifications and access
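Basic validation rules can be expressed directly in HiveQL. A sketch of two typical checks against the fact table from section 2 (thresholds and rules would depend on your business):

```sql
-- Completeness check: rows with missing keys or non-positive amounts
SELECT COUNT(*) AS bad_rows
FROM dw_sales_fact
WHERE year = 2025 AND month = 1
AND (order_id IS NULL OR product_id IS NULL OR amount <= 0);

-- Uniqueness check: duplicated order IDs
SELECT order_id, COUNT(*) AS cnt
FROM dw_sales_fact
WHERE year = 2025 AND month = 1
GROUP BY order_id
HAVING COUNT(*) > 1;
```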
4. Common Problems and Solutions
4.1 Data Skew
Problem: a few reducers receive far more data than the rest, so the job takes much longer than it should.
Solutions:
- Salt hot-spot key values
- Replace NULL keys with random values
- Use MapJoin to avoid the reduce stage entirely
- Adjust the number of reducers
- Enable the skew join optimization
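Two of these techniques as HiveQL sketches (big_table and dim_table are placeholder names):

```sql
-- Built-in skew join handling
SET hive.optimize.skewjoin = true;
SET hive.skewjoin.key = 100000; -- row threshold for a key to count as skewed

-- NULL keys all hash to one reducer; replacing them with distinct random
-- values spreads them out, and since they can never match a real key,
-- the LEFT JOIN result is unchanged
SELECT b.id, d.name
FROM big_table b
LEFT JOIN dim_table d
ON COALESCE(b.dim_key, CONCAT('null_', RAND())) = d.dim_key;
```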
4.2 Out-of-Memory Errors
Problem: the job fails with OOM (Out of Memory) errors during execution.
Solutions:
- Increase the memory allocated to map/reduce tasks
- Reduce the amount of data each task processes
- Rewrite the query so less data needs to be processed
- Use an appropriate file format and compression codec
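The first two remedies are session settings on MapReduce-based Hive. A sketch with illustrative values (heap sizes are conventionally ~80% of container memory):

```sql
-- Raise container memory for map and reduce tasks
SET mapreduce.map.memory.mb = 4096;
SET mapreduce.map.java.opts = -Xmx3276m;
SET mapreduce.reduce.memory.mb = 8192;
SET mapreduce.reduce.java.opts = -Xmx6553m;

-- Give each reducer less data, so more (smaller) reducers are launched
SET hive.exec.reducers.bytes.per.reducer = 134217728; -- 128 MB
```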
4.3 Slow Queries
Problem: queries take too long to run, degrading the user experience.
Solutions:
- Optimize data storage (partitioning, bucketing, compression)
- Optimize the SQL statements
- Choose an appropriate JOIN strategy
- Enable the query optimizer
- Use pre-aggregated tables or materialized views
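Before tuning, inspect the plan; and for hot aggregate queries, a materialized view can absorb the cost once. A sketch (materialized views need Hive 3+ and a transactional source table, so treat the second statement as illustrative):

```sql
-- Check that partition pruning and the expected join strategy actually apply
EXPLAIN
SELECT url, COUNT(*) AS pv
FROM dw_access_logs
WHERE year = 2025 AND month = 1 AND day = 10
GROUP BY url;

-- Pre-aggregate a hot query (requires Hive 3+ and ACID tables)
CREATE MATERIALIZED VIEW mv_daily_pv AS
SELECT year, month, day, COUNT(*) AS pv
FROM dw_access_logs
GROUP BY year, month, day;
```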
4.4 Data Consistency
Problem: data disagrees between tables, distorting analysis results.
Solutions:
- Establish strict data validation mechanisms
- Use transactions (Hive 0.14+ supports ACID)
- Apply version control to data
- Track data lineage
- Reconcile data across systems on a regular schedule
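A simple reconciliation query for the tables from section 2, assuming all of January has been loaded through the ETL above: the ODS total and the DW total should agree, and any difference flags dropped or duplicated rows.

```sql
-- ODS source totals vs. DW fact totals for January 2025
SELECT 'ods' AS layer, SUM(amount) AS total
FROM (
SELECT amount FROM ods_online_sales WHERE dt LIKE '2025-01-%'
UNION ALL
SELECT amount FROM ods_offline_sales WHERE dt LIKE '2025-01-%'
) src
UNION ALL
SELECT 'dw' AS layer, SUM(amount) AS total
FROM dw_sales_fact
WHERE year = 2025 AND month = 1;
```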
5. Summary
Hive is a key data warehousing tool in the big data ecosystem: it lets you process large-scale data with plain SQL, which greatly lowers the barrier to big data analysis. In practice, keep the following points in mind:
- Design the warehouse architecture sensibly: a layered design improves maintainability and reusability
- Optimize data storage: choose appropriate file formats, partitioning strategies, and compression
- Write efficient SQL: avoid the common performance pitfalls
- Monitor and tune job performance: find and fix performance problems promptly
- Take data quality seriously: build a thorough data quality management process
- Keep learning and practicing: follow Hive's latest features and best practices
Having worked through this tutorial, you should understand Hive's core concepts, architecture, and usage, and be able to build a data warehouse and analyze data with it. Apply what you have learned flexibly to your own business scenarios and data volumes, and keep refining how you use Hive.
Hive evolves quickly, with new features and optimizations released continuously, so keep an eye on the official Hive documentation and the community to stay current.