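1. Data Warehouse Overview
1.1 What Is a Data Warehouse
A data warehouse (DW) is a subject-oriented, integrated, relatively stable (non-volatile) collection of data that reflects historical change and is used to support management decision-making.
1.2 Characteristics of a Data Warehouse
- Subject-oriented: a data warehouse organizes data around the enterprise's core business subjects, such as sales, customers, and products.
- Integrated: a data warehouse consolidates data from different data sources into a consistent whole.
- Stable (non-volatile): once loaded, data in the warehouse is normally not modified; it is used only for querying and analysis.
- Time-variant: a data warehouse stores historical data and supports time-series analysis.
- Supports decision analysis: the warehouse's data structures and storage layout are optimized for query and analysis performance.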
1.3 Data Warehouse vs. Database
| Feature | Data warehouse | Database |
|---|---|---|
| Design goal | Decision-support analysis | Transaction processing |
| Data model | Star schema, snowflake schema | Relational model, normalized design |
| Data updates | Batch loads, very few updates | Frequent inserts, updates, deletes, and reads |
| Data granularity | Summarized data, multiple granularities | Detailed data, single granularity |
| Data volume | Large scale, TB to PB | Medium scale, GB to TB |
| Query type | Complex queries, OLAP | Simple queries, OLTP |
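2. Data Warehouse Architecture
2.1 Classic Data Warehouse Architecture
A classic data warehouse architecture consists of the following layers:
- Data source layer: the enterprise's internal business systems, log systems, external data, and so on.
- Data integration layer: responsible for extracting, transforming, and loading data (ETL).
- Data storage layer: stores structured and semi-structured data.
- Data service layer: provides data access and query services.
- Data analysis layer: OLAP analysis, data mining, report generation, and so on.
- Data presentation layer: presents analysis results through dashboards, reports, and similar means.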
2.2 Hive-Based Data Warehouse Architecture
A Hive-based data warehouse architecture takes full advantage of the Hadoop ecosystem:
- Data source layer: business systems, logs, sensor data, and so on.
- Data integration layer: tools such as Sqoop, Flume, and Kafka for data collection and integration.
- Data storage layer:
  - Raw data layer (ODS): stores raw data, preserving it as received.
  - Data warehouse layer (DW): stores cleansed and integrated data, modeled as a star or snowflake schema.
  - Data mart layer (DM): data sets for specific business subjects.
  - Data application layer (DA): data sets for specific applications.
- Data service layer: query engines such as Hive, Spark SQL, and Presto.
- Data analysis layer: machine learning libraries such as Spark MLlib and Hivemall.
- Data presentation layer: visualization tools such as Tableau, Power BI, and Superset.
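3. Data Warehouse Model Design
3.1 Conceptual Model Design
Conceptual model design is the first step of data warehouse design. Its main purpose is to establish the warehouse's subjects and scope:
- Identify business subjects: derive the warehouse's subjects from the enterprise's core business, such as sales, customers, and products.
- Define entity relationships: determine the relationships and dependencies between subjects.
- Choose the data granularity: decide the level of detail, such as daily or monthly granularity.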
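3.2 Logical Model Design
Logical model design is the core of data warehouse design. It mainly involves the following two models:
3.2.1 Star Schema
The star schema is the most commonly used model in data warehousing. It consists of one fact table and several dimension tables:
- Fact table: stores business measures such as sales amount and order quantity.
- Dimension tables: store descriptive data such as time, product, and customer.
- Characteristics: simple structure, good query performance, well suited to OLAP analysis.
Star schema example: sales analysis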
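Fact table: sales_fact
sales_fact (
    order_id INT,                -- order ID
    time_id INT,                 -- foreign key to the time dimension
    product_id INT,              -- foreign key to the product dimension
    customer_id INT,             -- foreign key to the customer dimension
    store_id INT,                -- foreign key to the store dimension
    sales_amount DECIMAL(10,2),  -- sales amount
    order_quantity INT           -- order quantity
)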
Dimension tables: time_dim, product_dim, customer_dim, store_dim
-- Time dimension table
time_dim (
    time_id INT,         -- time ID
    date DATE,           -- calendar date
    year INT,            -- year
    quarter INT,         -- quarter
    month INT,           -- month
    day INT,             -- day of the month
    day_of_week INT,     -- day of the week
    is_weekend BOOLEAN   -- whether the day falls on a weekend
)
-- Product dimension table
product_dim (
    product_id INT,       -- product ID
    product_name STRING,  -- product name
    category STRING,      -- product category
    brand STRING,         -- brand
    price DECIMAL(10,2)   -- price
)
-- Customer dimension table
customer_dim (
    customer_id INT,       -- customer ID
    customer_name STRING,  -- customer name
    gender STRING,         -- gender
    age INT,               -- age
    city STRING,           -- city
    region STRING          -- region
)
-- Store dimension table
store_dim (
    store_id INT,       -- store ID
    store_name STRING,  -- store name
    city STRING,        -- city
    region STRING       -- region
)
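To make the "one fact table, several dimension tables" shape concrete, here is a minimal HiveQL sketch of a typical star-join query over the example tables above. It assumes those tables exist with exactly the columns listed; the filter value 2025 is purely illustrative.

```sql
-- Monthly sales by product category: the fact table supplies the measures,
-- and each dimension is reached through a single join on its key.
SELECT
    t.year,
    t.month,
    p.category,
    SUM(f.sales_amount)   AS total_sales,
    SUM(f.order_quantity) AS total_quantity
FROM sales_fact f
JOIN time_dim    t ON f.time_id    = t.time_id
JOIN product_dim p ON f.product_id = p.product_id
-- Illustrative filter value, not taken from real data.
WHERE t.year = 2025
GROUP BY t.year, t.month, p.category
ORDER BY t.year, t.month, total_sales DESC;
```

Every dimension is one join away from the fact table, which is what keeps star-schema queries short and predictable for OLAP-style aggregation.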
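3.2.2 Snowflake Schema
The snowflake schema is an extension of the star schema in which dimension tables can be further linked to other dimension tables:
- Characteristics: more complex structure, less data redundancy, higher maintenance cost.
- When to use: the dimension tables have a clear hierarchical relationship and data redundancy needs to be reduced.
Snowflake schema example: sales analysis
In the snowflake schema, the region dimension can be split further into levels such as country, province, and city: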
-- Country dimension table
country_dim (
    country_id INT,       -- country ID
    country_name STRING   -- country name
)
-- Province dimension table
province_dim (
    province_id INT,       -- province ID
    province_name STRING,  -- province name
    country_id INT         -- country ID (foreign key)
)
-- City dimension table
city_dim (
    city_id INT,       -- city ID
    city_name STRING,  -- city name
    province_id INT    -- province ID (foreign key)
)
-- Customer dimension table (snowflake schema)
customer_dim (
    customer_id INT,       -- customer ID
    customer_name STRING,  -- customer name
    gender STRING,         -- gender
    age INT,               -- age
    city_id INT,           -- city ID (foreign key)
    region STRING          -- region
)
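The cost of the reduced redundancy shows up at query time. The following HiveQL sketch, which assumes the snowflake tables above together with the sales_fact table from the star schema example, walks the customer → city → province → country chain to total sales by province:

```sql
-- Sales by country and province in the snowflake model.
-- Reaching the country level takes four joins instead of one.
SELECT
    co.country_name,
    pr.province_name,
    SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN customer_dim c  ON f.customer_id  = c.customer_id
JOIN city_dim     ci ON c.city_id      = ci.city_id
JOIN province_dim pr ON ci.province_id = pr.province_id
JOIN country_dim  co ON pr.country_id  = co.country_id
GROUP BY co.country_name, pr.province_name;
```

In the star schema version the same question would need only the join to customer_dim, which illustrates the trade-off between redundancy and query complexity described above.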
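3.3 Physical Model Design
Physical model design turns the logical model into actual storage structures:
- Choose a storage format: columnar formats such as ORC and Parquet.
- Design the partitioning strategy: partition by dimensions such as time or region.
- Design the bucketing strategy: bucket on columns that are frequently used in JOINs.
- Design indexes: create indexes on frequently queried columns. Note that Hive removed its own index feature in version 3.0; on current versions this role falls to the built-in indexes of ORC and Parquet files and to materialized views.
- Choose a compression codec: algorithms such as Snappy and Zstandard.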
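The choices above map directly onto clauses of a Hive CREATE TABLE statement. The sketch below is one possible combination, not a recommendation: the table name sales_fact_orc, the bucket count of 16, and the choice of customer_id as the bucketing column are all assumptions made for illustration.

```sql
-- Hypothetical table combining format, partitioning, bucketing and compression.
CREATE TABLE sales_fact_orc (
    order_id       INT,
    product_id     INT,
    customer_id    INT,
    sales_amount   DECIMAL(10,2),
    order_quantity INT
)
-- Partition by load date so that date filters skip whole directories.
PARTITIONED BY (dt STRING)
-- Bucket on a column that is frequently used as a join key.
CLUSTERED BY (customer_id) INTO 16 BUCKETS
-- Columnar storage with Snappy compression.
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- A filter on the partition column lets Hive read only the matching partition.
SELECT SUM(sales_amount) AS daily_sales
FROM sales_fact_orc
WHERE dt = '2025-01-01';
```

Partitioning pays off only when queries actually filter on the partition column, so the partition key should follow the dominant access pattern.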
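4. Layered Data Warehouse Design
4.1 Advantages of a Layered Design
- Data isolation: data in different layers is kept separate, which simplifies maintenance and management.
- Better reusability: upper layers reuse the data prepared by lower layers, reducing duplicated development.
- Easier debugging and troubleshooting: data quality and problems can be checked layer by layer.
- Support for multiple kinds of data applications: data in different layers serves different application scenarios.
- Better query performance: precomputation and aggregation improve query efficiency.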
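4.2 A Typical Layered Design
4.2.1 Raw Data Layer (ODS)
The raw data layer stores data extracted directly from the data sources, preserving its original format and content:
- Data characteristics: raw and unprocessed; may contain dirty data.
- Storage: usually stored in the same format as the data source.
- Retention: raw data is kept for a period determined by business requirements.
4.2.2 Data Warehouse Layer (DW)
The data warehouse layer stores data that has been cleansed, integrated, and transformed, modeled as a star or snowflake schema:
- Data characteristics: clean, integrated, structured.
- Design model: star schema or snowflake schema.
- Data granularity: fine-grained, to support many kinds of analysis.
4.2.3 Data Mart Layer (DM)
The data mart layer holds data sets for specific business subjects, providing more focused data:
- Data characteristics: oriented to a specific business subject; summarized data.
- Design model: usually a star schema.
- Data granularity: coarser-grained, optimized for specific analysis needs.
4.2.4 Data Application Layer (DA)
The data application layer stores data tailored to specific applications:
- Data characteristics: highly customized and optimized for a specific application.
- Design model: designed according to the application's requirements.
- Data granularity: determined by the application's requirements.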
5. Building a Data Warehouse with Hive
5.1 Steps for Designing a Hive Data Warehouse
- Requirements analysis: understand the business requirements and the analysis requirements.
- Conceptual model design: identify business subjects and entity relationships.
- Logical model design: design the star or snowflake schema.
- Physical model design: choose the storage format, partitioning strategy, and so on.
- ETL design: design the data extraction, transformation, and loading workflow.
- Data loading: load the data into the warehouse.
- Query optimization: tune query performance.
- Maintenance and management: maintain and update the warehouse on a regular basis.
5.2 Creating the Layered Table Structure
5.2.1 Creating the ODS Layer Tables
-- Create the raw data layer (ODS) sales table
CREATE TABLE ods_sales (
    order_id INT,
    order_date STRING,
    product_id INT,
    customer_id INT,
    store_id INT,
    sales_amount DECIMAL(10,2),
    order_quantity INT
) PARTITIONED BY (dt STRING) -- partitioned by date
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Create the raw log table
-- (`timestamp` is backtick-quoted because TIMESTAMP is a reserved word in Hive)
CREATE TABLE ods_logs (
    ip STRING,
    user_id INT,
    action STRING,
    url STRING,
    `timestamp` BIGINT
) PARTITIONED BY (dt STRING) -- partitioned by date
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;
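Once the ODS tables exist, each day's data lands in its own dt partition. The following sketch shows one way to load a staged file and confirm the result; the HDFS path is hypothetical, and the file is assumed to be tab-delimited to match the ods_sales definition.

```sql
-- Move a staged file into the dt='2025-01-01' partition of ods_sales.
-- The path below is a placeholder, not a real location.
LOAD DATA INPATH '/staging/sales/2025-01-01/sales.tsv'
INTO TABLE ods_sales PARTITION (dt='2025-01-01');

-- List the partitions that now exist for the table.
SHOW PARTITIONS ods_sales;

-- Quick sanity check: how many rows arrived for that day?
SELECT COUNT(*) AS row_count
FROM ods_sales
WHERE dt = '2025-01-01';
```

LOAD DATA INPATH moves the file rather than copying it, so the staging location is empty afterwards.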
5.2.2 Creating the DW Layer Tables
-- Create the time dimension table
-- (`date` is backtick-quoted because DATE is a reserved word in Hive)
CREATE TABLE dw_time_dim (
    time_id INT,
    `date` DATE,
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week INT,
    is_weekend BOOLEAN
) STORED AS ORC;
-- Create the product dimension table
CREATE TABLE dw_product_dim (
    product_id INT,
    product_name STRING,
    category STRING,
    brand STRING,
    price DECIMAL(10,2)
) STORED AS ORC;
-- Create the customer dimension table
CREATE TABLE dw_customer_dim (
    customer_id INT,
    customer_name STRING,
    gender STRING,
    age INT,
    city STRING,
    region STRING
) STORED AS ORC;
-- Create the store dimension table
CREATE TABLE dw_store_dim (
    store_id INT,
    store_name STRING,
    city STRING,
    region STRING
) STORED AS ORC;
-- Create the sales fact table
CREATE TABLE dw_sales_fact (
    order_id INT,
    time_id INT,
    product_id INT,
    customer_id INT,
    store_id INT,
    sales_amount DECIMAL(10,2),
    order_quantity INT
) PARTITIONED BY (year INT, month INT) -- partitioned by year and month
CLUSTERED BY (product_id) INTO 8 BUCKETS -- bucketed by product ID
STORED AS ORC;
5.2.3 Creating the DM Layer Tables
-- Create the sales summary table (data mart)
CREATE TABLE dm_sales_summary (
    product_id INT,
    product_name STRING,
    category STRING,
    year INT,
    month INT,
    total_sales DECIMAL(10,2),
    total_quantity INT,
    avg_price DECIMAL(10,2)
) PARTITIONED BY (dt STRING) -- partitioned by date
STORED AS ORC;
5.3 ETL Workflow Design
ETL (Extract, Transform, Load) is the core workflow in building a data warehouse:
5.3.1 Data Extraction
Extract data from the data sources into the ODS layer:
# Use Sqoop to extract data from a relational database into Hive.
# The credentials are placeholders; prefer -P or --password-file over a plain-text password.
sqoop import \
  --connect jdbc:mysql://localhost:3306/sales_db \
  --username root \
  --password password \
  --table sales \
  --hive-import \
  --hive-table ods_sales \
  --hive-partition-key dt \
  --hive-partition-value "2025-01-01" \
  --fields-terminated-by '\t' \
  -m 1
# Log data can likewise be collected into Hive with Flume.
5.3.2 Data Transformation
Transform ODS layer data into DW layer data:
-- Transform sales data into the fact table
-- (`date` is backtick-quoted because DATE is a reserved word in Hive)
INSERT INTO dw_sales_fact PARTITION(year=2025, month=1)
SELECT
    s.order_id,
    t.time_id,
    s.product_id,
    s.customer_id,
    s.store_id,
    s.sales_amount,
    s.order_quantity
FROM ods_sales s
JOIN dw_time_dim t ON t.`date` = TO_DATE(s.order_date)
WHERE s.dt = '2025-01-01';
-- Aggregate the fact data into the sales summary table
INSERT INTO dm_sales_summary PARTITION(dt='2025-01-02')
SELECT
    p.product_id,
    p.product_name,
    p.category,
    t.year,
    t.month,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.order_quantity) AS total_quantity,
    AVG(p.price) AS avg_price
FROM dw_sales_fact f
JOIN dw_product_dim p ON f.product_id = p.product_id
JOIN dw_time_dim t ON f.time_id = t.time_id
WHERE f.year = 2025 AND f.month = 1
GROUP BY p.product_id, p.product_name, p.category, t.year, t.month;
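The fact-table load above hard-codes its target partition (year=2025, month=1). When a batch spans several months, Hive's dynamic partitioning can derive the partition from the data instead. The sketch below is one way to express that, assuming the dw_time_dim table carries matching year and month values; `date` is backtick-quoted because DATE is a reserved word in recent Hive versions.

```sql
-- Allow Hive to create partitions from values in the SELECT output.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition columns (year, month) are not given values here;
-- Hive takes them from the last two columns of the SELECT list, in order.
INSERT INTO TABLE dw_sales_fact PARTITION (year, month)
SELECT
    s.order_id,
    t.time_id,
    s.product_id,
    s.customer_id,
    s.store_id,
    s.sales_amount,
    s.order_quantity,
    t.year,
    t.month
FROM ods_sales s
JOIN dw_time_dim t ON t.`date` = TO_DATE(s.order_date)
WHERE s.dt = '2025-01-01';
```

The static form remains the safer choice for a daily job that is known to touch a single partition, since a bad date in the source cannot scatter rows across unexpected partitions.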
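6. Data Warehouse Best Practices
6.1 Data Model Design Best Practices
- Prefer the star schema: it has a simple structure, good query performance, and low maintenance cost.
- Design dimension tables sensibly:
  - Dimension tables should contain descriptive attributes.
  - Dimension tables should have a unique primary key.
  - Dimension tables should keep historical versions to support slowly changing dimensions (SCD).
- Design fact tables sensibly:
  - Fact tables should contain business measures.
  - Fact tables should contain foreign keys that reference the dimension tables.
  - Fact tables should be partitioned by time or a similar dimension.
- Support slowly changing dimensions (SCD):
  - Type 1: overwrite in place; no history is kept.
  - Type 2: add a new record; full history is kept.
  - Type 3: add a new column; partial history is kept.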
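As an illustration of SCD Type 2, the sketch below keeps one row per version of a customer. Everything beyond the columns of dw_customer_dim is an assumption made for this example: the table name dw_customer_dim_scd2, the surrogate key customer_key, the validity columns, and the convention that the current row has effective_to set to 9999-12-31.

```sql
-- Hypothetical Type 2 customer dimension: one row per version of a customer.
CREATE TABLE dw_customer_dim_scd2 (
    customer_key   BIGINT,
    customer_id    INT,
    customer_name  STRING,
    city           STRING,
    region         STRING,
    effective_from DATE,
    effective_to   DATE,
    is_current     BOOLEAN
) STORED AS ORC;

-- Current state of every customer.
SELECT customer_id, customer_name, city, region
FROM dw_customer_dim_scd2
WHERE is_current = TRUE;

-- Sales by the region each customer belonged to at the time of the order.
-- The validity range [effective_from, effective_to) selects the matching version.
SELECT c.region, SUM(f.sales_amount) AS total_sales
FROM dw_sales_fact f
JOIN dw_time_dim t ON f.time_id = t.time_id
JOIN dw_customer_dim_scd2 c ON f.customer_id = c.customer_id
WHERE t.`date` >= c.effective_from
  AND t.`date` <  c.effective_to
GROUP BY c.region;
```

A Type 1 dimension would answer the second query with each customer's current region only, which is the practical difference between overwriting and versioning.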
6.2 Data Quality Best Practices
- Data cleansing: remove dirty data, duplicates, and outliers.
- Data validation: verify the integrity, accuracy, and consistency of the data.
- Data monitoring: monitor data quality regularly so that problems are found and resolved promptly.
- Data lineage management: track where data comes from and where it flows.
- Metadata management: maintain the definitions, structure, and relationships of the data.
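Several of these checks can be written as plain queries and run after each load. The sketch below uses the ods_sales, dw_sales_fact, and dw_product_dim tables defined earlier; the specific rules (for example, treating negative amounts as invalid) are assumptions that would need to match the actual business rules.

```sql
-- 1. Duplicates: order IDs that appear more than once in a daily partition.
SELECT order_id, COUNT(*) AS cnt
FROM ods_sales
WHERE dt = '2025-01-01'
GROUP BY order_id
HAVING COUNT(*) > 1;

-- 2. Validity: rows with missing keys or implausible measures.
SELECT COUNT(*) AS bad_rows
FROM ods_sales
WHERE dt = '2025-01-01'
  AND (order_id IS NULL
       OR product_id IS NULL
       OR sales_amount < 0
       OR order_quantity <= 0);

-- 3. Consistency: fact rows whose product is missing from the dimension.
SELECT COUNT(*) AS orphan_rows
FROM dw_sales_fact f
LEFT JOIN dw_product_dim p ON f.product_id = p.product_id
WHERE f.year = 2025 AND f.month = 1
  AND p.product_id IS NULL;
```

Each query returns nothing, or zero, when the data is clean, so the results are straightforward to feed into a scheduled monitoring job.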
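6.3 Performance Optimization Best Practices
- Use columnar storage formats such as ORC and Parquet.
- Design partitions and buckets sensibly: partition on the columns that queries filter on most often, and bucket on columns that are frequently used in JOINs.
- Use pre-aggregated tables: compute summaries ahead of time to reduce the work done at query time.
- Use materialized views: consider a materialized view for views that are queried frequently.
- Optimize query statements: avoid full table scans and use an appropriate JOIN type.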
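Two of these points can be sketched with the tables from section 5. The first query filters on the partition columns of dw_sales_fact so that only one partition is read, and EXPLAIN shows the plan Hive would use. The second reads the pre-aggregated dm_sales_summary table instead; it assumes the dt='2025-01-02' partition was built from the January 2025 fact data, as in the ETL example.

```sql
-- Let Hive convert joins against small tables into map-side joins.
SET hive.auto.convert.join=true;

-- Inspect the plan: the filter on year and month restricts the scan
-- to a single partition of the fact table.
EXPLAIN
SELECT p.category, SUM(f.sales_amount) AS total_sales
FROM dw_sales_fact f
JOIN dw_product_dim p ON f.product_id = p.product_id
WHERE f.year = 2025 AND f.month = 1
GROUP BY p.category;

-- The same question asked of the pre-aggregated table: no join is needed
-- and far fewer rows are read.
SELECT category, SUM(total_sales) AS total_sales
FROM dm_sales_summary
WHERE dt = '2025-01-02' AND year = 2025 AND month = 1
GROUP BY category;
```

Whether the second form is acceptable depends on how fresh the summary needs to be, since it reflects the data as of its last rebuild.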
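6.4 Data Management Best Practices
- Establish a data lifecycle policy: archive and purge expired data on a regular basis.
- Establish a data security policy: control data access permissions and protect sensitive data.
- Establish a backup and recovery policy: back up data regularly and make sure it can be restored.
- Establish a versioning policy: manage versions of the data models and ETL workflows.
- Establish a data governance framework: ensure data quality, security, and compliance.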
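Partitioned tables make the lifecycle policy simple to apply, because expired data can be dropped one partition at a time. The sketch below assumes a retention cutoff of 2024-01-01 and a role named analyst; both are hypothetical, and the GRANT statement further assumes that SQL standard based authorization is enabled and that the role already exists.

```sql
-- See which partitions currently exist.
SHOW PARTITIONS ods_sales;

-- Drop every raw-data partition older than the assumed retention cutoff.
-- For a managed table this removes the underlying data as well.
ALTER TABLE ods_sales DROP IF EXISTS PARTITION (dt < '2024-01-01');

-- Give a read-only role access to the data mart table only.
GRANT SELECT ON TABLE dm_sales_summary TO ROLE analyst;
```

Granting access at the DM layer rather than on the raw ODS tables keeps unprocessed and potentially sensitive data out of general reach.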
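7. Data Warehouse Case Studies
7.1 E-Commerce Data Warehouse
An e-commerce data warehouse typically covers the following subjects:
- Sales analysis: sales amount, order count, average order value, and so on.
- User analysis: user activity, retention rate, user behavior, and so on.
- Product analysis: product sales rankings, inventory levels, conversion rate, and so on.
- Marketing analysis: campaign effectiveness, coupon usage, and so on.
- Supply chain analysis: inventory turnover, logistics efficiency, and so on.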
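As one example from the sales subject, the three headline metrics can be computed from the dw_sales_fact table defined in section 5. This is a sketch under the assumption that order_id identifies an order and may repeat across the line items of that order.

```sql
-- Monthly sales amount, order count and average order value.
SELECT
    f.year,
    f.month,
    SUM(f.sales_amount)        AS total_sales,
    COUNT(DISTINCT f.order_id) AS order_count,
    SUM(f.sales_amount) / COUNT(DISTINCT f.order_id) AS avg_order_value
FROM dw_sales_fact f
-- Filtering on the partition column limits the scan to one year.
WHERE f.year = 2025
GROUP BY f.year, f.month;
```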
7.2 Log Analysis Data Warehouse
A log analysis data warehouse typically covers the following subjects:
- Traffic analysis: PV, UV, visit duration, and so on.
- Behavior analysis: user click paths, page conversion rate, and so on.
- Performance analysis: page load time, server response time, and so on.
- Error analysis: error rate, distribution of error types, and so on.
- Source analysis: traffic sources, channel effectiveness, and so on.
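PV and UV can be derived directly from the ods_logs table defined in section 5. The sketch assumes that each log row represents one page request and that user_id identifies a visitor; the date range is illustrative.

```sql
-- Daily page views (PV) and unique visitors (UV) for one week.
SELECT
    dt,
    COUNT(*)                AS pv,
    COUNT(DISTINCT user_id) AS uv
FROM ods_logs
WHERE dt BETWEEN '2025-01-01' AND '2025-01-07'
GROUP BY dt;
```

COUNT(DISTINCT ...) ignores NULLs, so requests without a user_id contribute to PV but not to UV.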
8. Summary
Designing a data warehouse is a complex process that has to take business requirements, data models, storage structures, performance optimization, and more into account. A Hive-based design takes full advantage of the Hadoop ecosystem: it can process large-scale data and supports complex queries and analysis.
When designing a data warehouse, follow these principles:
- Understand the business requirements and define the warehouse's goals and scope.
- Use a layered design to improve the maintainability and reusability of the data.
- Prefer the star schema to improve query performance.
- Design partitioning and bucketing strategies sensibly to optimize storage and query performance.
- Take data quality seriously and put data quality monitoring and management in place.
- Keep optimizing, adjusting the design as business requirements and data volumes change.
With sound design and optimization, a Hive-based data warehouse can give an enterprise strong big data analysis and decision-support capabilities, helping it extract value from its data and become more competitive.