Hive Data Warehouse Design

Learn the principles of data warehouse design, including the star schema and the snowflake schema, and how to build a data warehouse with Hive.

1. Data Warehouse Overview

1.1 What Is a Data Warehouse

A data warehouse (DW) is a subject-oriented, integrated, relatively stable collection of data that reflects historical change and is used to support management decision-making.

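1.2 Characteristics of a Data Warehouse

  • Subject-oriented: a data warehouse organizes data around the core business subjects of the enterprise, such as sales, customers, and products
  • Integrated: a data warehouse consolidates data from different data sources into a consistent whole
  • Stable: once data is loaded into the warehouse, it is usually not modified; it is used only for querying and analysis
  • Reflects historical change: a data warehouse stores historical data and supports time-series analysis
  • Supports decision analysis: its data structures and storage layout are optimized for query and analysis performance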

1.3 Data Warehouse vs. Database

Feature         Data warehouse                      Database
Design goal     Support decision analysis           Support transaction processing
Data model      Star schema, snowflake schema       Relational model, normalized design
Data updates    Batch loads, very few updates       Frequent inserts, deletes, updates, and queries
Granularity     Summarized data, multiple grains    Detailed data, single grain
Data volume     Large scale, TB to PB               Medium scale, GB to TB
Query type      Complex queries, OLAP               Simple queries, OLTP

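2. Data Warehouse Architecture

2.1 Classic Data Warehouse Architecture

The classic data warehouse architecture consists of the following layers:

  1. Data source layer: the enterprise's internal business systems, log systems, external data, and so on
  2. Data integration layer: responsible for extracting, transforming, and loading data (ETL)
  3. Data storage layer: stores structured and semi-structured data
  4. Data service layer: provides data access and query services
  5. Data analysis layer: OLAP analysis, data mining, report generation, and so on
  6. Data presentation layer: presents analysis results through dashboards, reports, and similar means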

2.2 Hive-Based Data Warehouse Architecture

A Hive-based data warehouse architecture takes full advantage of the Hadoop ecosystem:

  • Data source layer: business systems, logs, sensor data, and so on
  • Data integration layer: tools such as Sqoop, Flume, and Kafka for data collection and integration
  • Data storage layer:
    • Raw data layer (ODS): stores raw data and keeps it in its original form
    • Data warehouse layer (DW): stores cleansed and integrated data, modeled as a star or snowflake schema
    • Data mart layer (DM): data sets built for specific business subjects
    • Data application layer (DA): data sets built for specific applications
  • Data service layer: query engines such as Hive, Spark SQL, and Presto
  • Data analysis layer: machine learning libraries such as Spark MLlib and Hivemall
  • Data presentation layer: visualization tools such as Tableau, Power BI, and Superset
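
One way to make the storage layers explicit in Hive is to give each layer its own database. The sketch below is only an illustration: the database names are assumptions, and the table examples later in this tutorial use name prefixes (ods_, dw_, dm_) in a single database instead.

```sql
-- Hypothetical layout: one Hive database per storage layer.
CREATE DATABASE IF NOT EXISTS ods COMMENT 'Raw data layer: data as extracted from the sources';
CREATE DATABASE IF NOT EXISTS dw  COMMENT 'Warehouse layer: cleansed, integrated fact and dimension tables';
CREATE DATABASE IF NOT EXISTS dm  COMMENT 'Data mart layer: subject-specific summaries';
CREATE DATABASE IF NOT EXISTS da  COMMENT 'Application layer: application-specific data sets';

-- List the databases to confirm that the layers exist.
SHOW DATABASES;
```

Separate databases make it easier to grant permissions per layer; prefixes keep everything in one namespace. Either convention works as long as it is applied consistently.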

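3. Data Warehouse Model Design

3.1 Conceptual Model Design

Conceptual model design is the first step of data warehouse design. Its main purpose is to determine the subjects and scope of the data warehouse:

  • Identify business subjects: derive the warehouse subjects from the enterprise's core business, such as sales, customers, and products
  • Define entity relationships: determine the relationships and dependencies between subjects
  • Decide on data granularity: determine the level of detail of the data, such as daily or monthly grain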

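3.2 Logical Model Design

Logical model design is the core of data warehouse design. It mainly uses the following two models:

3.2.1 Star Schema

The star schema is the most commonly used model in data warehouses. It consists of one fact table and several dimension tables:

  • Fact table: stores business measures, such as sales amount and order quantity
  • Dimension tables: store descriptive data, such as time, product, and customer
  • Characteristics: simple structure, good query performance, well suited to OLAP analysis

Star schema example: sales analysis

Fact table: sales_fact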

sales_fact (
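    order_id INT,         -- order ID
    time_id INT,          -- foreign key to the time dimension
    product_id INT,       -- foreign key to the product dimension
    customer_id INT,      -- foreign key to the customer dimension
    store_id INT,         -- foreign key to the store dimension
    sales_amount DECIMAL(10,2),  -- sales amount
    order_quantity INT    -- order quantity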
)

Dimension tables: time_dim, product_dim, customer_dim, store_dim

-- Time dimension table
time_dim (
    time_id INT,          -- time ID
    date DATE,            -- calendar date
    year INT,             -- year
    quarter INT,          -- quarter
    month INT,            -- month
    day INT,              -- day of the month
    day_of_week INT,      -- day of the week
    is_weekend BOOLEAN    -- whether the day falls on a weekend
)

-- Product dimension table
product_dim (
    product_id INT,       -- product ID
    product_name STRING,  -- product name
    category STRING,      -- product category
    brand STRING,         -- brand
    price DECIMAL(10,2)   -- price
)

-- Customer dimension table
customer_dim (
    customer_id INT,      -- customer ID
    customer_name STRING, -- customer name
    gender STRING,        -- gender
    age INT,              -- age
    city STRING,          -- city
    region STRING         -- region
)

-- Store dimension table
store_dim (
    store_id INT,         -- store ID
    store_name STRING,    -- store name
    city STRING,          -- city
    region STRING         -- region
)
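
To show why the star schema is convenient for OLAP, here is a minimal query sketch against the tables above. It assumes the schema has been created as Hive tables with exactly these names and columns; the filter on 2025 is only an illustrative value.

```sql
-- Monthly sales by product category and store region.
-- Every dimension is reached with a single join from the fact table.
SELECT
    t.year,
    t.month,
    p.category,
    s.region,
    SUM(f.sales_amount)   AS total_sales,
    SUM(f.order_quantity) AS total_quantity
FROM sales_fact f
JOIN time_dim    t ON f.time_id    = t.time_id
JOIN product_dim p ON f.product_id = p.product_id
JOIN store_dim   s ON f.store_id   = s.store_id
WHERE t.year = 2025
GROUP BY t.year, t.month, p.category, s.region
ORDER BY total_sales DESC;
```

Because each dimension is only one join away, adding or removing a grouping attribute usually means touching a single JOIN and the GROUP BY list.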

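3.2.2 Snowflake Schema

The snowflake schema is an extension of the star schema in which dimension tables can themselves reference other dimension tables:

  • Characteristics: more complex structure, less data redundancy, higher maintenance cost
  • When to use: there is a clear hierarchy among the dimension tables and data redundancy needs to be reduced

Snowflake schema example: sales analysis

In the snowflake schema, the region dimension can be split further into levels such as country, province, and city:

-- Country dimension table
country_dim (
    country_id INT,       -- country ID
    country_name STRING   -- country name
)

-- Province dimension table
province_dim (
    province_id INT,      -- province ID
    province_name STRING, -- province name
    country_id INT        -- country ID (foreign key)
)

-- City dimension table
city_dim (
    city_id INT,          -- city ID
    city_name STRING,     -- city name
    province_id INT       -- province ID (foreign key)
)

-- Customer dimension table (snowflake schema)
customer_dim (
    customer_id INT,      -- customer ID
    customer_name STRING, -- customer name
    gender STRING,        -- gender
    age INT,              -- age
    city_id INT,          -- city ID (foreign key)
    region STRING         -- region
)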
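
The cost of the normalized hierarchy shows up at query time. The sketch below assumes the snowflake tables above plus the sales_fact table from the star schema example, all created as Hive tables with the same names; it rolls sales up to the province and country level.

```sql
-- Sales by country and province in the snowflake schema.
-- The customer's geography is resolved through three extra joins
-- (city -> province -> country) instead of reading columns from customer_dim.
SELECT
    co.country_name,
    pr.province_name,
    SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN customer_dim c  ON f.customer_id  = c.customer_id
JOIN city_dim     ci ON c.city_id      = ci.city_id
JOIN province_dim pr ON ci.province_id = pr.province_id
JOIN country_dim  co ON pr.country_id  = co.country_id
GROUP BY co.country_name, pr.province_name;
```

The same report on the star schema needs only the join to customer_dim, which is the trade-off between redundancy and query simplicity described above.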

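3.3 Physical Model Design

Physical model design turns the logical model into actual storage structures:

  • Choose a storage format: columnar formats such as ORC and Parquet
  • Design a partitioning strategy: partition by dimensions such as time or region
  • Design a bucketing strategy: bucket on columns that are frequently used in JOINs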
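  • Design indexes: create indexes on frequently queried columns (Hive removed its own index feature in version 3.0; on newer versions, the built-in indexes of columnar formats and materialized views take over this role)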
  • Choose a compression codec: algorithms such as Snappy and Zstandard
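
Several of these decisions end up in a single CREATE TABLE statement. The following is a minimal sketch, not a recommended configuration: the table name, the bucket count of 16, and the choice of customer_id as the bucketing column are assumptions made for illustration.

```sql
-- Hypothetical fact table that combines the physical design choices above.
CREATE TABLE IF NOT EXISTS sales_fact_physical (
    order_id       INT,
    product_id     INT,
    customer_id    INT,
    sales_amount   DECIMAL(10,2),
    order_quantity INT
)
PARTITIONED BY (dt STRING)                   -- partitioning: one partition per day
CLUSTERED BY (customer_id) INTO 16 BUCKETS   -- bucketing: on a frequent join key
STORED AS ORC                                -- storage format: columnar
TBLPROPERTIES ('orc.compress' = 'SNAPPY');   -- compression: Snappy instead of the ORC default (ZLIB)

-- Inspect the resulting storage settings.
DESCRIBE FORMATTED sales_fact_physical;
```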

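4. Layered Data Warehouse Design

4.1 Advantages of a Layered Design

  • Data isolation: data in different layers is kept separate, which simplifies maintenance and management
  • Better reusability: upper layers reuse data from lower layers, which reduces duplicated development
  • Easier debugging and troubleshooting: data quality and issues can be checked layer by layer
  • Support for multiple data applications: data in different layers can serve different application scenarios
  • Better query performance: precomputation and aggregation improve query efficiency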

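4.2 Typical Layers

4.2.1 Raw Data Layer (ODS)

The raw data layer stores data extracted directly from the data sources and keeps its original format and content:

  • Data characteristics: raw and unprocessed; may contain dirty data
  • Storage: usually stored in the same format as the data source
  • Retention: raw data is kept for a period determined by business requirements

4.2.2 Data Warehouse Layer (DW)

The data warehouse layer stores cleansed, integrated, and transformed data, modeled as a star or snowflake schema:

  • Data characteristics: clean, integrated, structured
  • Design model: star schema or snowflake schema
  • Granularity: fine-grained, to support many kinds of analysis

4.2.3 Data Mart Layer (DM)

The data mart layer holds data sets for specific business subjects and provides more focused data:

  • Data characteristics: oriented to a specific business subject; summarized
  • Design model: usually a star schema
  • Granularity: coarser, optimized for specific analysis requirements

4.2.4 Data Application Layer (DA)

The data application layer stores data tailored to specific applications:

  • Data characteristics: highly customized and optimized for a specific application
  • Design model: designed according to application requirements
  • Granularity: determined by application requirements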

5. Building a Data Warehouse with Hive

5.1 Steps for Designing a Hive Data Warehouse

  1. Requirements analysis: understand the business and analytical requirements
  2. Conceptual model design: determine the business subjects and entity relationships
  3. Logical model design: design the star or snowflake schema
  4. Physical model design: choose the storage format, partitioning strategy, and so on
  5. ETL design: design the data extraction, transformation, and loading process
  6. Data loading: load the data into the data warehouse
  7. Query optimization: tune query performance
  8. Maintenance and management: maintain and update the data warehouse regularly

5.2 Creating the Layered Table Structure

5.2.1 Creating the ODS Layer Tables

-- Create the raw data layer table
CREATE TABLE ods_sales (
    order_id INT,
    order_date STRING,
    product_id INT,
    customer_id INT,
    store_id INT,
    sales_amount DECIMAL(10,2),
    order_quantity INT
) PARTITIONED BY (dt STRING) -- partitioned by date
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Create the raw log table
CREATE TABLE ods_logs (
    ip STRING,
    user_id INT,
    action STRING,
    url STRING,
    `timestamp` BIGINT -- TIMESTAMP is a reserved word in Hive, so the column name is quoted
) PARTITIONED BY (dt STRING) -- partitioned by date
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

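5.2.2 Creating the DW Layer Tables

-- Create the time dimension table
CREATE TABLE dw_time_dim (
    time_id INT,
    `date` DATE, -- DATE is a reserved word in Hive, so the column name is quoted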
    year INT,
    quarter INT,
    month INT,
    day INT,
    day_of_week INT,
    is_weekend BOOLEAN
) STORED AS ORC;

-- Create the product dimension table
CREATE TABLE dw_product_dim (
    product_id INT,
    product_name STRING,
    category STRING,
    brand STRING,
    price DECIMAL(10,2)
) STORED AS ORC;

-- Create the customer dimension table
CREATE TABLE dw_customer_dim (
    customer_id INT,
    customer_name STRING,
    gender STRING,
    age INT,
    city STRING,
    region STRING
) STORED AS ORC;

-- Create the store dimension table
CREATE TABLE dw_store_dim (
    store_id INT,
    store_name STRING,
    city STRING,
    region STRING
) STORED AS ORC;

-- Create the sales fact table
CREATE TABLE dw_sales_fact (
    order_id INT,
    time_id INT,
    product_id INT,
    customer_id INT,
    store_id INT,
    sales_amount DECIMAL(10,2),
    order_quantity INT
) PARTITIONED BY (year INT, month INT) -- partitioned by year and month
CLUSTERED BY (product_id) INTO 8 BUCKETS -- bucketed by product ID
STORED AS ORC;

5.2.3 Creating the DM Layer Tables

-- Create the sales summary table (data mart)
CREATE TABLE dm_sales_summary (
    product_id INT,
    product_name STRING,
    category STRING,
    year INT,
    month INT,
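    total_sales DECIMAL(18,2), -- wider than the per-row amount so that sums do not overflow
    total_quantity BIGINT,     -- SUM over an INT column returns BIGINT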
    avg_price DECIMAL(10,2)
) PARTITIONED BY (dt STRING) -- partitioned by date
STORED AS ORC;

5.3 ETL Process Design

ETL (Extract, Transform, Load) is the core process of building a data warehouse:

5.3.1 Data Extraction

Extract data from the data sources into the ODS layer:

# Use Sqoop to import data from a relational database into Hive
# (outside of demos, prefer -P or --password-file to a plain-text --password)
sqoop import \
    --connect jdbc:mysql://localhost:3306/sales_db \
    --username root \
    --password password \
    --table sales \
    --hive-import \
    --hive-table ods_sales \
    --hive-partition-key dt \
    --hive-partition-value "2025-01-01" \
    --fields-terminated-by '\t' \
    -m 1

# Use Flume to collect log data into Hive
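
Flume itself is configured outside Hive, so it is not shown here. On the Hive side, one common pattern (an assumption, not something prescribed by this tutorial) is to let Flume's HDFS sink write JSON files into a dated directory and then register that directory as a partition of ods_logs. The HDFS path below is hypothetical.

```sql
-- Register the directory written by Flume as the partition for that day.
-- '/data/flume/logs/dt=2025-01-01' is a placeholder path.
ALTER TABLE ods_logs ADD IF NOT EXISTS
PARTITION (dt = '2025-01-01')
LOCATION '/data/flume/logs/dt=2025-01-01';

-- Confirm that the partition is visible to Hive.
SHOW PARTITIONS ods_logs;
```

Since ods_logs is a managed table, dropping this partition later also deletes the files under that location; an external table avoids that if the files must outlive the table.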

5.3.2 Data Transformation

Transform ODS layer data into DW layer data:

-- Transform the sales data into the fact table
INSERT INTO dw_sales_fact PARTITION(year=2025, month=1)
SELECT 
    s.order_id,
    t.time_id,
    s.product_id,
    s.customer_id,
    s.store_id,
    s.sales_amount,
    s.order_quantity
FROM ods_sales s
JOIN dw_time_dim t ON t.`date` = TO_DATE(s.order_date)
WHERE s.dt = '2025-01-01';

-- Aggregate the data into the sales summary table
INSERT INTO dm_sales_summary PARTITION(dt='2025-01-02')
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    t.year,
    t.month,
    SUM(f.sales_amount) AS total_sales,
    SUM(f.order_quantity) AS total_quantity,
    AVG(p.price) AS avg_price
FROM dw_sales_fact f
JOIN dw_product_dim p ON f.product_id = p.product_id
JOIN dw_time_dim t ON f.time_id = t.time_id
WHERE f.year = 2025 AND f.month = 1
GROUP BY p.product_id, p.product_name, p.category, t.year, t.month;
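
The fact-table load above names its target partition explicitly. When one ODS partition can contain orders from several months, a dynamic-partition insert lets Hive route each row by the year and month found in the time dimension. This is a sketch under that assumption; it reuses the tables defined in section 5.2.

```sql
-- Allow Hive to derive all partition values from the query result.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE dw_sales_fact PARTITION (year, month)
SELECT
    s.order_id,
    t.time_id,
    s.product_id,
    s.customer_id,
    s.store_id,
    s.sales_amount,
    s.order_quantity,
    t.year,   -- dynamic partition column, must come after the regular columns
    t.month   -- dynamic partition column, in the order given in PARTITION (...)
FROM ods_sales s
JOIN dw_time_dim t ON t.`date` = TO_DATE(s.order_date)
WHERE s.dt = '2025-01-01';
```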

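6. Data Warehouse Best Practices

6.1 Data Model Design Best Practices

  • Prefer the star schema: it has a simple structure, good query performance, and low maintenance cost
  • Design dimension tables carefully:
    • A dimension table should contain descriptive attributes
    • A dimension table should have a unique primary key
    • A dimension table should keep historical versions, to support slowly changing dimensions (SCD)
  • Design fact tables carefully:
    • A fact table should contain business measures
    • A fact table should contain foreign keys that reference the dimension tables
    • A fact table should be partitioned by time or a similar dimension
  • Support slowly changing dimensions (SCD):
    • Type 1: update in place, no history is kept
    • Type 2: add a new row, full history is kept
    • Type 3: add a new column, partial history is kept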
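
Type 2 is the variant that needs the most care in Hive. The following is a minimal sketch that rebuilds the dimension from a staging snapshot. All table names, the validity columns (start_date, end_date, is_current), the load date '2025-01-02', and the choice to track only city and region are assumptions made for illustration; the staging table is assumed to hold one row per customer.

```sql
-- Hypothetical staging snapshot and Type 2 dimension.
CREATE TABLE IF NOT EXISTS stg_customer (
    customer_id INT, customer_name STRING, city STRING, region STRING
) STORED AS ORC;

CREATE TABLE IF NOT EXISTS dw_customer_dim_scd2 (
    customer_id INT, customer_name STRING, city STRING, region STRING,
    start_date STRING, end_date STRING, is_current BOOLEAN
) STORED AS ORC;

-- Build the next version of the dimension in a separate table.
CREATE TABLE dw_customer_dim_scd2_next STORED AS ORC AS
-- 1) Keep every existing row; close the current row if a tracked attribute changed.
SELECT
    d.customer_id, d.customer_name, d.city, d.region, d.start_date,
    CASE WHEN d.is_current = TRUE AND s.customer_id IS NOT NULL
              AND (NOT (d.city <=> s.city) OR NOT (d.region <=> s.region))
         THEN '2025-01-02' ELSE d.end_date END AS end_date,
    CASE WHEN d.is_current = TRUE AND s.customer_id IS NOT NULL
              AND (NOT (d.city <=> s.city) OR NOT (d.region <=> s.region))
         THEN FALSE ELSE d.is_current END AS is_current
FROM dw_customer_dim_scd2 d
LEFT JOIN stg_customer s ON d.customer_id = s.customer_id
UNION ALL
-- 2) Add a new current row for changed customers and for customers not seen before.
SELECT
    s.customer_id, s.customer_name, s.city, s.region,
    '2025-01-02' AS start_date,
    '9999-12-31' AS end_date,
    TRUE         AS is_current
FROM stg_customer s
LEFT JOIN dw_customer_dim_scd2 d
       ON s.customer_id = d.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL
   OR NOT (d.city <=> s.city)
   OR NOT (d.region <=> s.region);
```

The null-safe comparison `<=>` treats two NULL values as equal, so a missing city does not produce a new version on every load. After validating the result, the new table would replace the old one; on transactional (ACID) tables a MERGE statement is a possible alternative.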

6.2 Data Quality Best Practices

  • Data cleansing: remove dirty data, duplicates, and outliers
  • Data validation: verify the completeness, accuracy, and consistency of the data
  • Data monitoring: monitor data quality regularly so that problems are found and fixed promptly
  • Data lineage management: track where data comes from and where it flows
  • Metadata management: maintain the definitions, structure, and relationships of the data
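
Cleansing and validation rules can be expressed as plain queries that run after each load. The checks below are a sketch against the ods_sales and dw_product_dim tables from section 5.2; the rules themselves (one row per order and product, non-negative amounts, positive quantities) are assumptions about the business data.

```sql
-- Duplicates: the same order/product pair loaded more than once.
SELECT order_id, product_id, COUNT(*) AS row_count
FROM ods_sales
WHERE dt = '2025-01-01'
GROUP BY order_id, product_id
HAVING COUNT(*) > 1;

-- Completeness and validity: missing keys and suspicious measures.
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN product_id IS NULL OR customer_id IS NULL THEN 1 ELSE 0 END) AS missing_keys,
    SUM(CASE WHEN sales_amount < 0 OR order_quantity <= 0 THEN 1 ELSE 0 END)   AS suspicious_values
FROM ods_sales
WHERE dt = '2025-01-01';

-- Consistency: sales rows whose product does not exist in the dimension.
SELECT COUNT(*) AS orphan_rows
FROM ods_sales s
LEFT JOIN dw_product_dim p ON s.product_id = p.product_id
WHERE s.dt = '2025-01-01'
  AND p.product_id IS NULL;
```

Storing these counts per load date in a small monitoring table makes trends visible over time.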

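6.3 Performance Optimization Best Practices

  • Use columnar storage formats: ORC, Parquet, and similar formats
  • Design partitioning and bucketing carefully: partition on columns that are filtered on most often, and bucket on columns that are frequently used in JOINs
  • Use pre-aggregated tables: compute summaries in advance to reduce the work done at query time
  • Use materialized views: for frequently queried views, consider materialized views
  • Optimize query statements: avoid full table scans and use suitable JOIN types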
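
Two of these points can be checked directly in Hive. The sketch below uses dw_sales_fact from section 5.2.2, which is partitioned by year and month: filtering on the partition columns lets Hive skip all other partitions, and collecting statistics gives the optimizer the information it needs to choose join strategies.

```sql
-- Partition pruning: the plan should read only the year=2025/month=1 partition.
EXPLAIN
SELECT product_id, SUM(sales_amount) AS total_sales
FROM dw_sales_fact
WHERE year = 2025 AND month = 1
GROUP BY product_id;

-- Table-level and column-level statistics for the cost-based optimizer.
ANALYZE TABLE dw_sales_fact PARTITION (year = 2025, month = 1) COMPUTE STATISTICS;
ANALYZE TABLE dw_sales_fact PARTITION (year = 2025, month = 1) COMPUTE STATISTICS FOR COLUMNS;

-- Let Hive turn joins against small dimension tables into map-side joins.
SET hive.auto.convert.join = true;
```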

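6.4 Data Management Best Practices

  • Establish a data lifecycle policy: archive and clean up expired data regularly
  • Establish a data security policy: control data access permissions and protect sensitive data
  • Establish a backup and restore policy: back up data regularly and make sure it can be restored
  • Establish a version management policy: manage versions of the data model and the ETL process
  • Establish a data governance framework: ensure data quality, security, and compliance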
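
For partitioned tables, the lifecycle policy often comes down to dropping old partitions. The sketch below is an assumption-laden example: the cutoff date stands in for whatever retention period the business defines, and because ods_sales is a managed table, dropping a partition also deletes its files, so any archiving has to happen first.

```sql
-- Review the existing partitions before removing anything.
SHOW PARTITIONS ods_sales;

-- Drop all daily partitions older than the (hypothetical) retention cutoff.
ALTER TABLE ods_sales DROP IF EXISTS PARTITION (dt < '2024-01-01');
```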

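7. Data Warehouse Case Studies

7.1 E-Commerce Data Warehouse

An e-commerce data warehouse usually covers the following subjects:

  • Sales analysis: sales amount, order count, average order value, and so on
  • User analysis: user activity, user retention, user behavior, and so on
  • Product analysis: product sales ranking, inventory status, conversion rate, and so on
  • Marketing analysis: campaign effectiveness, coupon usage, and so on
  • Supply chain analysis: inventory turnover, logistics efficiency, and so on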
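
As an illustration of the sales analysis subject, the query below computes sales amount, order count, and average order value per month. It is a sketch that assumes the dw_sales_fact table from section 5.2.2 and that order_id identifies one order.

```sql
-- Monthly sales amount, order count, and average order value.
SELECT
    f.year,
    f.month,
    SUM(f.sales_amount)                              AS total_sales,
    COUNT(DISTINCT f.order_id)                       AS order_count,
    SUM(f.sales_amount) / COUNT(DISTINCT f.order_id) AS avg_order_value
FROM dw_sales_fact f
WHERE f.year = 2025
GROUP BY f.year, f.month;
```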

7.2 Log Analysis Data Warehouse

A log analysis data warehouse usually covers the following subjects:

  • Traffic analysis: PV, UV, visit duration, and so on
  • Behavior analysis: user click paths, page conversion rates, and so on
  • Performance analysis: page load time, server response time, and so on
  • Error analysis: error rate, distribution of error types, and so on
  • Source analysis: traffic sources, channel effectiveness, and so on
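
For the traffic analysis subject, PV and UV can be derived from the ods_logs table defined in section 5.2.1. The sketch assumes that every log record counts as one page view and that user_id identifies a visitor; a real log would usually need a filter on the action column first.

```sql
-- Daily page views (PV) and unique visitors (UV).
SELECT
    dt,
    COUNT(*)                AS pv,
    COUNT(DISTINCT user_id) AS uv
FROM ods_logs
WHERE dt = '2025-01-01'
GROUP BY dt;
```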

8. Summary

Designing a data warehouse is a complex process that has to account for business requirements, the data model, storage structures, performance optimization, and more. A Hive-based data warehouse takes full advantage of the Hadoop ecosystem: it can process large-scale data and supports complex queries and analysis.

When designing a data warehouse, follow these principles:

  1. Understand the business requirements, and define the goals and scope of the data warehouse
  2. Use a layered design to improve maintainability and reusability
  3. Prefer the star schema to improve query performance
  4. Design partitioning and bucketing strategies carefully to optimize storage and query performance
  5. Take data quality seriously, and establish data quality monitoring and management
  6. Optimize continuously, adjusting the design as business requirements and data volumes change

With sound design and optimization, a Hive-based data warehouse can give an enterprise strong big-data analysis and decision-support capabilities, helping it extract value from its data and become more competitive.