Hive Data Types and Storage Formats

Learn the data types Hive supports, and the characteristics and usage of its storage formats.

1. Hive Data Types Overview

Hive supports several kinds of data types, including basic data types, complex data types, and built-in function types. Choosing the right data type is critical for optimizing query performance and storage space.

2. Basic Data Types

Hive's basic data types include numeric, string, boolean, and date/time types.

Data type | Description | Example
TINYINT | 1-byte signed integer, range: -128 to 127 | 42, -10
SMALLINT | 2-byte signed integer, range: -32768 to 32767 | 1000, -500
INT | 4-byte signed integer, range: -2^31 to 2^31-1 | 100000, -50000
BIGINT | 8-byte signed integer, range: -2^63 to 2^63-1 | 1000000000L, -5000000000L
FLOAT | 4-byte single-precision floating-point number | CAST(3.14 AS FLOAT), CAST(-0.5 AS FLOAT) (Hive has no f literal suffix)
DOUBLE | 8-byte double-precision floating-point number | 3.1415926, -0.12345
DECIMAL(p,s) | High-precision decimal number; p is the total number of digits, s is the number of digits after the decimal point | DECIMAL(10,2): 12345.67
BOOLEAN | Boolean value, true or false | TRUE, FALSE
STRING | Variable-length string | 'Hello', "World"
VARCHAR(n) | Variable-length string, maximum length n | VARCHAR(50): 'Short string'
CHAR(n) | Fixed-length string, length n | CHAR(10): 'Fixed '
DATE | Date, format: YYYY-MM-DD | '2025-01-15'
TIMESTAMP | Timestamp, format: YYYY-MM-DD HH:MM:SS.fffffffff | '2025-01-15 14:30:45.123456789'

2.1 Numeric Type Examples

-- Create a table with numeric columns
CREATE TABLE IF NOT EXISTS numeric_types (
    tiny_col TINYINT,
    small_col SMALLINT,
    int_col INT,
    big_col BIGINT,
    float_col FLOAT,
    double_col DOUBLE,
    decimal_col DECIMAL(10,2)
);

-- Insert data (Hive has no f suffix for FLOAT; the literal 3.14 is converted to the column type)
INSERT INTO numeric_types VALUES
(100, 20000, 3000000, 4000000000L, 3.14, 3.1415926, 12345.67);

-- Query the data
SELECT * FROM numeric_types;

2.2 String Type Examples

-- Create a table with string columns
CREATE TABLE IF NOT EXISTS string_types (
    string_col STRING,
    varchar_col VARCHAR(50),
    char_col CHAR(10)
);

-- Insert data
INSERT INTO string_types VALUES
('This is a string', 'This is a varchar', 'char fixed');

-- Query the data
SELECT * FROM string_types;

-- Check string lengths
SELECT 
    LENGTH(string_col) AS string_len,
    LENGTH(varchar_col) AS varchar_len,
    LENGTH(char_col) AS char_len
FROM string_types;

2.3 Date and Time Type Examples

-- Create a table with date and time columns
CREATE TABLE IF NOT EXISTS datetime_types (
    date_col DATE,
    timestamp_col TIMESTAMP
);

-- Insert data
INSERT INTO datetime_types VALUES
('2025-01-15', '2025-01-15 14:30:45.123456789');

-- Insert the current date and time using functions
-- (INSERT ... SELECT is used because older Hive versions accept only literals in VALUES)
INSERT INTO datetime_types
SELECT CURRENT_DATE, CURRENT_TIMESTAMP;

-- Query the data
SELECT * FROM datetime_types;

-- Use date and time functions
SELECT 
    date_col,
    YEAR(date_col) AS year,
    MONTH(date_col) AS month,
    DAY(date_col) AS day,
    timestamp_col,
    HOUR(timestamp_col) AS hour,
    MINUTE(timestamp_col) AS minute,
    SECOND(timestamp_col) AS second
FROM datetime_types;

3. Complex Data Types

Hive supports four complex data types: ARRAY, MAP, STRUCT, and UNIONTYPE.

3.1 ARRAY Type

The ARRAY type stores a collection of elements of the same type; elements are accessed by index.

-- Create a table with an ARRAY column
CREATE TABLE IF NOT EXISTS array_table (
    id INT,
    name STRING,
    hobbies ARRAY<STRING>
);

-- Insert data
-- (INSERT ... SELECT is used because older Hive versions reject complex-type constructors in VALUES)
INSERT INTO array_table
SELECT 1, 'Alice', ARRAY('Reading', 'Traveling', 'Photography')
UNION ALL
SELECT 2, 'Bob', ARRAY('Sports', 'Music')
UNION ALL
SELECT 3, 'Charlie', ARRAY('Cooking', 'Gardening', 'Painting');

-- Query the array length
SELECT id, name, hobbies, SIZE(hobbies) AS hobby_count FROM array_table;

-- Access array elements (indexes start at 0)
SELECT id, name, hobbies[0] AS first_hobby, hobbies[1] AS second_hobby FROM array_table;

-- Check whether the array contains a specific element
SELECT id, name FROM array_table WHERE array_contains(hobbies, 'Reading');

-- Expand (explode) the array into rows
SELECT id, name, hobby
FROM array_table
LATERAL VIEW explode(hobbies) h AS hobby;

3.2 MAP Type

The MAP type stores a collection of key-value pairs. Keys must be primitive types; values can be of any type.

-- Create a table with a MAP column
CREATE TABLE IF NOT EXISTS map_table (
    id INT,
    name STRING,
    scores MAP<STRING, INT>
);

-- Insert data
-- (INSERT ... SELECT is used because older Hive versions reject complex-type constructors in VALUES)
INSERT INTO map_table
SELECT 1, 'Alice', MAP('Math', 90, 'English', 85, 'Science', 95)
UNION ALL
SELECT 2, 'Bob', MAP('Math', 80, 'History', 75)
UNION ALL
SELECT 3, 'Charlie', MAP('English', 88, 'Science', 92, 'Art', 90);

-- Access MAP values
SELECT id, name, scores['Math'] AS math_score, scores['English'] AS english_score FROM map_table;

-- Check whether the MAP contains a specific key
SELECT id, name FROM map_table WHERE array_contains(map_keys(scores), 'Math');

-- Get the MAP keys and values
SELECT id, name, map_keys(scores) AS subjects, map_values(scores) AS score_values FROM map_table;

-- Expand (explode) the MAP into rows
SELECT id, name, subject, score
FROM map_table
LATERAL VIEW explode(scores) s AS subject, score;

3.3 STRUCT Type

The STRUCT type stores a collection of fields that may have different types; fields are accessed by name.

-- Create a table with a STRUCT column
CREATE TABLE IF NOT EXISTS struct_table (
    id INT,
    name STRING,
    address STRUCT<city:STRING, street:STRING, zip:INT>
);

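-- Insert data
-- (INSERT ... SELECT is used because older Hive versions reject complex-type constructors in VALUES)
INSERT INTO struct_table
SELECT 1, 'Alice', NAMED_STRUCT('city', 'Beijing', 'street', 'Main St', 'zip', 100000)
UNION ALL
SELECT 2, 'Bob', NAMED_STRUCT('city', 'Shanghai', 'street', 'Second Ave', 'zip', 200000);

-- Access STRUCT fields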
SELECT id, name, address.city, address.street, address.zip FROM struct_table;

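3.4 UNIONTYPE Type

The UNIONTYPE type can hold values of several different types, but each stored value has exactly one of those types at a time. Hive's support for UNIONTYPE is incomplete.

-- Create a table with a UNIONTYPE column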
CREATE TABLE IF NOT EXISTS union_table (
    id INT,
    union_col UNIONTYPE<INT, STRING, BOOLEAN>
);

-- Insert data (a different member type in each row)
-- create_union(tag, v0, v1, v2) builds a UNIONTYPE holding the value at position tag;
-- Hive does not support CAST to UNIONTYPE
INSERT INTO union_table
SELECT 1, create_union(0, 100, 'string value', TRUE)
UNION ALL
SELECT 2, create_union(1, 100, 'string value', TRUE)
UNION ALL
SELECT 3, create_union(2, 100, 'string value', TRUE);

-- Query UNIONTYPE values
SELECT id, union_col FROM union_table;

-- Extract the value of each member type with extract_union (available in newer Hive releases);
-- a member that does not match the stored tag is expected to come back as NULL
SELECT
    id,
    union_col,
    extract_union(union_col, 0) AS int_value,
    extract_union(union_col, 1) AS string_value,
    extract_union(union_col, 2) AS boolean_value
FROM union_table;

4. Data Type Conversion

Hive supports both explicit and implicit data type conversion.

4.1 Implicit Conversion

Hive performs some data type conversions automatically, for example:

  • TINYINT → SMALLINT → INT → BIGINT → FLOAT → DOUBLE
  • STRING → DOUBLE (if the string can be converted to a number)
  • STRING → DATE (if the string is correctly formatted)
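
As a rough illustration of the rules above, the queries below rely on implicit conversion instead of CAST. This is a sketch, assuming the datetime_types table from section 2.3 exists; the column aliases are made up for this example.

```sql
-- INT combined with BIGINT: the INT operand is widened to BIGINT
SELECT 1 + 2L AS int_plus_bigint;

-- A numeric string used in arithmetic is converted to DOUBLE
SELECT '3.5' + 1 AS string_plus_int;

-- A correctly formatted string compared with a DATE column
SELECT * FROM datetime_types WHERE date_col = '2025-01-15';
```

Implicit conversion keeps queries short, but an explicit CAST makes the intended type visible to the next reader and is usually the safer choice in shared scripts.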

4.2 Explicit Conversion

Use the CAST() function for explicit data type conversion.

-- Explicit conversion examples
SELECT 
    CAST('123' AS INT) AS string_to_int,
    CAST(123 AS STRING) AS int_to_string,
    CAST(3.14 AS INT) AS double_to_int,
    CAST('2025-01-15' AS DATE) AS string_to_date,
    CAST(CURRENT_TIMESTAMP() AS DATE) AS timestamp_to_date;

-- Convert types while creating a table
CREATE TABLE IF NOT EXISTS cast_example AS
SELECT 
    id,
    CAST(salary AS DECIMAL(10,2)) AS formatted_salary,
    CAST(hire_date AS STRING) AS hire_date_str
FROM employees;
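
Explicit casts can fail or lose information, which is worth checking before relying on CAST in a pipeline. The sketch below uses literals plus the string_types table from section 2.2; the aliases are illustrative only.

```sql
-- A string that is not a valid number cannot be converted: Hive returns NULL instead of raising an error
SELECT CAST('abc' AS INT) AS not_a_number;

-- Casting to a DECIMAL with fewer fractional digits rounds the value to the target scale
SELECT CAST(12345.678 AS DECIMAL(10,2)) AS rounded_amount;

-- Keep only rows whose value converts cleanly
SELECT string_col
FROM string_types
WHERE CAST(string_col AS INT) IS NOT NULL;
```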

5. Hive Storage Formats

Hive supports several storage formats, each with its own characteristics and use cases. Choosing the right storage format is critical for optimizing query performance and storage space.

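5.1 Comparison of Common Storage Formats

Storage format | Type | Compression | Query performance | Storage footprint | Typical use
TextFile | Row-based | Supported | Low | Large | Simple data loading; suits log files
SequenceFile | Row-based | Supported | Medium | Medium | Intermediate storage format
RCFile | Hybrid row-columnar | Supported | Medium | Medium | Batch data processing
ORC | Hybrid row-columnar | Supported | High | Small | Large-scale data warehouses, complex queries
Parquet | Columnar | Supported | High | Small | OLAP queries, large-volume analysis
Avro | Row-based | Supported | Medium | Medium | Schema evolution, data exchange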

5.2 TextFile Format

TextFile is Hive's default storage format. Data is stored as lines of text, one record per line.

-- Create a table that uses the TextFile format
CREATE TABLE IF NOT EXISTS text_file_table (
    id INT,
    name STRING,
    age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Load data
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE text_file_table;
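
A common pattern is to land raw text in a TextFile table and then copy it into a columnar format for querying. The sketch below assumes text_file_table has been loaded as shown above; text_file_table_orc is a hypothetical table name chosen for this example.

```sql
-- Copy the rows from the TextFile staging table into a new ORC table (CTAS)
CREATE TABLE IF NOT EXISTS text_file_table_orc
STORED AS ORC
AS
SELECT id, name, age FROM text_file_table;

-- Inspect the input/output format and SerDe recorded for each table
DESCRIBE FORMATTED text_file_table;
DESCRIBE FORMATTED text_file_table_orc;
```

Keeping the text table as a staging area means the original files stay readable with ordinary tools, while queries run against the ORC copy.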

5.3 ORC Format

ORC (Optimized Row Columnar) is an optimized hybrid row-columnar storage format that provides efficient compression and query performance.

-- Create a table that uses the ORC format
CREATE TABLE IF NOT EXISTS orc_table (
    id INT,
    name STRING,
    age INT,
    department STRING,
    salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='SNAPPY',
    'orc.bloom.filter.columns'='id,name',
    'orc.dictionary.key.threshold'='0.8',
    'orc.max.string.length'='256'
);

-- Insert data from another table
INSERT INTO orc_table
SELECT * FROM employees;

-- Query the data
SELECT * FROM orc_table WHERE department='IT' AND salary > 50000;

5.4 Parquet Format

Parquet is a columnar storage format suited to OLAP queries and large-volume analysis.

-- Create a table that uses the Parquet format
CREATE TABLE IF NOT EXISTS parquet_table (
    id INT,
    name STRING,
    age INT,
    department STRING,
    salary DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES (
    'parquet.compression'='SNAPPY',
    -- sizes are given in bytes: 256 MB block, 1 MB page
    'parquet.block.size'='268435456',
    'parquet.page.size'='1048576'
);

-- Insert data from another table
INSERT INTO parquet_table
SELECT * FROM employees;

-- Query the data
SELECT department, AVG(salary) AS avg_salary
FROM parquet_table
GROUP BY department;

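5.5 Avro Format

Avro is a row-based serialization format whose schemas are defined in JSON. It supports schema evolution and suits data exchange.

-- Create a table stored as Avro (the schema is read from avro.schema.url)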
CREATE TABLE IF NOT EXISTS avro_table (
    id INT,
    name STRING,
    age INT,
    department STRING
)
STORED AS AVRO
TBLPROPERTIES (
    'avro.schema.url'='hdfs:///path/to/schema.avsc',
    'avro.codec'='snappy'
);

-- Insert data
INSERT INTO avro_table VALUES
(1, 'Alice', 30, 'IT'),
(2, 'Bob', 25, 'HR');

6. Compression Formats

Hive supports several compression formats that can be combined with storage formats to further reduce storage space and improve query performance.

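6.1 Comparison of Common Compression Formats

Compression format | Compression ratio | Compression speed | Decompression speed | Splittable | Typical use
Gzip | High | Medium | Medium | No | Archival storage
Snappy | Medium | High | High | Only inside container formats (e.g. ORC, Parquet) | Real-time queries
LZO | Medium | High | High | Yes, when indexed | Large-scale data processing
Zlib | High | Low | Medium | Only inside container formats (e.g. ORC) | High compression ratio requirements
Bzip2 | Very high | Very low | Low | Yes | Archival storage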

6.2 Configuring Compression

-- Set the compression format temporarily (session level)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Specify the compression format when creating a table
CREATE TABLE IF NOT EXISTS compressed_table (
    id INT,
    name STRING,
    age INT
)
STORED AS ORC
TBLPROPERTIES (
    'orc.compress'='SNAPPY'
);

-- Compression is applied when data is inserted
INSERT OVERWRITE TABLE compressed_table
SELECT * FROM source_table;

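7. Recommendations for Choosing a Storage Format

  • TextFile: suits simple data loading, data that is refreshed frequently, or use as an intermediate storage format
  • ORC: suits large-scale data warehouses that need complex queries and a high compression ratio
  • Parquet: suits OLAP queries, especially when only specific columns are analyzed
  • Avro: suits data exchange scenarios that need schema evolution
  • Snappy compression: suits scenarios that need fast compression and decompression
  • Gzip compression: suits archival scenarios that need a high compression ratio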
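
One way to ground these recommendations in numbers is to compare the statistics Hive records for each table. The sketch below assumes the orc_table and parquet_table tables from section 5 have already been populated with the same data.

```sql
-- Gather basic table statistics (row count, file count, sizes)
ANALYZE TABLE orc_table COMPUTE STATISTICS;
ANALYZE TABLE parquet_table COMPUTE STATISTICS;

-- The Table Parameters section lists numFiles, numRows, rawDataSize and totalSize
DESCRIBE FORMATTED orc_table;
DESCRIBE FORMATTED parquet_table;
```

Comparing totalSize across the tables gives the on-disk footprint of each format for the same data; running the same query against each table then gives a rough performance comparison.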

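Practice Exercises

Exercise 1: Using Basic Data Types

  1. Create a table that contains every basic data type
  2. Insert data of different types
  3. Use various functions to query the data and convert between types

Exercise 2: Using Complex Data Types

  1. Create a table that contains ARRAY, MAP, and STRUCT columns
  2. Insert complex-type data
  3. Use LATERAL VIEW and explode to expand the complex types
  4. Query and filter the complex-type data

Exercise 3: Comparing Storage Formats

  1. Create three tables that use the TextFile, ORC, and Parquet formats respectively
  2. Insert the same data into all three tables
  3. Compare the storage size of the three tables
  4. Compare the query performance of the three tables

Exercise 4: Configuring Compression

  1. Configure different compression formats
  2. Compare the compression ratio and query performance of each format
  3. Choose the compression format that fits your business scenario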

8. Summary

This tutorial covered Hive data types and storage formats, including:

  • Basic data types: numeric, string, boolean, and date/time types
  • Complex data types: ARRAY, MAP, STRUCT, and UNIONTYPE
  • Data type conversion: implicit and explicit conversion
  • Storage formats: TextFile, ORC, Parquet, Avro, etc.
  • Compression formats: Snappy, Gzip, LZO, etc.
  • Recommendations for choosing storage and compression formats

After working through this tutorial, you should be able to use Hive data types and choose suitable storage and compression formats to optimize query performance and storage space.