Hive Data Types and Storage Formats
1. Overview of Hive Data Types
Hive supports many data types, including primitive types, complex types, and the types used by its built-in functions. Choosing the right data type is essential for optimizing query performance and storage space.
2. Primitive Data Types
Hive's primitive data types include numeric, string, boolean, and date/time types.
| Data type | Description | Example |
|---|---|---|
| TINYINT | 1-byte signed integer, range: -128 to 127 | 42, -10 |
| SMALLINT | 2-byte signed integer, range: -32768 to 32767 | 1000, -500 |
| INT | 4-byte signed integer, range: -2^31 to 2^31-1 | 100000, -50000 |
| BIGINT | 8-byte signed integer, range: -2^63 to 2^63-1 | 1000000000L, -5000000000L |
| FLOAT | 4-byte single-precision floating-point number | 3.14, -0.5 |
| DOUBLE | 8-byte double-precision floating-point number | 3.1415926, -0.12345 |
| DECIMAL(p,s) | High-precision decimal number; p is the total number of digits, s is the number of digits after the decimal point | DECIMAL(10,2): 12345.67 |
| BOOLEAN | Boolean value, true or false | TRUE, FALSE |
| STRING | Variable-length string | 'Hello', "World" |
| VARCHAR(n) | Variable-length string with a maximum length of n | VARCHAR(50): 'Short string' |
| CHAR(n) | Fixed-length string of length n | CHAR(10): 'Fixed     ' |
| DATE | Date, format: YYYY-MM-DD | '2025-01-15' |
| TIMESTAMP | Timestamp, format: YYYY-MM-DD HH:MM:SS.fffffffff | '2025-01-15 14:30:45.123456789' |
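The integer ranges in the table follow directly from the byte widths. The Python sketch below recomputes them and picks the narrowest type for a value; `smallest_type_for` is a hypothetical helper written for this illustration, not a Hive function.

```python
# Byte width of each Hive integer type, taken from the table above.
INTEGER_WIDTHS = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8}

def signed_range(num_bytes):
    """Return (min, max) for a signed integer stored in num_bytes bytes."""
    bits = num_bytes * 8
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def smallest_type_for(value):
    """Pick the narrowest integer type whose range contains value (hypothetical helper)."""
    for name, width in INTEGER_WIDTHS.items():
        low, high = signed_range(width)
        if low <= value <= high:
            return name
    raise ValueError(f"{value} does not fit in BIGINT")

for name, width in INTEGER_WIDTHS.items():
    print(name, signed_range(width))
print(smallest_type_for(4000000000))
```

Choosing the narrowest type that safely holds the expected values is one way to apply the storage advice from the overview.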
2.1 Numeric Type Examples
-- Create a table containing numeric types
CREATE TABLE IF NOT EXISTS numeric_types (
tiny_col TINYINT,
small_col SMALLINT,
int_col INT,
big_col BIGINT,
float_col FLOAT,
double_col DOUBLE,
decimal_col DECIMAL(10,2)
);
-- Insert data (Hive has no "f" literal suffix; 3.14 is converted to FLOAT on insert)
INSERT INTO numeric_types VALUES
(100, 20000, 3000000, 4000000000L, 3.14, 3.1415926, 12345.67);
-- Query data
SELECT * FROM numeric_types;
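To make the meaning of p and s in DECIMAL(p,s) concrete, here is a small Python sketch using the standard `decimal` module. It assumes half-up rounding to s fractional digits and treats a value that needs more than p digits as NULL (modeled as `None`); `fit_decimal` is a hypothetical helper, and the exact overflow behavior of your Hive version should be checked separately.

```python
from decimal import Decimal, ROUND_HALF_UP

def fit_decimal(value, precision, scale):
    """Round value to `scale` fractional digits; return None if more than `precision` digits are needed."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives a quantum of 0.01
    rounded = Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)
    digits = len(rounded.as_tuple().digits)
    return rounded if digits <= precision else None

print(fit_decimal("12345.67", 10, 2))       # fits DECIMAL(10,2) unchanged
print(fit_decimal("12345.678", 10, 2))      # extra fractional digit is rounded
print(fit_decimal("123456789.123", 10, 2))  # needs 11 digits, so None
```

Values are passed as strings so that no binary floating-point error is introduced before rounding.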
2.2 String Type Examples
-- Create a table containing string types
CREATE TABLE IF NOT EXISTS string_types (
string_col STRING,
varchar_col VARCHAR(50),
char_col CHAR(10)
);
-- Insert data
INSERT INTO string_types VALUES
('This is a string', 'This is a varchar', 'char fixed');
-- Query data
SELECT * FROM string_types;
-- Check the string lengths
SELECT
LENGTH(string_col) AS string_len,
LENGTH(varchar_col) AS varchar_len,
LENGTH(char_col) AS char_len
FROM string_types;
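The difference between VARCHAR(n) and CHAR(n) can be modeled in a few lines of Python. This sketch assumes the documented behavior that over-length values are cut to n characters and that CHAR values are space-padded to exactly n; `to_varchar` and `to_char` are hypothetical names used only here.

```python
def to_varchar(text, max_len):
    """VARCHAR(n): keep at most max_len characters; longer input is cut off."""
    return text[:max_len]

def to_char(text, length):
    """CHAR(n): cut to length, then right-pad with spaces to exactly length characters."""
    return text[:length].ljust(length)

print(repr(to_varchar("This is a varchar", 50)))
print(repr(to_char("Fixed", 10)))
print(repr(to_char("char fixed plus more", 10)))
```

Because truncation is silent, it is worth sizing VARCHAR and CHAR columns from the longest value you expect to store.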
2.3 Date and Time Type Examples
-- Create a table containing date and time types
CREATE TABLE IF NOT EXISTS datetime_types (
date_col DATE,
timestamp_col TIMESTAMP
);
-- Insert data
INSERT INTO datetime_types VALUES
('2025-01-15', '2025-01-15 14:30:45.123456789');
-- Insert the current date and time using functions
INSERT INTO datetime_types VALUES
(CURRENT_DATE(), CURRENT_TIMESTAMP());
-- Query data
SELECT * FROM datetime_types;
-- Use date and time functions
SELECT
date_col,
YEAR(date_col) AS year,
MONTH(date_col) AS month,
DAY(date_col) AS day,
timestamp_col,
HOUR(timestamp_col) AS hour,
MINUTE(timestamp_col) AS minute,
SECOND(timestamp_col) AS second
FROM datetime_types;
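As a rough illustration of what the field-extraction functions return, the Python sketch below splits a timestamp string in the `YYYY-MM-DD HH:MM:SS.fffffffff` format into its parts. The fraction is handled by hand because Python's `datetime` keeps only microseconds, while the format above carries nanoseconds; `split_timestamp` is a hypothetical helper.

```python
from datetime import datetime

def split_timestamp(text):
    """Split 'YYYY-MM-DD HH:MM:SS[.fffffffff]' into calendar fields plus nanoseconds."""
    base, _, fraction = text.partition(".")
    parsed = datetime.strptime(base, "%Y-%m-%d %H:%M:%S")
    # Pad the fraction to 9 digits so ".5" means 500000000 nanoseconds.
    nanos = int(fraction.ljust(9, "0")) if fraction else 0
    return {
        "year": parsed.year, "month": parsed.month, "day": parsed.day,
        "hour": parsed.hour, "minute": parsed.minute, "second": parsed.second,
        "nanos": nanos,
    }

fields = split_timestamp("2025-01-15 14:30:45.123456789")
print(fields)
```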
3. Complex Data Types
Hive supports four complex data types: ARRAY, MAP, STRUCT, and UNIONTYPE.
3.1 ARRAY Type
The ARRAY type stores an ordered collection of elements of the same type; elements are accessed by index.
-- Create a table containing an ARRAY column
CREATE TABLE IF NOT EXISTS array_table (
id INT,
name STRING,
hobbies ARRAY<STRING>
);
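-- Insert data (INSERT ... SELECT is used because older Hive versions
-- reject complex-type constructors inside a VALUES clause)
INSERT INTO array_table
SELECT 1, 'Alice', ARRAY('Reading', 'Traveling', 'Photography')
UNION ALL SELECT 2, 'Bob', ARRAY('Sports', 'Music')
UNION ALL SELECT 3, 'Charlie', ARRAY('Cooking', 'Gardening', 'Painting');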
-- Query the array length
SELECT id, name, hobbies, SIZE(hobbies) AS hobby_count FROM array_table;
-- Access array elements (indexes start at 0)
SELECT id, name, hobbies[0] AS first_hobby, hobbies[1] AS second_hobby FROM array_table;
-- Check whether the array contains a specific element
SELECT id, name FROM array_table WHERE array_contains(hobbies, 'Reading');
-- Explode the array into one row per element
SELECT id, name, hobby
FROM array_table
LATERAL VIEW explode(hobbies) h AS hobby;
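The effect of `LATERAL VIEW explode` on an array column can be sketched in Python as a nested loop that emits one output row per element. This is only a model of the result shape, using the same sample rows as the example above.

```python
rows = [
    (1, "Alice", ["Reading", "Traveling", "Photography"]),
    (2, "Bob", ["Sports", "Music"]),
    (3, "Charlie", ["Cooking", "Gardening", "Painting"]),
]

def lateral_view_explode(table):
    """Yield one (id, name, hobby) row per array element."""
    for row_id, name, hobbies in table:
        for hobby in hobbies:
            yield (row_id, name, hobby)

exploded = list(lateral_view_explode(rows))
for row in exploded:
    print(row)
```

Note that in this model a row with an empty array produces no output rows at all.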
3.2 MAP Type
The MAP type stores a collection of key-value pairs. Keys must be primitive types; values can be of any type.
-- Create a table containing a MAP column
CREATE TABLE IF NOT EXISTS map_table (
id INT,
name STRING,
scores MAP<STRING, INT>
);
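-- Insert data (INSERT ... SELECT is used because older Hive versions
-- reject complex-type constructors inside a VALUES clause)
INSERT INTO map_table
SELECT 1, 'Alice', MAP('Math', 90, 'English', 85, 'Science', 95)
UNION ALL SELECT 2, 'Bob', MAP('Math', 80, 'History', 75)
UNION ALL SELECT 3, 'Charlie', MAP('English', 88, 'Science', 92, 'Art', 90);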
-- Access MAP values by key (a missing key returns NULL)
SELECT id, name, scores['Math'] AS math_score, scores['English'] AS english_score FROM map_table;
-- Check whether the MAP contains a specific key
SELECT id, name FROM map_table WHERE array_contains(map_keys(scores), 'Math');
-- Get the MAP keys and values
SELECT id, name, map_keys(scores) AS subjects, map_values(scores) AS score_values FROM map_table;
-- Explode the MAP into one row per key-value pair
SELECT id, name, subject, score
FROM map_table
LATERAL VIEW explode(scores) s AS subject, score;
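A MAP column behaves much like a dictionary per row. The Python sketch below models key lookup (a missing key yields `None`, standing in for NULL) and the row-per-pair result of exploding a map; the helper names are hypothetical.

```python
map_rows = [
    (1, "Alice", {"Math": 90, "English": 85, "Science": 95}),
    (2, "Bob", {"Math": 80, "History": 75}),
    (3, "Charlie", {"English": 88, "Science": 92, "Art": 90}),
]

def lookup(scores, subject):
    """Model scores['subject']: return None when the key is absent."""
    return scores.get(subject)

def explode_map(table):
    """Yield one (id, name, subject, score) row per key-value pair."""
    for row_id, name, scores in table:
        for subject, score in scores.items():
            yield (row_id, name, subject, score)

has_math = [name for _, name, scores in map_rows if "Math" in scores]
pairs = list(explode_map(map_rows))
print(has_math)
print(pairs[:2])
```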
3.3 STRUCT Type
The STRUCT type stores a fixed set of fields that may have different types; fields are accessed by name.
-- Create a table containing a STRUCT column
CREATE TABLE IF NOT EXISTS struct_table (
id INT,
name STRING,
address STRUCT<city:STRING, street:STRING, zip:INT>
);
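-- Insert data (INSERT ... SELECT is used because older Hive versions
-- reject complex-type constructors inside a VALUES clause)
INSERT INTO struct_table
SELECT 1, 'Alice', NAMED_STRUCT('city', 'Beijing', 'street', 'Main St', 'zip', 100000)
UNION ALL SELECT 2, 'Bob', NAMED_STRUCT('city', 'Shanghai', 'street', 'Second Ave', 'zip', 200000);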
-- Access STRUCT fields with dot notation
SELECT id, name, address.city, address.street, address.zip FROM struct_table;
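For readers who think in application code, a STRUCT is close to a small typed record. The following Python sketch mirrors the `address` column with `typing.NamedTuple`; the class names are illustrative only.

```python
from typing import NamedTuple

class Address(NamedTuple):
    """Mirrors STRUCT<city:STRING, street:STRING, zip:INT>."""
    city: str
    street: str
    zip: int

class Person(NamedTuple):
    id: int
    name: str
    address: Address

people = [
    Person(1, "Alice", Address("Beijing", "Main St", 100000)),
    Person(2, "Bob", Address("Shanghai", "Second Ave", 200000)),
]

# Field access by name, analogous to address.city in the query above.
for person in people:
    print(person.id, person.name, person.address.city, person.address.zip)
```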
3.4 UNIONTYPE Type
The UNIONTYPE type can hold values of several different types, but each value holds exactly one of those types at a time.
-- Create a table containing a UNIONTYPE column
CREATE TABLE IF NOT EXISTS union_table (
id INT,
union_col UNIONTYPE<INT, STRING, BOOLEAN>
);
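-- Insert data with create_union(tag, v0, v1, v2); the tag selects which value is stored
-- (UNIONTYPE support in Hive is incomplete, so treat these statements as a sketch)
INSERT INTO union_table SELECT 1, create_union(0, 100, '', FALSE);
INSERT INTO union_table SELECT 2, create_union(1, 0, 'string value', FALSE);
INSERT INTO union_table SELECT 3, create_union(2, 0, '', TRUE);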
-- Query UNIONTYPE values
SELECT id, union_col FROM union_table;
-- Extract the value stored under each tag (extract_union is only available in newer
-- Hive releases; tags that a row does not hold come back as NULL)
SELECT
id,
union_col,
extract_union(union_col, 0) AS int_value,
extract_union(union_col, 1) AS string_value,
extract_union(union_col, 2) AS boolean_value
FROM union_table;
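Conceptually, a UNIONTYPE value is a tagged union: a tag that says which member type is in use, plus the value itself. The Python sketch below models `UNIONTYPE<INT, STRING, BOOLEAN>` as a `(tag, value)` pair; `make_union` and `describe` are hypothetical helpers and not Hive functions.

```python
UNION_MEMBER_TYPES = (int, str, bool)  # models UNIONTYPE<INT, STRING, BOOLEAN>

def make_union(tag, value):
    """Build a (tag, value) pair, rejecting values that do not match the tagged type."""
    expected = UNION_MEMBER_TYPES[tag]
    # type(...) is used instead of isinstance because bool is a subclass of int in Python.
    if type(value) is not expected:
        raise TypeError(f"tag {tag} expects {expected.__name__}, got {type(value).__name__}")
    return (tag, value)

def describe(union_value):
    """Render the active member type and value of a tagged union."""
    tag, value = union_value
    return f"tag {tag} ({UNION_MEMBER_TYPES[tag].__name__}): {value!r}"

values = [make_union(0, 100), make_union(1, "string value"), make_union(2, True)]
for item in values:
    print(describe(item))
```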
4. Data Type Conversion
Hive supports both explicit and implicit data type conversion.
4.1 Implicit Conversion
Hive performs some type conversions automatically, for example:
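- TINYINT → SMALLINT → INT → BIGINT → FLOAT → DOUBLE
- STRING → DOUBLE (if the string can be parsed as a number)
- STRING → DATE (if the string is in YYYY-MM-DD format; this works only in some contexts, so an explicit CAST is more reliable)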
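The numeric widening chain in the list above can be expressed as a simple ordering. This Python sketch covers only that chain and ignores the string conversions; `can_widen` and `common_type` are hypothetical helper names.

```python
WIDENING_CHAIN = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE"]

def can_widen(source, target):
    """True if source can be implicitly converted to target along the chain."""
    return WIDENING_CHAIN.index(source) <= WIDENING_CHAIN.index(target)

def common_type(left, right):
    """Return the wider of two numeric types, e.g. for an expression mixing both."""
    return max(left, right, key=WIDENING_CHAIN.index)

print(can_widen("INT", "DOUBLE"))
print(can_widen("BIGINT", "INT"))
print(common_type("SMALLINT", "FLOAT"))
```

Conversions in the other direction (narrowing) are not implicit and need an explicit CAST.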
4.2 Explicit Conversion
Use the CAST() function for explicit data type conversion.
-- Explicit conversion examples
SELECT
CAST('123' AS INT) AS string_to_int,
CAST(123 AS STRING) AS int_to_string,
CAST(3.14 AS INT) AS double_to_int,
CAST('2025-01-15' AS DATE) AS string_to_date,
CAST(CURRENT_TIMESTAMP() AS DATE) AS timestamp_to_date;
-- Convert types while creating a table (CREATE TABLE AS SELECT)
CREATE TABLE IF NOT EXISTS cast_example AS
SELECT
id,
CAST(salary AS DECIMAL(10,2)) AS formatted_salary,
CAST(hire_date AS STRING) AS hire_date_str
FROM employees;
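Two behaviors of CAST to INT are worth keeping in mind: fractional numbers are truncated rather than rounded, and a string that cannot be parsed yields NULL instead of an error. The Python sketch below models just those two cases (with `None` standing in for NULL); it does not model integer overflow, and `cast_to_int` is a hypothetical helper.

```python
def cast_to_int(value):
    """Model CAST(value AS INT) for numbers and integer-like strings."""
    if isinstance(value, (int, float)):
        try:
            return int(value)  # int() truncates toward zero: 3.14 -> 3, -3.99 -> -3
        except (ValueError, OverflowError):
            return None        # NaN or infinity has no integer value
    try:
        return int(value)
    except (TypeError, ValueError):
        return None            # unparseable input becomes NULL

print(cast_to_int("123"))
print(cast_to_int(3.14))
print(cast_to_int("abc"))
```

Because failed casts produce NULL silently, it can help to count NULLs after a conversion to detect bad input.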
5. Hive Storage Formats
Hive supports several storage formats, each with its own characteristics and use cases. Choosing the right storage format is essential for optimizing query performance and storage space.
5.1 Comparison of Common Storage Formats
| Storage format | Layout | Compression support | Query performance | Storage footprint | Typical use |
|---|---|---|---|---|---|
| TextFile | Row-based | Yes | Low | Large | Simple data loading; suits log files |
| SequenceFile | Row-based | Yes | Medium | Medium | Intermediate storage format |
| RCFile | Hybrid row-columnar | Yes | Medium | Medium | Batch data processing |
| ORC | Hybrid row-columnar | Yes | High | Small | Large-scale data warehouses, complex queries |
| Parquet | Columnar | Yes | High | Small | OLAP queries, analysis of large data volumes |
| Avro | Row-based | Yes | Medium | Medium | Schema evolution, data exchange |
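The performance gap between row-based and columnar layouts in the table comes largely from how many values a query has to read. The Python sketch below is a toy model, not how ORC or Parquet are implemented: it stores the same three records in both layouts and counts the values touched when averaging one column.

```python
COLUMNS = ("id", "name", "age", "department", "salary")
records = [
    (1, "Alice", 30, "IT", 7000.0),
    (2, "Bob", 25, "HR", 5000.0),
    (3, "Charlie", 35, "IT", 9000.0),
]

# Row layout: all values of one record are stored next to each other.
row_layout = [value for record in records for value in record]
# Column layout: all values of one column are stored next to each other.
column_layout = {name: [record[i] for record in records] for i, name in enumerate(COLUMNS)}

def average_from_rows(flat, width, position):
    """Scan a row-major buffer: every value is read, but only one per record is used."""
    values = flat[position::width]
    return sum(values) / len(values), len(flat)

def average_from_columns(columns, name):
    """Read only the requested column."""
    values = columns[name]
    return sum(values) / len(values), len(values)

row_avg, row_touched = average_from_rows(row_layout, len(COLUMNS), COLUMNS.index("salary"))
col_avg, col_touched = average_from_columns(column_layout, "salary")
print(row_avg, row_touched)
print(col_avg, col_touched)
```

Both layouts give the same answer, but the columnar one reads a fraction of the data, which is why analytical queries over a few columns favor ORC and Parquet.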
5.2 TextFile Format
TextFile is Hive's default storage format. Data is stored as lines of text, one record per line.
-- Create a table using the TextFile format
CREATE TABLE IF NOT EXISTS text_file_table (
id INT,
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
-- Load data from a local file
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE text_file_table;
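A file for a table like this can be produced by any tool that writes one comma-separated record per line. The Python sketch below writes such a file; it rejects fields that contain the delimiter or a newline on the assumption that the plain delimited row format does not interpret CSV-style quoting. `write_delimited` and the temporary path are illustrative.

```python
import os
import tempfile

def write_delimited(rows, path, delimiter=","):
    """Write one record per line, fields joined by delimiter, lines ending in a newline."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for row in rows:
            fields = [str(value) for value in row]
            for field in fields:
                if delimiter in field or "\n" in field:
                    raise ValueError(f"field {field!r} contains a delimiter or newline")
            handle.write(delimiter.join(fields) + "\n")

sample_path = os.path.join(tempfile.mkdtemp(), "data.csv")
write_delimited([(1, "Alice", 30), (2, "Bob", 25)], sample_path)
with open(sample_path, encoding="utf-8") as handle:
    print(handle.read())
```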
5.3 ORC Format
ORC (Optimized Row Columnar) is an optimized hybrid row-columnar storage format that provides efficient compression and good query performance.
-- Create a table using the ORC format
CREATE TABLE IF NOT EXISTS orc_table (
id INT,
name STRING,
age INT,
department STRING,
salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='SNAPPY',
'orc.bloom.filter.columns'='id,name',
'orc.dictionary.key.threshold'='0.8'
);
-- Insert data from another table (assumes employees has these five columns)
INSERT INTO orc_table
SELECT id, name, age, department, salary FROM employees;
-- Query data
SELECT * FROM orc_table WHERE department='IT' AND salary > 50000;
5.4 Parquet Format
Parquet is a columnar storage format suited to OLAP queries and analysis of large data volumes.
-- Create a table using the Parquet format
CREATE TABLE IF NOT EXISTS parquet_table (
id INT,
name STRING,
age INT,
department STRING,
salary DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES (
'parquet.compression'='SNAPPY',
'parquet.block.size'='268435456',
'parquet.page.size'='1048576'
);
-- Insert data from another table (assumes employees has these five columns)
INSERT INTO parquet_table
SELECT id, name, age, department, salary FROM employees;
-- Query data
SELECT department, AVG(salary) AS avg_salary
FROM parquet_table
GROUP BY department;
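5.5 Avro Format
Avro is a row-based binary serialization format whose schemas are defined in JSON. It supports schema evolution and is well suited to data exchange.
-- Create an Avro-backed table (when avro.schema.url is set, the schema file defines the columns)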
CREATE TABLE IF NOT EXISTS avro_table (
id INT,
name STRING,
age INT,
department STRING
)
STORED AS AVRO
TBLPROPERTIES (
'avro.schema.url'='hdfs:///path/to/schema.avsc',
'avro.codec'='snappy'
);
-- Insert data
INSERT INTO avro_table VALUES
(1, 'Alice', 30, 'IT'),
(2, 'Bob', 25, 'HR');
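6. Compression Formats
Hive supports several compression formats that can be combined with storage formats to further reduce storage space and improve query performance.
6.1 Comparison of Common Compression Formats
| Compression format | Compression ratio | Compression speed | Decompression speed | Splittable | Typical use |
|---|---|---|---|---|---|
| Gzip | High | Medium | Medium | No | Archival storage |
| Snappy | Medium | High | High | No | Low-latency queries |
| LZO | Medium | High | High | Yes (with an index) | Large-scale data processing |
| Zlib | High | Medium | Medium | No | Workloads that need a high compression ratio |
| Bzip2 | Very high | Very low | Low | Yes | Archival storage |
The "Splittable" column refers to plain compressed files such as text. Inside container formats such as ORC, Parquet, and SequenceFile, compression is applied per block, so files remain splittable regardless of the codec.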
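The ratio trade-offs in the table can be explored on your own data with Python's standard library, which ships gzip, zlib, and bzip2 codecs (Snappy and LZO need third-party packages and are left out here). The sketch below compresses a repetitive, log-like sample and reports the sizes; actual ratios depend heavily on the data, so treat the output as indicative only.

```python
import bz2
import gzip
import zlib

CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def compressed_sizes(data):
    """Compress data with each codec, verify the round trip, and return sizes in bytes."""
    sizes = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        if decompress(packed) != data:
            raise RuntimeError(f"{name} round trip failed")
        sizes[name] = len(packed)
    return sizes

sample = ("2025-01-15 14:30:45,INFO,user login succeeded\n" * 2000).encode("utf-8")
for name, size in compressed_sizes(sample).items():
    print(f"{name}: {len(sample)} -> {size} bytes ({len(sample) / size:.1f}x)")
```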
6.2 Configuring Compression
-- Set the compression format temporarily (session level)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Specify the compression format when creating a table
CREATE TABLE IF NOT EXISTS compressed_table (
id INT,
name STRING,
age INT
)
STORED AS ORC
TBLPROPERTIES (
'orc.compress'='SNAPPY'
);
-- Data inserted into the table is compressed with the configured codec
INSERT OVERWRITE TABLE compressed_table
SELECT * FROM source_table;
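7. Recommendations for Choosing a Storage Format
- TextFile: suited to simple data loading, data that is reloaded or appended often, or use as an intermediate format (row-level UPDATE and DELETE require transactional ORC tables)
- ORC: suited to large-scale data warehouses that need complex queries and a high compression ratio
- Parquet: suited to OLAP queries, especially those that analyze specific columns
- Avro: suited to data exchange scenarios that need schema evolution
- Snappy compression: suited to scenarios that need fast compression and decompression
- Gzip compression: suited to archival scenarios that need a high compression ratio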
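Practice Exercises
Exercise 1: Using primitive data types
- Create a table containing all primitive data types
- Insert data of different types
- Use various functions to query and convert data types
Exercise 2: Using complex data types
- Create a table containing ARRAY, MAP, and STRUCT columns
- Insert complex-type data
- Use LATERAL VIEW and explode to expand complex types
- Query and filter complex-type data
Exercise 3: Comparing storage formats
- Create three tables using the TextFile, ORC, and Parquet formats
- Insert the same data into all three tables
- Compare the storage space used by the three tables
- Compare the query performance of the three tables
Exercise 4: Configuring compression
- Configure different compression formats
- Compare the compression ratio and query performance of each format
- Choose the compression format that fits your business scenario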
8. Summary
This tutorial covered Hive data types and storage formats, including:
- Primitive data types: numeric, string, boolean, and date/time types
- Complex data types: ARRAY, MAP, STRUCT, and UNIONTYPE
- Data type conversion: implicit and explicit conversion
- Storage formats: TextFile, ORC, Parquet, Avro, and others
- Compression formats: Snappy, Gzip, LZO, and others
- Recommendations for choosing storage and compression formats
After working through this tutorial, you should be able to use Hive's data types and choose suitable storage and compression formats to optimize query performance and storage space.