HBase Distributed Database
1. HBase Introduction
HBase is a distributed, column-oriented database built on top of Hadoop, used to store large-scale structured and semi-structured data. It provides real-time random read/write access and is well suited to online querying and updating of massive datasets.
1.1 HBase Design Goals
- High reliability: data replication ensures that data is not lost when a node fails
- High scalability: storage and processing capacity can be increased simply by adding nodes
- Real-time reads and writes: supports low-latency random read/write operations
- Massive data storage: suitable for storing petabyte-scale data
- Flexible data model: supports dynamic columns and multiple versions of data
1.2 Differences Between HBase and Traditional Databases
| Feature | HBase | Traditional relational database |
|---|---|---|
| Data model | Column-oriented storage, sparse matrix | Row-oriented storage, fixed schema |
| Transaction support | Single-row transactions | Full ACID transactions |
| Query language | HBase Shell, Java API, REST API | SQL |
| Scalability | Horizontal scaling, supports petabyte-scale data | Vertical scaling, limited scalability |
| Latency | Low-latency random reads/writes, even at massive scale | Low latency for indexed queries, but degrades as data volume grows |
| Typical use cases | Real-time reads/writes over massive data, e.g. log analysis, time-series data | Complex queries and transaction processing over structured data |
2. HBase Core Concepts
2.1 Data Model
The HBase data model is a sparse, distributed, persistent, multidimensional sorted map. It can be thought of as one big table, but one that differs substantially from the tables of a traditional relational database.
2.2 Core Terminology
- Table: composed of rows and columns, used to store data
- Row: uniquely identified by a row key; rows in a table are sorted lexicographically by row key
- Row key: the unique identifier of a row, analogous to a primary key in a relational database
- Column family: a collection of columns and the basic unit of physical storage in HBase; all members of a column family share the same name prefix
- Column qualifier: a specific column within a column family; a full column name is the combination of family and qualifier
- Cell: uniquely identified by row key, column family, and column qualifier; stores the actual data value
- Timestamp: each cell can store multiple versions of a value; the timestamp distinguishes the versions
| Row Key | Column Family:Column Qualifier | Timestamp | Value |
|---------|--------------------------------|-----------|-------|
| row1 | cf1:col1 | 123456789 | value1 |
| row1 | cf1:col2 | 123456789 | value2 |
| row1 | cf2:col1 | 123456789 | value3 |
| row2 | cf1:col1 | 123456789 | value4 |
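The "sorted map of maps" idea above can be sketched in plain Java. This is a toy model only, with no HBase dependency; the class name `SparseSortedMapModel` and its `put`/`get` methods are illustrative, but they mirror how a default HBase read returns the newest version of a cell.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical layout: row key -> (column "family:qualifier"
// -> (timestamp -> value)), with versions sorted newest-first.
public class SparseSortedMapModel {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Returns the latest version, mirroring a default HBase Get.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) {
            return null; // sparse: absent cells take up no space at all
        }
        return cols.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        SparseSortedMapModel m = new SparseSortedMapModel();
        m.put("row1", "cf1:col1", 1L, "old");
        m.put("row1", "cf1:col1", 2L, "new");
        System.out.println(m.get("row1", "cf1:col1")); // prints "new"
    }
}
```

Note how sparsity falls out of the structure: a row that never writes a column simply has no entry for it, which is why HBase tables can have millions of potential columns at no cost.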
2.3 Namespaces
A namespace is used to group tables, similar to a database in a relational system. HBase ships with two default namespaces:
- default: the default namespace, which holds user-created tables
- hbase: the system namespace, which holds HBase's internal tables
3. HBase Architecture
HBase uses a master-slave architecture with the following main components:
3.1 HMaster
The HMaster is the master node of HBase. It manages the cluster's metadata and coordinates the cluster. Its main functions include:
- Table management: creating, modifying, and deleting tables
- Region management: assigning, migrating, and merging Regions
- Node management: monitoring RegionServer status and handling RegionServer failures
- Permission management: managing user permissions
3.2 RegionServer
A RegionServer is a worker node of HBase. It handles client read/write requests and manages Regions. Its main functions include:
- Handling client requests: serving client read, write, and scan requests
- Region management: managing the Regions assigned to it
- Data storage: persisting data to HDFS
- Cache management: managing the BlockCache and MemStore
- Compaction and split: performing compaction and split operations
3.3 Region
A Region is the basic unit of data storage in HBase. Each Region stores a contiguous range of a table's rows. When a Region grows beyond a size threshold, it is split into two new Regions.
3.4 ZooKeeper
ZooKeeper coordinates the HBase cluster. Its main functions include:
- HMaster election: when the active HMaster fails, a new active HMaster is elected from the standbys
- RegionServer registration: RegionServers register with ZooKeeper, and the HMaster obtains the RegionServer list through ZooKeeper
- Metadata storage: stores cluster metadata such as the location of the hbase:meta table
- Distributed locks: provides a distributed locking mechanism for coordinating cluster-wide operations
3.5 HDFS
HDFS stores HBase's data files and log files, mainly:
- HFile: HBase's on-disk data storage format, stored in HDFS
- WAL (Write-Ahead Log): the write-ahead log, used for data recovery
- Snapshot files: used for data backup and restore
4. HBase Installation and Configuration
4.1 Prerequisites
Before installing HBase, the following must be installed:
- Java JDK 1.8+
- Hadoop 2.x or 3.x
- Zookeeper 3.4+
4.2 Installation Steps
# 1. Download HBase
$ wget https://dlcdn.apache.org/hbase/2.4.11/hbase-2.4.11-bin.tar.gz
# 2. Extract HBase
$ tar -xzvf hbase-2.4.11-bin.tar.gz
$ mv hbase-2.4.11 /opt/hbase
# 3. Configure environment variables
$ export HBASE_HOME=/opt/hbase
$ export PATH=$PATH:$HBASE_HOME/bin
# 4. Configure HBase
$ cp $HBASE_HOME/conf/hbase-env.sh.template $HBASE_HOME/conf/hbase-env.sh
$ echo "export JAVA_HOME=/opt/java" >> $HBASE_HOME/conf/hbase-env.sh
$ echo "export HBASE_CLASSPATH=/opt/hadoop/etc/hadoop" >> $HBASE_HOME/conf/hbase-env.sh
$ echo "export HBASE_MANAGES_ZK=false" >> $HBASE_HOME/conf/hbase-env.sh
# 5. Configure hbase-site.xml (see the properties below)
$ vim $HBASE_HOME/conf/hbase-site.xml
# 6. Configure regionservers
$ echo "localhost" > $HBASE_HOME/conf/regionservers
# 7. Start HBase
$ start-hbase.sh
# 8. Verify that HBase started
$ jps
# You should see the HMaster and HRegionServer processes

In step 5, hbase-site.xml should contain the following properties:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/zookeeper/data</value>
  </property>
</configuration>
5. HBase Basic Operations
5.1 HBase Shell
The HBase Shell is a command-line tool provided by HBase for interacting with it.
# Start the HBase shell
$ hbase shell
# List all tables
hbase(main):001:0> list
# Create a table
hbase(main):002:0> create 'users', 'cf1', 'cf2'
# View the table schema
hbase(main):003:0> describe 'users'
# Insert data
hbase(main):004:0> put 'users', 'row1', 'cf1:name', 'Alice'
hbase(main):005:0> put 'users', 'row1', 'cf1:age', '30'
hbase(main):006:0> put 'users', 'row1', 'cf2:email', 'alice@example.com'
hbase(main):007:0> put 'users', 'row2', 'cf1:name', 'Bob'
hbase(main):008:0> put 'users', 'row2', 'cf1:age', '28'
hbase(main):009:0> put 'users', 'row2', 'cf2:email', 'bob@example.com'
# Get a single row
hbase(main):010:0> get 'users', 'row1'
# Get data for a specific column family
hbase(main):011:0> get 'users', 'row1', 'cf1'
# Get data for a specific column
hbase(main):012:0> get 'users', 'row1', 'cf1:name'
# Scan the table
hbase(main):013:0> scan 'users'
# Scan a range of rows
hbase(main):014:0> scan 'users', {STARTROW => 'row1', ENDROW => 'row2'}
# Update data
hbase(main):015:0> put 'users', 'row1', 'cf1:age', '31'
# Delete a specific column
hbase(main):016:0> delete 'users', 'row1', 'cf2:email'
# Delete an entire row
hbase(main):017:0> deleteall 'users', 'row2'
# Disable the table
hbase(main):018:0> disable 'users'
# Drop the table
hbase(main):019:0> drop 'users'
# Exit the HBase shell
hbase(main):020:0> exit
5.2 Java API
HBase provides a Java API for writing Java programs that interact with HBase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
public class HBaseExample {
private static final String TABLE_NAME = "users";
private static final String CF1 = "cf1";
private static final String CF2 = "cf2";
private static final String ROW_KEY = "row1";
public static void main(String[] args) throws IOException {
// 1. Create the configuration
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost");
// 2. Connect to HBase
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();
try {
// 3. Create the table if it does not exist
TableName tableName = TableName.valueOf(TABLE_NAME);
if (!admin.tableExists(tableName)) {
TableDescriptorBuilder tableDesc = TableDescriptorBuilder.newBuilder(tableName);
ColumnFamilyDescriptor cf1Desc = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(CF1)).build();
ColumnFamilyDescriptor cf2Desc = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(CF2)).build();
tableDesc.setColumnFamily(cf1Desc);
tableDesc.setColumnFamily(cf2Desc);
admin.createTable(tableDesc.build());
System.out.println("Table created: " + TABLE_NAME);
}
// 4. Get a table instance
Table table = connection.getTable(TableName.valueOf(TABLE_NAME));
try {
// 5. Insert data
Put put = new Put(Bytes.toBytes(ROW_KEY));
put.addColumn(Bytes.toBytes(CF1), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes(CF1), Bytes.toBytes("age"), Bytes.toBytes("30"));
put.addColumn(Bytes.toBytes(CF2), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
table.put(put);
System.out.println("Data inserted");
// 6. Get data
Get get = new Get(Bytes.toBytes(ROW_KEY));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes(CF1), Bytes.toBytes("name"));
byte[] age = result.getValue(Bytes.toBytes(CF1), Bytes.toBytes("age"));
byte[] email = result.getValue(Bytes.toBytes(CF2), Bytes.toBytes("email"));
System.out.println("Name: " + Bytes.toString(name));
System.out.println("Age: " + Bytes.toString(age));
System.out.println("Email: " + Bytes.toString(email));
// 7. Scan the table
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println("Row: " + Bytes.toString(res.getRow()));
for (Cell cell : res.listCells()) {
String cf = Bytes.toString(CellUtil.cloneFamily(cell));
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String value = Bytes.toString(CellUtil.cloneValue(cell));
System.out.println(" " + cf + ":" + qualifier + " = " + value);
}
}
scanner.close();
} finally {
table.close();
}
} finally {
admin.close();
connection.close();
}
}
}
6. HBase Best Practices
6.1 Row Key Design
Row key design is the key to HBase performance tuning. Some row key design best practices:
- Uniqueness: the row key must uniquely identify a row
- Appropriate length: keep row keys short, ideally between 10 and 100 bytes
- Avoid hotspots: row key prefixes should be evenly distributed so that requests do not pile up on a few Regions
- Exploit ordering: rows are stored sorted by row key; design row keys to take advantage of this ordering and improve query efficiency
- Composite row keys: for complex queries, combine fields into the row key, e.g. timestamp + ID
6.2 Column Family Design
- Keep the count small: avoid too many column families; three or fewer is recommended
- Group by access pattern: put columns with similar access patterns in the same column family
- Watch data size: avoid storing very large amounts of data in a single column family; keep each family's data within a reasonable range
6.3 Region Design
- Pre-splitting: pre-split tables at creation time to avoid hotspots
- Region size: set a sensible Region size; 10-50 GB is recommended
- Region merging: periodically merge small Regions to improve query efficiency
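Pre-splitting pairs naturally with salted row keys: if keys start with a two-digit salt prefix, splitting on the bucket boundaries gives each bucket its own Region from day one. The sketch below is plain Java; `splitPoints` is a made-up helper (in practice you would pass the points to the shell's `SPLITS =>` clause or to `Admin.createTable` with split keys).

```java
// Compute split points for a table whose row keys carry a
// two-digit salt prefix "00".."NN".
public class PreSplitKeys {
    public static String[] splitPoints(int regions) {
        String[] points = new String[regions - 1];
        for (int i = 1; i < regions; i++) {
            // One boundary per salt bucket; region i covers prefix i-1.
            points[i - 1] = String.format("%02d", i);
        }
        return points;
    }

    public static void main(String[] args) {
        // e.g. in the shell: create 'users', 'cf1', SPLITS => ['01','02','03']
        for (String p : splitPoints(4)) {
            System.out.println(p);
        }
    }
}
```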
6.4 Performance Optimization
- Enable BlockCache: improves read performance
- Tune MemStore size: adjust the MemStore size to your workload
- Use batch operations: batch Put, Get, and Delete operations to reduce RPC calls
- Enable compression: compress HFiles to reduce storage space and I/O
- Set the WAL appropriately: choose a WAL durability level that matches your data reliability requirements
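On the batching point: the real client accepts a list of mutations (e.g. `Table.put(List<Put>)`), so many writes travel in one round trip instead of one RPC each. The dependency-free sketch below shows only the chunking logic; `chunks` is a hypothetical helper, not an HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Partition a list of pending mutations into fixed-size batches,
// so each batch can be sent in a single RPC.
public class BatchingSketch {
    public static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> puts = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            puts.add("put-" + i);
        }
        // 10 mutations in batches of 4 -> 3 round trips instead of 10
        System.out.println(chunks(puts, 4).size()); // prints 3
    }
}
```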
Hands-On Exercises
Exercise 1: HBase Installation and Configuration
Install and configure HBase using an external ZooKeeper.
Exercise 2: HBase Shell Operations
Use the HBase Shell to complete the following operations:
- Create a table named "students" with two column families, "info" and "grades"
- Insert 3 student records containing name, age, gender, math score, and English score
- Query all information for one of the students
- Query the names and ages of all students
- Update one student's math score
- Delete one student's English score
- Delete all records for one student
- Drop the table
Exercise 3: Java API Operations
Write a Java program that uses the HBase Java API to complete the following operations:
- Create a table named "products" with two column families, "details" and "inventory"
- Insert 5 product records
- Query the names and prices of all products
- Update the inventory of one product
- Drop the table
7. Summary
This tutorial covered the basic concepts, architecture, and usage of the HBase distributed database. As a distributed, column-oriented database built on Hadoop, HBase offers high reliability, high scalability, and real-time read/write access, making it well suited to massive datasets.
HBase's core components are the HMaster, RegionServers, ZooKeeper, and HDFS. The HMaster manages cluster metadata and coordination, RegionServers handle client read/write requests and manage Regions, ZooKeeper coordinates the cluster, and HDFS stores data and log files.
The HBase data model is a sparse, distributed, persistent, multidimensional sorted map that supports dynamic columns and multiple versions of data. HBase offers several ways to interact with it, including the HBase Shell, the Java API, and the REST API.
When using HBase, pay attention to the best practices for row key design, column family design, and Region design to get good performance and reliability. HBase is a good fit for real-time read/write workloads over massive data, such as log analysis, time-series data, and user profiling.