HBase Distributed Database
1. HBase Introduction
HBase is a distributed, column-oriented database built on top of Hadoop, used to store large-scale structured and semi-structured data. It provides real-time random read/write access and is well suited to online querying and updating of massive datasets.
1.1 HBase Design Goals
- High reliability: data replication ensures that data is not lost when a node fails
- High scalability: storage and processing capacity can be increased simply by adding nodes
- Real-time reads and writes: supports low-latency random read/write operations
- Massive data storage: suitable for storing petabyte-scale data
- Flexible data model: supports dynamic columns and multiple versions of data
1.2 Differences Between HBase and Traditional Databases
| Feature | HBase | Traditional relational database |
|---|---|---|
| Data model | Column-oriented storage, sparse matrix | Row-oriented storage, fixed schema |
| Transaction support | Single-row transactions | Full ACID transactions |
| Query language | HBase Shell, Java API, REST API | SQL |
| Scalability | Horizontal scaling, supports petabyte-scale data | Vertical scaling, limited scalability |
| Latency | Low-latency random reads/writes, even at massive scale | Low latency for indexed queries, but degrades as data volume grows |
| Typical use cases | Real-time reads/writes over massive data, e.g. log analysis, time-series data | Complex queries and transaction processing over structured data |
2. HBase Core Concepts
2.1 Data Model
The HBase data model is a sparse, distributed, persistent, multidimensional sorted map. It can be thought of as one big table, but one that differs substantially from the tables of a traditional relational database.
2.2 Core Terminology
- Table: composed of rows and columns, used to store data
- Row: uniquely identified by a row key; rows in a table are sorted lexicographically by row key
- Row key: the unique identifier of a row, analogous to a primary key in a relational database
- Column family: a collection of columns and the basic unit of physical storage in HBase; all members of a column family share the same name prefix
- Column qualifier: a specific column within a column family; a full column name is the combination of family and qualifier
- Cell: uniquely identified by row key, column family, and column qualifier; stores the actual data value
- Timestamp: each cell can store multiple versions of a value; the timestamp distinguishes the versions
| Row Key | Column Family:Column Qualifier | Timestamp | Value |
|---------|--------------------------------|-----------|-------|
| row1 | cf1:col1 | 123456789 | value1 |
| row1 | cf1:col2 | 123456789 | value2 |
| row1 | cf2:col1 | 123456789 | value3 |
| row2 | cf1:col1 | 123456789 | value4 |
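The "sorted map of maps" idea above can be sketched in plain Java. This is a toy model only, with no HBase dependency; the class name `SparseSortedMapModel` and its `put`/`get` methods are illustrative, but they mirror how a default HBase read returns the newest version of a cell.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of HBase's logical layout: row key -> (column "family:qualifier"
// -> (timestamp -> value)), with versions sorted newest-first.
public class SparseSortedMapModel {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Returns the latest version, mirroring a default HBase Get.
    public String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> cols = rows.get(row);
        if (cols == null || !cols.containsKey(column)) {
            return null; // sparse: absent cells take up no space at all
        }
        return cols.get(column).firstEntry().getValue();
    }

    public static void main(String[] args) {
        SparseSortedMapModel m = new SparseSortedMapModel();
        m.put("row1", "cf1:col1", 1L, "old");
        m.put("row1", "cf1:col1", 2L, "new");
        System.out.println(m.get("row1", "cf1:col1")); // prints "new"
    }
}
```

Note how sparsity falls out of the structure: a row that never writes a column simply has no entry for it, which is why HBase tables can have millions of potential columns at no cost.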
2.3 Namespaces
A namespace is used to group tables, similar to a database in a relational system. HBase ships with two default namespaces:
- default: the default namespace, which holds user-created tables
- hbase: the system namespace, which holds HBase's internal tables
3. HBase Architecture
HBase uses a master-slave architecture with the following main components:
3.1 HMaster
The HMaster is the master node of HBase. It manages the cluster's metadata and coordinates the cluster. Its main functions include:
- Table management: creating, modifying, and deleting tables
- Region management: assigning, migrating, and merging Regions
- Node management: monitoring RegionServer status and handling RegionServer failures
- Permission management: managing user permissions
3.2 RegionServer
A RegionServer is a worker node of HBase. It handles client read/write requests and manages Regions. Its main functions include:
- Handling client requests: serving client read, write, and scan requests
- Region management: managing the Regions assigned to it
- Data storage: persisting data to HDFS
- Cache management: managing the BlockCache and MemStore
- Compaction and split: performing compaction and split operations
3.3 Region
A Region is the basic unit of data storage in HBase. Each Region stores a contiguous range of a table's rows. When a Region grows beyond a size threshold, it is split into two new Regions.
3.4 ZooKeeper
ZooKeeper coordinates the HBase cluster. Its main functions include:
- HMaster election: when the active HMaster fails, a new active HMaster is elected from the standbys
- RegionServer registration: RegionServers register with ZooKeeper, and the HMaster obtains the RegionServer list through ZooKeeper
- Metadata storage: stores cluster metadata such as the location of the hbase:meta table
- Distributed locks: provides a distributed locking mechanism for coordinating cluster-wide operations
3.5 HDFS
HDFS stores HBase's data files and log files, mainly:
- HFile: HBase's on-disk data storage format, stored in HDFS
- WAL (Write-Ahead Log): the write-ahead log, used for data recovery
- Snapshot files: used for data backup and restore
4. HBase Installation and Configuration
4.1 Prerequisites
Before installing HBase, the following must be installed:
- Java JDK 1.8+
- Hadoop 2.x or 3.x
- Zookeeper 3.4+
4.2 Installation Steps
# 1. Download HBase
$ wget https://dlcdn.apache.org/hbase/2.4.11/hbase-2.4.11-bin.tar.gz
# 2. Extract HBase
$ tar -xzvf hbase-2.4.11-bin.tar.gz
$ mv hbase-2.4.11 /opt/hbase
# 3. Configure environment variables
$ export HBASE_HOME=/opt/hbase
$ export PATH=$PATH:$HBASE_HOME/bin
# 4. Configure HBase
$ cp $HBASE_HOME/conf/hbase-env.sh.template $HBASE_HOME/conf/hbase-env.sh
$ echo "export JAVA_HOME=/opt/java" >> $HBASE_HOME/conf/hbase-env.sh
$ echo "export HBASE_CLASSPATH=/opt/hadoop/etc/hadoop" >> $HBASE_HOME/conf/hbase-env.sh
$ echo "export HBASE_MANAGES_ZK=false" >> $HBASE_HOME/conf/hbase-env.sh
# 5. Configure hbase-site.xml (see the properties below)
$ vim $HBASE_HOME/conf/hbase-site.xml
# 6. Configure regionservers
$ echo "localhost" > $HBASE_HOME/conf/regionservers
# 7. Start HBase
$ start-hbase.sh
# 8. Verify that HBase started
$ jps
# You should see the HMaster and HRegionServer processes

In step 5, hbase-site.xml should contain the following properties:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/opt/zookeeper/data</value>
  </property>
</configuration>
5. HBase Basic Operations
5.1 HBase Shell
The HBase Shell is a command-line tool provided by HBase for interacting with it.
# Start the HBase shell
$ hbase shell
# List all tables
hbase(main):001:0> list
# Create a table
hbase(main):002:0> create 'users', 'cf1', 'cf2'
# View the table schema
hbase(main):003:0> describe 'users'
# Insert data
hbase(main):004:0> put 'users', 'row1', 'cf1:name', 'Alice'
hbase(main):005:0> put 'users', 'row1', 'cf1:age', '30'
hbase(main):006:0> put 'users', 'row1', 'cf2:email', 'alice@example.com'
hbase(main):007:0> put 'users', 'row2', 'cf1:name', 'Bob'
hbase(main):008:0> put 'users', 'row2', 'cf1:age', '28'
hbase(main):009:0> put 'users', 'row2', 'cf2:email', 'bob@example.com'
# Get a single row
hbase(main):010:0> get 'users', 'row1'
# Get data for a specific column family
hbase(main):011:0> get 'users', 'row1', 'cf1'
# Get data for a specific column
hbase(main):012:0> get 'users', 'row1', 'cf1:name'
# Scan the table
hbase(main):013:0> scan 'users'
# Scan a range of rows
hbase(main):014:0> scan 'users', {STARTROW => 'row1', ENDROW => 'row2'}
# Update data
hbase(main):015:0> put 'users', 'row1', 'cf1:age', '31'
# Delete a specific column
hbase(main):016:0> delete 'users', 'row1', 'cf2:email'
# Delete an entire row
hbase(main):017:0> deleteall 'users', 'row2'
# Disable the table
hbase(main):018:0> disable 'users'
# Drop the table
hbase(main):019:0> drop 'users'
# Exit the HBase shell
hbase(main):020:0> exit
5.2 Java API
HBase provides a Java API for writing Java programs that interact with HBase.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import java.io.IOException;
public class HBaseExample {
private static final String TABLE_NAME = "users";
private static final String CF1 = "cf1";
private static final String CF2 = "cf2";
private static final String ROW_KEY = "row1";
public static void main(String[] args) throws IOException {
// 1. Create the configuration
Configuration config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "localhost");
// 2. Connect to HBase
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();
try {
// 3. Create the table if it does not exist
TableName tableName = TableName.valueOf(TABLE_NAME);
if (!admin.tableExists(tableName)) {
TableDescriptorBuilder tableDesc = TableDescriptorBuilder.newBuilder(tableName);
ColumnFamilyDescriptor cf1Desc = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(CF1)).build();
ColumnFamilyDescriptor cf2Desc = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes(CF2)).build();
tableDesc.setColumnFamily(cf1Desc);
tableDesc.setColumnFamily(cf2Desc);
admin.createTable(tableDesc.build());
System.out.println("Table created: " + TABLE_NAME);
}
// 4. Get a table instance
Table table = connection.getTable(TableName.valueOf(TABLE_NAME));
try {
// 5. Insert data
Put put = new Put(Bytes.toBytes(ROW_KEY));
put.addColumn(Bytes.toBytes(CF1), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes(CF1), Bytes.toBytes("age"), Bytes.toBytes("30"));
put.addColumn(Bytes.toBytes(CF2), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
table.put(put);
System.out.println("Data inserted");
// 6. Get data
Get get = new Get(Bytes.toBytes(ROW_KEY));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes(CF1), Bytes.toBytes("name"));
byte[] age = result.getValue(Bytes.toBytes(CF1), Bytes.toBytes("age"));
byte[] email = result.getValue(Bytes.toBytes(CF2), Bytes.toBytes("email"));
System.out.println("Name: " + Bytes.toString(name));
System.out.println("Age: " + Bytes.toString(age));
System.out.println("Email: " + Bytes.toString(email));
// 7. Scan the table
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println("Row: " + Bytes.toString(res.getRow()));
for (Cell cell : res.listCells()) {
String cf = Bytes.toString(CellUtil.cloneFamily(cell));
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String value = Bytes.toString(CellUtil.cloneValue(cell));
System.out.println(" " + cf + ":" + qualifier + " = " + value);
}
}
scanner.close();
} finally {
table.close();
}
} finally {
admin.close();
connection.close();
}
}
}
6. HBase Best Practices
6.1 Row Key Design
Row key design is the key to HBase performance tuning. Some row key design best practices:
- Uniqueness: the row key must uniquely identify a row
- Appropriate length: keep row keys short, ideally between 10 and 100 bytes
- Avoid hotspots: row key prefixes should be evenly distributed so that requests do not pile up on a few Regions
- Exploit ordering: rows are stored sorted by row key; design row keys to take advantage of this ordering and improve query efficiency
- Composite row keys: for complex queries, combine fields into the row key, e.g. timestamp + ID
6.2 Column Family Design
- Keep the count small: avoid too many column families; three or fewer is recommended
- Group by access pattern: put columns with similar access patterns in the same column family
- Watch data size: avoid storing very large amounts of data in a single column family; keep each family's data within a reasonable range
6.3 Region Design
- Pre-splitting: pre-split tables at creation time to avoid hotspots
- Region size: set a sensible Region size; 10-50 GB is recommended
- Region merging: periodically merge small Regions to improve query efficiency
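Pre-splitting pairs naturally with salted row keys: if keys start with a two-digit salt prefix, splitting on the bucket boundaries gives each bucket its own Region from day one. The sketch below is plain Java; `splitPoints` is a made-up helper (in practice you would pass the points to the shell's `SPLITS =>` clause or to `Admin.createTable` with split keys).

```java
// Compute split points for a table whose row keys carry a
// two-digit salt prefix "00".."NN".
public class PreSplitKeys {
    public static String[] splitPoints(int regions) {
        String[] points = new String[regions - 1];
        for (int i = 1; i < regions; i++) {
            // One boundary per salt bucket; region i covers prefix i-1.
            points[i - 1] = String.format("%02d", i);
        }
        return points;
    }

    public static void main(String[] args) {
        // e.g. in the shell: create 'users', 'cf1', SPLITS => ['01','02','03']
        for (String p : splitPoints(4)) {
            System.out.println(p);
        }
    }
}
```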
6.4 Performance Optimization
- Enable BlockCache: improves read performance
- Tune MemStore size: adjust the MemStore size to your workload
- Use batch operations: batch Put, Get, and Delete operations to reduce RPC calls
- Enable compression: compress HFiles to reduce storage space and I/O
- Set the WAL appropriately: choose a WAL durability level that matches your data reliability requirements
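On the batching point: the real client accepts a list of mutations (e.g. `Table.put(List<Put>)`), so many writes travel in one round trip instead of one RPC each. The dependency-free sketch below shows only the chunking logic; `chunks` is a hypothetical helper, not an HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Partition a list of pending mutations into fixed-size batches,
// so each batch can be sent in a single RPC.
public class BatchingSketch {
    public static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> puts = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            puts.add("put-" + i);
        }
        // 10 mutations in batches of 4 -> 3 round trips instead of 10
        System.out.println(chunks(puts, 4).size()); // prints 3
    }
}
```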
Hands-On Exercises
Exercise 1: HBase Installation and Configuration
Install and configure HBase using an external ZooKeeper.
Exercise 2: HBase Shell Operations
Use the HBase Shell to complete the following operations:
- Create a table named "students" with two column families, "info" and "grades"
- Insert 3 student records containing name, age, gender, math score, and English score
- Query all information for one of the students
- Query the names and ages of all students
- Update one student's math score
- Delete one student's English score
- Delete all records for one student
- Drop the table
Exercise 3: Java API Operations
Write a Java program that uses the HBase Java API to complete the following operations:
- Create a table named "products" with two column families, "details" and "inventory"
- Insert 5 product records
- Query the names and prices of all products
- Update the inventory of one product
- Drop the table
7. Summary
This tutorial covered the basic concepts, architecture, and usage of the HBase distributed database. As a distributed, column-oriented database built on Hadoop, HBase offers high reliability, high scalability, and real-time read/write access, making it well suited to massive datasets.
HBase's core components are the HMaster, RegionServers, ZooKeeper, and HDFS. The HMaster manages cluster metadata and coordination, RegionServers handle client read/write requests and manage Regions, ZooKeeper coordinates the cluster, and HDFS stores data and log files.
The HBase data model is a sparse, distributed, persistent, multidimensional sorted map that supports dynamic columns and multiple versions of data. HBase offers several ways to interact with it, including the HBase Shell, the Java API, and the REST API.
When using HBase, pay attention to the best practices for row key design, column family design, and Region design to get good performance and reliability. HBase is a good fit for real-time read/write workloads over massive data, such as log analysis, time-series data, and user profiling.