This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Add hbase-stats project to contrib/ #131

Open
wants to merge 1 commit into base: master

143 changes: 143 additions & 0 deletions contrib/hbase-stat/README.md
@@ -0,0 +1,143 @@
# hbase-stat
=============

A simple statistics package for HBase.

## Goal
========

Provide reasonable approximations (exact, if we can manage it) of statistical information about a table in HBase with minimal muss and fuss.

We want to make it easy to gather statistics about your HBase tables - there should be little to no work on the user's part beyond ensuring that the right things are set up.

## Usage
=========

### Cluster setup

The only change that needs to be made to a generic cluster configuration is to add the RemoveTableOnDelete coprocessor to the list of master observers. This coprocessor cleans up the statistics for a table on delete, if that table has statistics 'enabled' (see below). You should only need to add the following to your hbase-site.xml:

```
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>com.salesforce.hbase.stats.cleanup.RemoveTableOnDelete</value>
</property>
```

### Table creation

All the work for gathering and cleaning statistics is handled via coprocessors. Generally, each statistic will have its own static methods for adding the coprocessor to the table (if not provided, the HTableDescriptor#addCoprocessor() method should suffice). For instance, to add the MinMaxKey statistic to a table, all you would do is:

```java
HTableDescriptor primary = …
MinMaxKey.addToTable(primary)
```
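
If a statistic does not provide such a helper, the generic fallback mentioned above is HTableDescriptor#addCoprocessor(). A minimal sketch, where SomeStatistic is a hypothetical tracker class that is itself the coprocessor to attach:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
// SomeStatistic is a placeholder name for a tracker that lacks a static helper;
// attach its coprocessor class directly to the table descriptor
primary.addCoprocessor(SomeStatistic.class.getName());
```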

At the very least, you should ensure that the table is created with the com.salesforce.hbase.stats.cleanup.RemoveRegionOnSplit coprocessor so that when a region is removed (via splits or merges) the stats for that region are also removed. This can be added manually (not recommended) or via the general setup table utility:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
primary.addFamily(new HColumnDescriptor(FAM));

// ...
//Add your own stats here
//...

// setup the stats table
HBaseAdmin admin = UTIL.getHBaseAdmin();
//ensure statistics are enabled and the cleanup coprocessors setup
SetupTableUtil.setupTable(admin, primary, true, false);
```

#### SetupTableUtil

In addition to setting up the cleanup coprocessors, the SetupTableUtil sets the 'stats enabled' flag in the primary table's descriptor. If this flag is enabled, the cleanup coprocessors (RemoveTableOnDelete and RemoveRegionOnSplit) will be enabled for the table.

* NOTE: if the cleanup coprocessors are not added to the table, setting the 'stats enabled' flag manually won't do anything. However, if you manually add the cleanup coprocessors, but don't enable stats on the descriptor, again, no cleanup will take place. It's highly recommended to use the SetupTableUtil to ensure you don't forget either side.
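
For reference, a minimal sketch of the manual path is below; the descriptor key shown is a placeholder (the real 'stats enabled' key is internal to SetupTableUtil), which is exactly why the utility is the recommended route:

```java
// hypothetical manual setup -- prefer SetupTableUtil, which keeps the cleanup
// coprocessors and the 'stats enabled' flag in sync for you
HTableDescriptor primary = new HTableDescriptor("primary");
primary.addFamily(new HColumnDescriptor("fam"));

// add the cleanup coprocessor so splits/merges remove the region's stats
primary.addCoprocessor("com.salesforce.hbase.stats.cleanup.RemoveRegionOnSplit");

// mark the table as having statistics enabled ("stats.enabled" is a placeholder
// key name, not the project's actual constant)
primary.setValue("stats.enabled", "true");
```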

You will also note that the SetupTableUtil has an option to ensure that the statistics table is set up; *it's highly recommended that you use this option* to avoid accidentally forgetting and not having a statistics table when you go to write out statistics. With the wrong write configurations in hbase-site.xml, this could cause the statistic coprocessors to each block until they realize the table doesn't exist.

To use it, you simply do the same as above, but ensure that the "ensureStatsTable" boolean flag is set:

```java
SetupTableUtil.setupTable(admin, primary, true /* this flag! */, false);
```

### Reading Statistics

Since statistics are kept in a single HTable, you could go and manually read them. However, each statistic could potentially have its own serialization and layout. Therefore, it's recommended to use the StatisticReader to read a StatisticsTable. Generally, all you will need to provide to the StatisticReader is the type of statistic (Point or Histogram), the name of the statistic, and the underlying table. For instance, to read a histogram statistic "histo" for all the regions (and all the column families) of a table, you would do:

```java
StatisticsTable stats = …
StatisticReader reader = new StatisticReader(stats, new HistogramStatisticDeserializer(), "histo");
reader.read()
```

However, this is a bit of a pain as each statistic will have its own name and type. Therefore, the standard convention is for each StatisticTracker to provide a getStatisticReader(StatisticsTable) method to read that statistic from the table. For instance, to read the EqualWidthHistogramStatistic, all you need to do is:

```java
StatisticsTable stats = …
StatisticReader reader = EqualWidthHistogramStatistic.getStatisticReader(stats);
reader.read();
```

Some statistics are a little more complicated in the way they store their information, for instance using different column qualifiers at the same time to store different parts of the key. Generally, these should provide their own mechanisms to rebuild a stat from the serialized information. For instance, MinMaxKey provides an interpret method:

```java
StatisticsTable stats = …
StatisticReader<StatisticValue> reader = MinMaxKey.getStatisticReader(stats);
List<MinMaxStat> results = MinMaxKey.interpret(reader.read());
```


### Statistics Table Schema
===========================

The schema was inspired by OpenTSDB (opentsdb.net) where each statistic is first grouped by table, then region. After that, each statistic (MetricValue) is grouped by:
* type
* info
  * this is like the sub-type of the metric to help describe the actual type. For instance, on a min/max for the column, this could be 'min'
* value

Suppose that we have a table called 'primary' with column 'col' and we are using the MinMaxKey statistic. Assuming the table has a single region, entries in the statistics table will look something like:

```
| Row | Column Family | Column Qualifier | Value
| primary<region name>col | STAT | max_region_key | 10
| primary<region name>col | STAT | min_region_key | 3
```

This is because the MinMaxKey statistic uses the column name (in this case 'col') as the type, stores everything in the single column family on the stats table (STAT), and has two subtype (info) elements: max_region_key and min_region_key, each with an associated value.
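
Because this is just a regular HBase table, you can also eyeball the raw rows when debugging. The sketch below is a plain scan over a hypothetical statistics table named "_stats_" (substitute whatever name your deployment actually uses), restricted to the STAT column family and to rows prefixed with the primary table's name; for real reads, prefer the StatisticReader since value encodings are statistic-specific.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class StatsTableScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // "_stats_" is an assumed table name for this sketch
    HTable statsTable = new HTable(conf, "_stats_");
    try {
      // rows for the 'primary' table all share the table name as a key prefix
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("STAT"));
      scan.setFilter(new PrefixFilter(Bytes.toBytes("primary")));
      ResultScanner scanner = statsTable.getScanner(scan);
      try {
        for (Result r : scanner) {
          for (Map.Entry<byte[], byte[]> cell :
              r.getFamilyMap(Bytes.toBytes("STAT")).entrySet()) {
            // value encoding is statistic-specific, so just dump raw bytes
            System.out.println(Bytes.toStringBinary(r.getRow()) + " | "
                + Bytes.toString(cell.getKey()) + " | "
                + Bytes.toStringBinary(cell.getValue()));
          }
        }
      } finally {
        scanner.close();
      }
    } finally {
      statsTable.close();
    }
  }
}
```
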
Contributor

So in this example, would the row look like this?

    Row Key                                             Value

    primary\0some-var-len-region-name\0min_region_key   3
    primary\0some-var-len-region-name\0max_region_key   10

If a column in the PK is variable length, Phoenix expects it to be null terminated. Are region names variable length too?

One thing we'd be after is to be able to query the stats table through Phoenix. It'll definitely make debugging and troubleshooting easier.

Contributor Author

It would look like this:

    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | max_region_key 10
    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | min_region_key 3

Right now the stats reader/writer stuff handles reading it in (albeit it is still a bit overly complicated, IMO). I'd think we could move to using a Phoenix-based reader and writer in the future when we have a configurable writer. I would want to do the configurable writer work in another patch though - that starts to get even more complicated than it already is.


## Requirements
===============

* Java 1.6.0_34 or higher
* HBase-0.94.5 or higher

### If building from source
* Maven 3.X


## Building from source
=======================

From the base (hbase-stat) directory…

To run tests:

    $ mvn clean test

To build a jar:

    $ mvn clean package

and then look in the target/ directory for the built jar.

## Roadmap / TODOs
==================
- Switch statistic cleanup to use a coprocessor-based delete
  - we want to delete an entire prefix, but right now that requires doing a scan and then deleting everything that comes back from the scan (a sketch of this scan-then-delete pattern follows below)

- Enable more fine-grained writing of statistics so different serialization mechanisms can be inserted.
  - most of the plumbing is already there (StatisticReader/Writer), but it still needs to be worked into the cleaner mechanisms
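
For illustration, here is a minimal sketch of the scan-then-delete pattern referenced in the first TODO, assuming a hypothetical statistics table named "_stats_" and using the primary table's name as the row-key prefix to remove:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixCleanup {

  /** Deletes every stats row whose key starts with the given prefix. */
  public static void deleteStatsWithPrefix(Configuration conf, byte[] prefix)
      throws IOException {
    // "_stats_" is an assumed table name for this sketch
    HTable statsTable = new HTable(conf, "_stats_");
    try {
      // scan only the keys under the prefix...
      Scan scan = new Scan(prefix);
      scan.setFilter(new PrefixFilter(prefix));
      ResultScanner scanner = statsTable.getScanner(scan);
      List<Delete> deletes = new ArrayList<Delete>();
      try {
        for (Result r : scanner) {
          deletes.add(new Delete(r.getRow()));
        }
      } finally {
        scanner.close();
      }
      // ...then delete everything that came back in one batch
      statsTable.delete(deletes);
    } finally {
      statsTable.close();
    }
  }

  public static void main(String[] args) throws IOException {
    deleteStatsWithPrefix(HBaseConfiguration.create(), Bytes.toBytes("primary"));
  }
}
```
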
215 changes: 215 additions & 0 deletions contrib/hbase-stat/pom.xml
@@ -0,0 +1,215 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.salesforce.hbase</groupId>
  <artifactId>hbase-stat</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>HBase Stat</name>
  <description>A simple statistics package for HBase</description>

  <repositories>
    <repository>
      <id>apache release</id>
      <url>https://repository.apache.org/content/repositories/releases/</url>
    </repository>
    <repository>
      <id>apache non-releases</id>
      <name>Apache non-releases</name>
      <url>http://people.apache.org/~stack/m2/repository</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
      <!-- Needed to start HBase in tests -->
    </repository>
    <repository>
      <id>codehaus</id>
      <name>Codehaus Public</name>
      <url>http://repository.codehaus.org/</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

  <properties>
    <hbase.version>0.94.5</hbase.version>
    <hadoop.version>1.0.4</hadoop.version>
    <jackson.version>1.8.8</jackson.version>
    <guava.version>12.0.1</guava.version>
    <!-- Test properties -->
    <mockito-all.version>1.8.5</mockito-all.version>
    <junit.version>4.10</junit.version>
    <test.timeout>900</test.timeout>
    <test.output.tofile>true</test.output.tofile>
    <!-- Plugin versions -->
    <surefire.version>2.14</surefire.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>${guava.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <exclusions>
        <exclusion>
          <groupId>hsqldb</groupId>
          <artifactId>hsqldb</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.sf.kosmosfs</groupId>
          <artifactId>kfs</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jdt</groupId>
          <artifactId>core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.java.dev.jets3t</groupId>
          <artifactId>jets3t</artifactId>
        </exclusion>
        <exclusion>
          <groupId>oro</groupId>
          <artifactId>oro</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <!-- Things HBase needs...for some reason -->
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>${jackson.version}</version>
    </dependency>

    <!-- Test Dependencies -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
      <version>${mockito-all.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.0</version>
          <configuration>
            <source>1.6</source>
            <target>1.6</target>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <!-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without running
        tests (this is needed for upstream projects whose tests need this jar simply for
        compilation) -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <executions>
          <execution>
            <phase>prepare-package</phase>
            <goals>
              <goal>test-jar</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <!-- Exclude these 2 packages, because their dependency _binary_ files
            include the sources, and Maven 2.2 appears to add them to the sources to compile,
            weird -->
          <excludes>
            <exclude>org/apache/jute/**</exclude>
            <exclude>org/apache/zookeeper/**</exclude>
            <exclude>**/*.jsp</exclude>
            <exclude>log4j.properties</exclude>
          </excludes>
        </configuration>
      </plugin>
      <!-- Make a source jar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <version>2.2.1</version>
        <executions>
          <execution>
            <id>attach-sources</id>
            <phase>prepare-package</phase>
            <goals>
              <goal>jar-no-fork</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!-- Specialized configuration for running tests -->
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>${surefire.version}</version>
        <configuration>
          <forkedProcessTimeoutInSeconds>${test.timeout}</forkedProcessTimeoutInSeconds>
          <argLine>-enableassertions -Xmx2048m
            -Djava.security.egd=file:/dev/./urandom</argLine>
          <redirectTestOutputToFile>${test.output.tofile}</redirectTestOutputToFile>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>