Add hbase-stats project to contrib/ #131

Open

jyates wants to merge 1 commit into forcedotcom:master from jyates:hbase-stat
@@ -0,0 +1,143 @@
# hbase-stat
=============

A simple statistics package for HBase.

## Goal
========

Provide reasonable approximations (exact, if we can manage it) of statistical information about a table in HBase with minimal muss and fuss.

We want to make it easy to gather statistics about your HBase tables - there should be little to no work on the user's part beyond ensuring that the right things are set up.

## Usage
=========

### Cluster setup

The only change that needs to be made to a generic cluster configuration is adding the RemoveTableOnDelete coprocessor to the list of Master Observers. This coprocessor cleans up the statistics for a table on delete, if that table has statistics 'enabled' (see below). You should only need to add the following to your hbase-site.xml:

```
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>com.salesforce.hbase.stats.cleanup.RemoveTableOnDelete</value>
</property>
```
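
The same property can also be applied programmatically, which can be handy when standing up a mini-cluster in tests instead of editing hbase-site.xml. This is only an illustrative sketch, not something the patch requires:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Equivalent of the hbase-site.xml entry above, applied to a Configuration
// (for example, the one handed to a test mini-cluster).
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.coprocessor.master.classes",
    "com.salesforce.hbase.stats.cleanup.RemoveTableOnDelete");
```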

### Table creation

All the work for gathering and cleaning statistics is handled via coprocessors. Generally, each statistic will have its own static methods for adding the coprocessor to the table (if not provided, the HTableDescriptor#addCoprocessor() method should suffice). For instance, to add the MinMaxKey statistic to a table, all you would do is:

```java
HTableDescriptor primary = …
MinMaxKey.addToTable(primary)
```
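
If a particular statistic does not ship a helper like MinMaxKey.addToTable, the generic HTableDescriptor route mentioned above would look roughly like the sketch below; the tracker class name here is only a placeholder for whichever StatisticTracker implementation you actually want to attach:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
// Placeholder class name - substitute the statistic tracker you want to add.
primary.addCoprocessor("com.salesforce.hbase.stats.SomeStatisticTracker");
```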

At the very least, you should ensure that the table is created with the com.salesforce.hbase.stats.cleanup.RemoveRegionOnSplit coprocessor, so that when a region is removed (via splits or merges) the stats for that region are also removed. This can be added manually (not recommended) or via the general table setup utility:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
primary.addFamily(new HColumnDescriptor(FAM));

// ...
// Add your own stats here
// ...

// setup the stats table
HBaseAdmin admin = UTIL.getHBaseAdmin();
// ensure statistics are enabled and the cleanup coprocessors are setup
SetupTableUtil.setupTable(admin, primary, true, false);
```

#### SetupTableUtil

In addition to setting up the cleanup coprocessors, the SetupTableUtil sets the 'stats enabled' flag in the primary table's descriptor. If this flag is enabled, the cleanup coprocessors (RemoveTableOnDelete and RemoveRegionOnSplit) will be enabled for the table.

* NOTE: if the cleanup coprocessors are not added to the table, setting the 'stats enabled' flag manually won't do anything. However, if you manually add the cleanup coprocessors but don't enable stats on the descriptor, again, no cleanup will take place. It's highly recommended to use the SetupTableUtil to ensure you don't forget either side.

You will also note that the SetupTableUtil has an option to ensure that the statistics table is set up; *it's highly recommended that you use this option* to avoid accidentally forgetting it and not having a statistics table when you go to write out statistics. With the wrong write configurations in hbase-site.xml, this could cause the statistic coprocessors to each block until they realize the table doesn't exist.

To use it, you simply do the same as above, but ensure that the "ensureStatsTable" boolean flag is set:

```java
SetupTableUtil.setupTable(admin, primary, true /* this flag! */, false);
```

### Reading Statistics

Since statistics are kept in a single HTable, you could go and manually read them. However, each statistic could potentially have its own serialization and layout. Therefore, it's recommended to use the StatisticReader to read a StatisticsTable. Generally, all you need to provide the StatisticReader is the type of statistic (Point or Histogram), the name of the statistic, and the underlying table. For instance, to read a histogram statistic "histo" for all the regions (and all the column families) of a table, you would do:

```java
StatisticsTable stats = …
StatisticReader reader = new StatisticReader(stats, new HistogramStatisticDeserializer(), "histo");
reader.read();
```

However, this is a bit of a pain, as each statistic will have its own name and type. Therefore, the standard convention is for each StatisticTracker to provide a getStatisticReader(StatisticsTable) method to read that statistic from the table. For instance, to read the EqualWidthHistogramStatistic, all you need to do is:

```java
StatisticsTable stats = …
StatisticReader reader = EqualWidthHistogramStatistic.getStatisticsReader(stats);
reader.read();
```

Some statistics are a little more complicated in the way they store their information, for instance using several column qualifiers at the same time to store different parts of the key. Generally, these should provide their own mechanisms to rebuild a stat from the serialized information. For instance, MinMaxKey provides an interpret method:

```java
StatisticsTable stats = …
StatisticReader<StatisticValue> reader = MinMaxKey.getStatisticReader(stats);
List<MinMaxStat> results = MinMaxKey.interpret(reader.read());
```

### Statistics Table Schema
===========================

The schema was inspired by OpenTSDB (opentsdb.net), where each statistic is first grouped by table, then by region. After that, each statistic (MetricValue) is grouped by:
* type
* info
  * this is like the sub-type of the metric, to help describe the actual type. For instance, on a min/max for the column, this could be 'min'
* value

Suppose that we have a table called 'primary' with column 'col' and we are using the MinMaxKey statistic. Assuming the table has a single region, entries in the statistics table will look something like:

```
| Row                     | Column Family | Column Qualifier | Value |
| primary<region name>col | STAT          | max_region_key   | 10    |
| primary<region name>col | STAT          | min_region_key   | 3     |
```

This is because the MinMaxKey statistic uses the column name (in this case 'col') as the type, the values go into the only column family on the stats table (STAT), and there are two subtypes - info elements - max_region_key and min_region_key, each with an associated value.
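
To make that layout concrete, one could also inspect these rows with a plain HBase client scan. The sketch below is purely illustrative and not part of the patch: the statistics table name ("_stats_") is a placeholder for whatever name the StatisticsTable actually uses, and in practice the StatisticReader described above is the recommended way to read stats.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
// "_stats_" is a placeholder; use the real statistics table name.
HTable statsTable = new HTable(conf, "_stats_");
Scan scan = new Scan();
// Row keys start with the primary table name, so a prefix filter
// restricts the scan to that table's statistics.
scan.setFilter(new PrefixFilter(Bytes.toBytes("primary")));
ResultScanner scanner = statsTable.getScanner(scan);
for (Result r : scanner) {
  for (KeyValue kv : r.raw()) {
    System.out.println(Bytes.toStringBinary(kv.getRow()) + " | "
        + Bytes.toString(kv.getFamily()) + " | "
        + Bytes.toString(kv.getQualifier()) + " = "
        + Bytes.toStringBinary(kv.getValue()));
  }
}
scanner.close();
statsTable.close();
```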

## Requirements
===============

* Java 1.6.0_34 or higher
* HBase-0.94.5 or higher

### If building from source
* Maven 3.X

## Building from source
=======================

From the base (hbase-stat) directory…

To run tests

    $ mvn clean test

To build a jar

    $ mvn clean package

and then look in the target/ directory for the built jar

## Roadmap / TODOs
==================
- Switch statistic cleanup to use a coprocessor-based delete
  - we want to delete an entire prefix, but that first requires doing a scan and then deleting everything back from the scan
- Enable more fine-grained writing of statistics so different serialization mechanisms can be inserted
  - most of the plumbing is already there (StatisticReader/Writer), but it still needs to be worked into the cleaner mechanisms
@@ -0,0 +1,215 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.salesforce.hbase</groupId>
  <artifactId>hbase-stat</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>HBase Stat</name>
  <description>A simple statistics package for HBase</description>

  <repositories>
    <repository>
      <id>apache release</id>
      <url>https://repository.apache.org/content/repositories/releases/</url>
    </repository>
    <repository>
      <id>apache non-releases</id>
      <name>Apache non-releases</name>
      <url>http://people.apache.org/~stack/m2/repository</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
      <!-- Needed to start HBase in tests -->
    </repository>
    <repository>
      <id>codehaus</id>
      <name>Codehaus Public</name>
      <url>http://repository.codehaus.org/</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

  <properties>
    <hbase.version>0.94.5</hbase.version>
    <hadoop.version>1.0.4</hadoop.version>
    <jackson.version>1.8.8</jackson.version>
    <guava.version>12.0.1</guava.version>
    <!-- Test properties -->
    <mockito-all.version>1.8.5</mockito-all.version>
    <junit.version>4.10</junit.version>
    <test.timeout>900</test.timeout>
    <test.output.tofile>true</test.output.tofile>
    <!-- Plugin versions -->
    <surefire.version>2.14</surefire.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>${guava.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <exclusions>
        <exclusion>
          <groupId>hsqldb</groupId>
          <artifactId>hsqldb</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.sf.kosmosfs</groupId>
          <artifactId>kfs</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jdt</groupId>
          <artifactId>core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.java.dev.jets3t</groupId>
          <artifactId>jets3t</artifactId>
        </exclusion>
        <exclusion>
          <groupId>oro</groupId>
          <artifactId>oro</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <!-- Things HBase needs...for some reason -->
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>${jackson.version}</version>
    </dependency>

    <!-- Test Dependencies -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
      <version>${mockito-all.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.0</version>
          <configuration>
            <source>1.6</source>
            <target>1.6</target>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <!-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without running
        tests (this is needed for upstream projects whose tests need this jar simply for
        compilation) -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <executions>
          <execution>
            <phase>prepare-package</phase>
            <goals>
              <goal>test-jar</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <!-- Exclude these 2 packages, because their dependency _binary_ files
            include the sources, and Maven 2.2 appears to add them to the sources to compile,
            weird -->
          <excludes>
            <exclude>org/apache/jute/**</exclude>
            <exclude>org/apache/zookeeper/**</exclude>
            <exclude>**/*.jsp</exclude>
            <exclude>log4j.properties</exclude>
          </excludes>
        </configuration>
      </plugin>
      <!-- Make a source jar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <version>2.2.1</version>
        <executions>
          <execution>
            <id>attach-sources</id>
            <phase>prepare-package</phase>
            <goals>
              <goal>jar-no-fork</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!-- Specialized configuration for running tests -->
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>${surefire.version}</version>
        <configuration>
          <forkedProcessTimeoutInSeconds>${test.timeout}</forkedProcessTimeoutInSeconds>
          <argLine>-enableassertions -Xmx2048m
            -Djava.security.egd=file:/dev/./urandom</argLine>
          <redirectTestOutputToFile>${test.output.tofile}</redirectTestOutputToFile>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
So in this example, would the row look like this?

    Row Key                                            | Value
    primary\0some-var-len-region-name\0min_region_key  | 3
    primary\0some-var-len-region-name\0max_region_key  | 10

If a column in the PK is variable length, Phoenix expects it to be null terminated. Are region names variable length too?

One thing we'd be after is to be able to query the stats table through Phoenix. It'll definitely make debugging and troubleshooting easier.
It would look like this:

    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | max_region_key | 10
    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | min_region_key | 3

Right now the stats reader/writer stuff handles reading it in (albeit it is still a bit overly complicated, IMO). I'd think we could move to using a Phoenix-based reader and writer in the future, once we have a configurable writer. I would want to do the configurable writer work in another patch though - that starts to get even more complicated than it already is.