Add hbase-stats project to contrib/ #131

Open

jyates wants to merge 1 commit into forcedotcom:master from jyates:hbase-stat
@@ -0,0 +1,143 @@
# hbase-stat
=============

A simple statistics package for HBase.

## Goal
========

Provide reasonable approximations (exact, if we can manage it) of statistical information about a table in HBase with minimal muss and fuss.

We want to make it easy to gather statistics about your HBase tables - there should be little to no work on the user's part beyond ensuring that the right things are set up.

## Usage
=========

### Cluster setup

The only change that needs to be made to a generic cluster configuration is adding the RemoveTableOnDelete coprocessor to the list of Master Observers. This coprocessor cleans up the statistics for a table on delete, if that table has statistics 'enabled' (see below). You should only need to add the following to your hbase-site.xml:

```
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>com.salesforce.hbase.stats.cleanup.RemoveTableOnDelete</value>
</property>
```
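
The same property can also be applied programmatically, which can be handy when standing up a mini-cluster in tests instead of editing hbase-site.xml. This is only an illustrative sketch, not something the patch requires:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Equivalent of the hbase-site.xml entry above, applied to a Configuration
// (for example, the one handed to a test mini-cluster).
Configuration conf = HBaseConfiguration.create();
conf.set("hbase.coprocessor.master.classes",
    "com.salesforce.hbase.stats.cleanup.RemoveTableOnDelete");
```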

### Table creation

All the work for gathering and cleaning statistics is handled via coprocessors. Generally, each statistic will have its own static methods for adding the coprocessor to the table (if not provided, the HTableDescriptor#addCoprocessor() method should suffice). For instance, to add the MinMaxKey statistic to a table, all you would do is:

```java
HTableDescriptor primary = …
MinMaxKey.addToTable(primary)
```
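
If a particular statistic does not ship a helper like MinMaxKey.addToTable, the generic HTableDescriptor route mentioned above would look roughly like the sketch below; the tracker class name here is only a placeholder for whichever StatisticTracker implementation you actually want to attach:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
// Placeholder class name - substitute the statistic tracker you want to add.
primary.addCoprocessor("com.salesforce.hbase.stats.SomeStatisticTracker");
```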

At the very least, you should ensure that the table is created with the com.salesforce.hbase.stats.cleanup.RemoveRegionOnSplit coprocessor, so that when a region is removed (via splits or merges) the stats for that region are also removed. This can be added manually (not recommended) or via the general table setup utility:

```java
HTableDescriptor primary = new HTableDescriptor("primary");
primary.addFamily(new HColumnDescriptor(FAM));

// ...
// Add your own stats here
// ...

// setup the stats table
HBaseAdmin admin = UTIL.getHBaseAdmin();
// ensure statistics are enabled and the cleanup coprocessors are setup
SetupTableUtil.setupTable(admin, primary, true, false);
```

#### SetupTableUtil

In addition to setting up the cleanup coprocessors, the SetupTableUtil sets the 'stats enabled' flag in the primary table's descriptor. If this flag is enabled, the cleanup coprocessors (RemoveTableOnDelete and RemoveRegionOnSplit) will be enabled for the table.

* NOTE: if the cleanup coprocessors are not added to the table, setting the 'stats enabled' flag manually won't do anything. However, if you manually add the cleanup coprocessors but don't enable stats on the descriptor, again, no cleanup will take place. It's highly recommended to use the SetupTableUtil to ensure you don't forget either side.

You will also note that the SetupTableUtil has an option to ensure that the statistics table is set up; *it's highly recommended that you use this option* to avoid accidentally forgetting it and not having a statistics table when you go to write out statistics. With the wrong write configurations in hbase-site.xml, this could cause the statistic coprocessors to each block until they realize the table doesn't exist.

To use it, you simply do the same as above, but ensure that the "ensureStatsTable" boolean flag is set:

```java
SetupTableUtil.setupTable(admin, primary, true /* this flag! */, false);
```

### Reading Statistics

Since statistics are kept in a single HTable, you could go and manually read them. However, each statistic could potentially have its own serialization and layout. Therefore, it's recommended to use the StatisticReader to read a StatisticsTable. Generally, all you need to provide the StatisticReader is the type of statistic (Point or Histogram), the name of the statistic, and the underlying table. For instance, to read a histogram statistic "histo" for all the regions (and all the column families) of a table, you would do:

```java
StatisticsTable stats = …
StatisticReader reader = new StatisticReader(stats, new HistogramStatisticDeserializer(), "histo");
reader.read();
```

However, this is a bit of a pain, as each statistic will have its own name and type. Therefore, the standard convention is for each StatisticTracker to provide a getStatisticReader(StatisticsTable) method to read that statistic from the table. For instance, to read the EqualWidthHistogramStatistic, all you need to do is:

```java
StatisticsTable stats = …
StatisticReader reader = EqualWidthHistogramStatistic.getStatisticsReader(stats);
reader.read();
```

Some statistics are a little more complicated in the way they store their information, for instance using several column qualifiers at the same time to store different parts of the key. Generally, these should provide their own mechanisms to rebuild a stat from the serialized information. For instance, MinMaxKey provides an interpret method:

```java
StatisticsTable stats = …
StatisticReader<StatisticValue> reader = MinMaxKey.getStatisticReader(stats);
List<MinMaxStat> results = MinMaxKey.interpret(reader.read());
```

### Statistics Table Schema
===========================

The schema was inspired by OpenTSDB (opentsdb.net), where each statistic is first grouped by table, then by region. After that, each statistic (MetricValue) is grouped by:
* type
* info
  * this is like the sub-type of the metric, to help describe the actual type. For instance, on a min/max for the column, this could be 'min'
* value

Suppose that we have a table called 'primary' with column 'col' and we are using the MinMaxKey statistic. Assuming the table has a single region, entries in the statistics table will look something like:

```
| Row                     | Column Family | Column Qualifier | Value |
| primary<region name>col | STAT          | max_region_key   | 10    |
| primary<region name>col | STAT          | min_region_key   | 3     |
```

This is because the MinMaxKey statistic uses the column name (in this case 'col') as the type, the values go into the only column family on the stats table (STAT), and there are two subtypes - info elements - max_region_key and min_region_key, each with an associated value.
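
To make that layout concrete, one could also inspect these rows with a plain HBase client scan. The sketch below is purely illustrative and not part of the patch: the statistics table name ("_stats_") is a placeholder for whatever name the StatisticsTable actually uses, and in practice the StatisticReader described above is the recommended way to read stats.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
// "_stats_" is a placeholder; use the real statistics table name.
HTable statsTable = new HTable(conf, "_stats_");
Scan scan = new Scan();
// Row keys start with the primary table name, so a prefix filter
// restricts the scan to that table's statistics.
scan.setFilter(new PrefixFilter(Bytes.toBytes("primary")));
ResultScanner scanner = statsTable.getScanner(scan);
for (Result r : scanner) {
  for (KeyValue kv : r.raw()) {
    System.out.println(Bytes.toStringBinary(kv.getRow()) + " | "
        + Bytes.toString(kv.getFamily()) + " | "
        + Bytes.toString(kv.getQualifier()) + " = "
        + Bytes.toStringBinary(kv.getValue()));
  }
}
scanner.close();
statsTable.close();
```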

## Requirements
===============

* Java 1.6.0_34 or higher
* HBase-0.94.5 or higher

### If building from source
* Maven 3.X

## Building from source
=======================

From the base (hbase-stat) directory…

To run tests

    $ mvn clean test

To build a jar

    $ mvn clean package

and then look in the target/ directory for the built jar

## Roadmap / TODOs
==================
- Switch statistic cleanup to use a coprocessor-based delete
  - we want to delete an entire prefix, but that first requires doing a scan and then deleting everything back from the scan
- Enable more fine-grained writing of statistics so different serialization mechanisms can be inserted
  - most of the plumbing is already there (StatisticReader/Writer), but it still needs to be worked into the cleaner mechanisms
@@ -0,0 +1,215 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.salesforce.hbase</groupId>
  <artifactId>hbase-stat</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>HBase Stat</name>
  <description>A simple statistics package for HBase</description>

  <repositories>
    <repository>
      <id>apache release</id>
      <url>https://repository.apache.org/content/repositories/releases/</url>
    </repository>
    <repository>
      <id>apache non-releases</id>
      <name>Apache non-releases</name>
      <url>http://people.apache.org/~stack/m2/repository</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
      <!-- Needed to start HBase in tests -->
    </repository>
    <repository>
      <id>codehaus</id>
      <name>Codehaus Public</name>
      <url>http://repository.codehaus.org/</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
      <releases>
        <enabled>true</enabled>
      </releases>
    </repository>
  </repositories>

  <properties>
    <hbase.version>0.94.5</hbase.version>
    <hadoop.version>1.0.4</hadoop.version>
    <jackson.version>1.8.8</jackson.version>
    <guava.version>12.0.1</guava.version>
    <!-- Test properties -->
    <mockito-all.version>1.8.5</mockito-all.version>
    <junit.version>4.10</junit.version>
    <test.timeout>900</test.timeout>
    <test.output.tofile>true</test.output.tofile>
    <!-- Plugin versions -->
    <surefire.version>2.14</surefire.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>${guava.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <exclusions>
        <exclusion>
          <groupId>hsqldb</groupId>
          <artifactId>hsqldb</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.sf.kosmosfs</groupId>
          <artifactId>kfs</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jdt</groupId>
          <artifactId>core</artifactId>
        </exclusion>
        <exclusion>
          <groupId>net.java.dev.jets3t</groupId>
          <artifactId>jets3t</artifactId>
        </exclusion>
        <exclusion>
          <groupId>oro</groupId>
          <artifactId>oro</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <!-- Things HBase needs...for some reason -->
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-jaxrs</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-xc</artifactId>
      <version>${jackson.version}</version>
    </dependency>

    <!-- Test Dependencies -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>${hbase.version}</version>
      <type>test-jar</type>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-test</artifactId>
      <version>${hadoop.version}</version>
      <optional>true</optional>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>${junit.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
      <version>${mockito-all.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.0</version>
          <configuration>
            <source>1.6</source>
            <target>1.6</target>
          </configuration>
        </plugin>
      </plugins>
    </pluginManagement>
    <plugins>
      <!-- Run with -Dmaven.test.skip.exec=true to build -tests.jar without running
        tests (this is needed for upstream projects whose tests need this jar simply for
        compilation) -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.4</version>
        <executions>
          <execution>
            <phase>prepare-package</phase>
            <goals>
              <goal>test-jar</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <!-- Exclude these 2 packages, because their dependency _binary_ files
            include the sources, and Maven 2.2 appears to add them to the sources to compile,
            weird -->
          <excludes>
            <exclude>org/apache/jute/**</exclude>
            <exclude>org/apache/zookeeper/**</exclude>
            <exclude>**/*.jsp</exclude>
            <exclude>log4j.properties</exclude>
          </excludes>
        </configuration>
      </plugin>
      <!-- Make a source jar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-source-plugin</artifactId>
        <version>2.2.1</version>
        <executions>
          <execution>
            <id>attach-sources</id>
            <phase>prepare-package</phase>
            <goals>
              <goal>jar-no-fork</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!-- Specialized configuration for running tests -->
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>${surefire.version}</version>
        <configuration>
          <forkedProcessTimeoutInSeconds>${test.timeout}</forkedProcessTimeoutInSeconds>
          <argLine>-enableassertions -Xmx2048m
            -Djava.security.egd=file:/dev/./urandom</argLine>
          <redirectTestOutputToFile>${test.output.tofile}</redirectTestOutputToFile>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
So in this example, would the row look like this?

    Row Key                                            | Value
    primary\0some-var-len-region-name\0min_region_key  | 3
    primary\0some-var-len-region-name\0max_region_key  | 10

If a column in the PK is variable length, Phoenix expects it to be null terminated. Are region names variable length too?

One thing we'd be after is to be able to query the stats table through Phoenix. It'll definitely make debugging and troubleshooting easier.
It would look like this:

    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | max_region_key | 10
    primary\0some-var-len-region-name\0some-var-length-column-name | STAT | min_region_key | 3

Right now the stats reader/writer stuff handles reading it in (albeit it is still a bit overly complicated, IMO). I'd think we could move to using a Phoenix-based reader and writer in the future, once we have a configurable writer. I would want to do the configurable writer work in another patch though - that starts to get even more complicated than it already is.