This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Performance

mujtabachohan edited this page May 15, 2013 · 25 revisions

Phoenix follows the philosophy of bringing the computation to the data by using:

  • coprocessors to perform operations on the server side, thus minimizing client/server data transfer
  • custom filters to prune data as close to the source as possible

In addition, to minimize any startup costs, Phoenix uses native HBase APIs rather than going through the map/reduce framework.
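The effect of pushing work to the server can be illustrated with a toy model. This is only a sketch, not Phoenix code: the "regions" are plain Python lists, and the function names `client_side_count` and `server_side_count` are hypothetical. It compares how many items cross the wire when the client filters rows itself versus when each region returns a single partial count.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def client_side_count(regions: List[List[Row]],
                      predicate: Callable[[Row], bool]) -> Tuple[int, int]:
    """Ship every row to the client, then filter and count there."""
    transferred = 0
    count = 0
    for region in regions:
        for row in region:
            transferred += 1  # every row crosses the wire
            if predicate(row):
                count += 1
    return count, transferred


def server_side_count(regions: List[List[Row]],
                      predicate: Callable[[Row], bool]) -> Tuple[int, int]:
    """Each region filters and counts locally; only partial counts are shipped."""
    partials = [sum(1 for row in region if predicate(row)) for region in regions]
    return sum(partials), len(partials)  # one number per region crosses the wire


# Four toy regions of 1,000 rows each; 10% of the rows have c1 == 1.
regions = [[{"c1": i % 10} for i in range(start, start + 1000)]
           for start in range(0, 4000, 1000)]


def matches(row: Row) -> bool:
    return row["c1"] == 1


print(client_side_count(regions, matches))  # (400, 4000)
print(server_side_count(regions, matches))  # (400, 4)
```

Both approaches return the same count; only the amount of data transferred differs.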

Performance improvements in Phoenix 1.2

Essential Column Family

The Phoenix query filter leverages the [HBase Filter Essential Column Family](http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html#isFamilyEssential(byte[])) feature, which improves performance when a Phoenix query filters on data that is split across multiple column families (cf) by loading only the essential cf in the first pass. In the second pass, all cf are loaded as needed.

Consider the following schema, in which data is split across two cf:
create table t (k varchar not null primary key, a.c1 integer, b.c2 varchar, b.c3 varchar, b.c4 varchar)

Running a query similar to the following shows a significant performance improvement when only a subset of rows matches the filter:
select count(c2) from t where c1=1

The following chart shows in-memory query performance for the above query with 10M rows on 4 region servers when 10% of the rows match the filter. Note: cf-a is approx. 8 bytes wide and cf-b is approx. 400 bytes wide.

Ess. CF
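The two-pass idea can be sketched with a small cost model. This is an illustration under stated assumptions rather than HBase or Phoenix code: the function names are hypothetical, the rows are in-memory dictionaries, and the per-family widths (8 and 400 bytes) are taken from the note above. The sketch counts how many bytes each strategy would have to load.

```python
from typing import Dict, List, Tuple

CF_A_BYTES = 8    # approx. width of cf-a (holds c1), per the note above
CF_B_BYTES = 400  # approx. width of cf-b (holds c2, c3, c4)

Row = Dict[str, object]


def scan_all_families(rows: List[Row]) -> Tuple[int, int]:
    """Load both column families for every row, then apply the filter."""
    bytes_read = 0
    count = 0
    for row in rows:
        bytes_read += CF_A_BYTES + CF_B_BYTES
        if row["c1"] == 1 and row["c2"] is not None:
            count += 1
    return count, bytes_read


def scan_essential_first(rows: List[Row]) -> Tuple[int, int]:
    """Load only the essential family (cf-a) first; load cf-b only on a match."""
    bytes_read = 0
    count = 0
    for row in rows:
        bytes_read += CF_A_BYTES          # first pass: essential cf only
        if row["c1"] == 1:
            bytes_read += CF_B_BYTES      # second pass: remaining cf as needed
            if row["c2"] is not None:
                count += 1
    return count, bytes_read


# 1,000 toy rows; 10% of them satisfy c1 = 1.
rows = [{"c1": 1 if i % 10 == 0 else 0, "c2": "x"} for i in range(1000)]

print(scan_all_families(rows))     # (100, 408000)
print(scan_essential_first(rows))  # (100, 48000)
```

In this model the saving grows as the filter becomes more selective and as the non-essential family becomes wider.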

Skip Scan

The Skip Scan leverages the SEEK_NEXT_USING_HINT return code of the HBase Filter, which lets the scan jump directly to the next key of interest. It significantly improves point queries over key columns.

Consider the following schema, in which data is split across two cf:
create table t (k varchar not null primary key, a.c1 integer, b.c2 varchar, b.c3 varchar)

Running a query similar to the following shows a significant performance improvement when only a subset of rows matches the filter:
select count(c1) from t where k in (1% random k's)

The following chart shows in-memory query performance for the above query with 10M rows on 4 region servers when 1% of the keys, chosen at random over the entire key range, are passed in the query's IN clause. Note: all varchar columns are approx. 15 bytes.

SkipScan
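The seek-with-hint behaviour can be approximated in a few lines. The following is a minimal sketch, not the Phoenix implementation: it assumes the row keys are held in a sorted Python list, uses `bisect_left` as a stand-in for the seek hint, and picks evenly spaced keys instead of random ones so that the result is deterministic. The function names are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def full_scan(sorted_keys: List[str], wanted: List[str]) -> Tuple[List[str], int]:
    """Examine every row and keep those whose key is in the IN list."""
    wanted_set = set(wanted)
    examined = 0
    hits = []
    for key in sorted_keys:
        examined += 1
        if key in wanted_set:
            hits.append(key)
    return hits, examined


def skip_scan(sorted_keys: List[str], wanted: List[str]) -> Tuple[List[str], int]:
    """Seek directly to each wanted key instead of stepping row by row."""
    hits = []
    examined = 0
    pos = 0
    for target in sorted(set(wanted)):
        # Stand-in for the seek hint: jump to the first key >= target.
        pos = bisect_left(sorted_keys, target, pos)
        if pos == len(sorted_keys):
            break  # ran off the end of the table
        examined += 1
        if sorted_keys[pos] == target:
            hits.append(target)
            pos += 1
    return hits, examined


# 10,000 sorted keys; the IN list holds 1% of them.
keys = [f"k{i:05d}" for i in range(10000)]
wanted = [f"k{i:05d}" for i in range(0, 10000, 100)]

full_hits, full_examined = full_scan(keys, wanted)
skip_hits, skip_examined = skip_scan(keys, wanted)
print(len(full_hits), full_examined)  # 100 10000
print(len(skip_hits), skip_examined)  # 100 100
```

Both scans return the same rows; the skip scan examines only the rows it seeks to.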

Salting

Salting improves both read and write performance by adding an extra hash byte at the start of the row key and pre-splitting the data into a number of regions. This eliminates hot-spotting of a single region server or a few region servers. Read more about this feature here.

Consider the following schema, in which the table is split into 4 regions running on a cluster of 4 region servers:
CREATE TABLE T (HOST CHAR(2) NOT NULL, DOMAIN VARCHAR NOT NULL, FEATURE VARCHAR NOT NULL, DATE DATE NOT NULL, USAGE.CORE BIGINT, USAGE.DB BIGINT, STATS.ACTIVE_VISITOR INTEGER CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)) SALT_BUCKETS=4

The following chart compares write performance with and without salting.

Salted
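The mechanics of salting can be sketched as follows. This is an illustration only: the byte-sum hash is a deliberately simple stand-in and is an assumption, not the hash function Phoenix uses, and `salted_key` is a hypothetical helper. The sketch shows how a one-byte prefix derived from the key spreads monotonically increasing keys across 4 buckets.

```python
from collections import Counter

SALT_BUCKETS = 4  # matches the table definition above


def salted_key(row_key: bytes, buckets: int = SALT_BUCKETS) -> bytes:
    """Prefix the row key with a one-byte salt derived from the key itself."""
    # Stand-in hash (assumption): sum of the key bytes modulo the bucket count.
    salt = sum(row_key) % buckets
    return bytes([salt]) + row_key


# Monotonically increasing keys, which would otherwise all land in one region.
keys = [f"host{i:06d}".encode() for i in range(10000)]

# Count how many keys fall into each salt bucket (the first byte of the salted key).
per_bucket = Counter(salted_key(k)[0] for k in keys)
print(sorted(per_bucket.items()))
```

Because the salt is computed from the key, the same row key always maps to the same bucket, so a point lookup can recompute the prefix, while a range scan has to cover all buckets.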


Phoenix vs related products

Below are charts showing relative performance between Phoenix and some other related products.

Phoenix vs Hive (running over HDFS and HBase)

Phoenix vs Hive
Query: select count(1) from table over 10M and 100M rows. Data is 5 narrow columns. Number of Region Servers: 4 (HBase heap: 10GB, Processor: 6 cores @ 3.3GHz Xeon)

Phoenix vs Impala (running over HBase)

Phoenix vs Impala
Query: select count(1) from table over 1M and 5M rows. Data is 3 narrow columns. Number of Region Servers: 1 (Virtual Machine, HBase heap: 2GB, Processor: 2 cores @ 3.3GHz Xeon)

Phoenix vs OpenTSDB

Phoenix vs OpenTSDB
Query: select sum(value) from table. Data contains 15M datapoints. Number of Region Servers: 1 (Local, HBase heap: 2GB, Processor: 6 cores @ 3.3GHz Xeon)
