Secondary Indexing


Secondary indexes are an orthogonal way to access data from its primary access path. In HBase, you have a single index that is lexicographically sorted on the primary row key. Accessing records in any way other than through the primary row key requires a full table scan, unless you have an index on the columns you are querying. Phoenix is particularly powerful in that it provides 'covered' indexes - we do not need to go back to the primary table once we have found the index entry. Instead, we bundle the data we care about right in the index rows, saving read-time overhead.

Phoenix supports two main forms of indexing: immutable and mutable indexing. They are useful in different scenarios and have their own failure profiles and performance characteristics. Both are 'global' indexes - they live in their own tables and are copies of the primary table data, which Phoenix ensures remain in sync.
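
For reference, here is a minimal sketch of creating a covered, global index through Phoenix's JDBC driver. The connection URL, table name, and column names are illustrative assumptions, not part of Phoenix itself; only the CREATE TABLE and CREATE INDEX ... INCLUDE syntax comes from Phoenix.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CoveredIndexExample {
    public static void main(String[] args) throws Exception {
        // Example JDBC URL; replace 'localhost' with your ZooKeeper quorum.
        // Assumes the Phoenix client jar is on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            Statement stmt = conn.createStatement();
            // Hypothetical schema: events keyed by (host, created_date).
            stmt.execute("CREATE TABLE events ("
                + "host VARCHAR NOT NULL, "
                + "created_date DATE NOT NULL, "
                + "event_type VARCHAR, "
                + "payload VARCHAR "
                + "CONSTRAINT pk PRIMARY KEY (host, created_date))");
            // Covered index: event_type is the indexed column, and payload is
            // bundled into the index rows so queries on event_type never have
            // to go back to the primary table.
            stmt.execute("CREATE INDEX events_by_type "
                + "ON events (event_type) INCLUDE (payload)");
        }
    }
}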

Immutable Indexing

Immutable indexing is supported for rows that are write-once; this is common in time-series data, where you write data once but read it many times. In this case, the indexing is managed entirely on the client - either we successfully write all the primary and index data, or we return a failure to the client. Since the data is immutable, the client can simply retry on failure.
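
A minimal sketch of opting into immutable indexing, assuming a hypothetical time-series schema: declaring the table with IMMUTABLE_ROWS=true marks its rows as write-once, and indexes created on it are then maintained on the client as described above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ImmutableIndexExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            Statement stmt = conn.createStatement();
            // IMMUTABLE_ROWS=true tells Phoenix the rows are write-once,
            // so index maintenance stays entirely on the client.
            stmt.execute("CREATE TABLE metrics ("
                + "host VARCHAR NOT NULL, "
                + "recorded_at DATE NOT NULL, "
                + "metric_value DECIMAL "
                + "CONSTRAINT pk PRIMARY KEY (host, recorded_at)) "
                + "IMMUTABLE_ROWS=true");
            stmt.execute("CREATE INDEX metrics_by_value ON metrics (metric_value)");
        }
    }
}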

Mutable Indexing

Often, the rows you are inserting change over time - pretty much any case that is not time-series data. Here, we cannot rely on the client for retries, as the data it would insert later is likely to be different (the user goes away, external sources are ephemeral, etc.). Instead, we delegate this work to a RegionObserver coprocessor that ensures all the data is written to the primary and index tables.

All the penalties for indexes occur at write time. We intercept the primary table updates on write (HTable#delete, HTable#put and HTable#batch), build the index updates and then send any necessary updates to all interested index tables. At read time, Phoenix selects the correct index table to use and scans it directly, like any other HBase table.

NOTE: We do not support increment or append operations at this time in the generic indexing framework. This does not impact Phoenix, which currently only supports HBase Put and Delete operations.
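
To make "build the index update" concrete, here is a simplified, self-contained sketch of how an index row might be derived from a primary table update: the indexed value leads the index row key, the primary row key is appended to keep it unique, and covered column values are carried along. The key layout and separator here are illustrative assumptions, not Phoenix's actual encoding.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified illustration of deriving an index row from a primary row update. */
public class IndexRowSketch {

    /** Index row key = indexed value + 0x00 separator + primary row key (illustrative layout). */
    static byte[] indexRowKey(byte[] indexedValue, byte[] primaryRowKey) {
        byte[] key = new byte[indexedValue.length + 1 + primaryRowKey.length];
        System.arraycopy(indexedValue, 0, key, 0, indexedValue.length);
        key[indexedValue.length] = 0x00; // separator, an arbitrary choice for this sketch
        System.arraycopy(primaryRowKey, 0, key, indexedValue.length + 1, primaryRowKey.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] primaryKey = "host1|2013-10-09".getBytes(StandardCharsets.UTF_8);
        byte[] eventType = "login".getBytes(StandardCharsets.UTF_8);  // indexed column value
        byte[] payload = "session=42".getBytes(StandardCharsets.UTF_8); // covered column value

        byte[] indexKey = indexRowKey(eventType, primaryKey);
        // The index row carries the covered column, so a query by event type
        // never has to go back to the primary table.
        System.out.println("index row key: " + Arrays.toString(indexKey));
        System.out.println("covered value: " + new String(payload, StandardCharsets.UTF_8));
    }
}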

Data Guarantees and Failure Management

On successful return to the client, all data is guaranteed to be written to all interested indexes and the primary table. Updates are all-or-nothing, with only a small window during which the primary table may lag the index tables. From the perspective of a single client, either all of the update worked or none of it did.

We maintain index update durability by adding the index updates to the Write-Ahead-Log (WAL) entry of the primary table row. Only after the WAL entry is successfully synced to disk do we attempt to make the index/primary table updates. We write the index updates in parallel by default, leading to very high throughput. If the server crashes while we are writing the index updates, we replay all the index updates to the index tables in the WAL recovery process and rely on the idempotence of the updates to ensure correctness. Therefore, index tables are only ever a single edit ahead of the primary table.

It's important to note several points:

  • We do not provide full transactions, so you could see the index tables briefly out of sync with the primary table.
  • As noted above, this is acceptable because the index is only a very small bit ahead and out of sync for very short periods.
  • All data is guaranteed to be either written or lost - we never see partial updates.
  • All data is written to the index tables before the primary table.
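
The idempotence argument above can be seen with a toy model: applying the same timestamped cell once or twice leaves a versioned store in the same state, which is why replaying index updates during WAL recovery is safe. This is a simplified model written for illustration, not HBase's storage code.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy versioned key-value store: column -> (timestamp -> value). */
public class IdempotentReplaySketch {
    private final Map<String, TreeMap<Long, String>> store = new HashMap<>();

    void apply(String column, long timestamp, String value) {
        store.computeIfAbsent(column, c -> new TreeMap<>()).put(timestamp, value);
    }

    public static void main(String[] args) {
        IdempotentReplaySketch once = new IdempotentReplaySketch();
        once.apply("event_type", 1000L, "login");

        IdempotentReplaySketch replayed = new IdempotentReplaySketch();
        replayed.apply("event_type", 1000L, "login");
        replayed.apply("event_type", 1000L, "login"); // WAL recovery applies the same edit again

        // Same final state either way, so replaying index updates is harmless.
        System.out.println(once.store.equals(replayed.store)); // true
    }
}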

Singular Write Path

There is a single write path that guarantees the failure properties. All writes to the HRegion get intercepted by our coprocessor. We then build the index updates based on the pending update (or updates, in the case of a batch). These updates are then appended to the WAL entry for the original update.

If we get any failure up to this point, we return the failure to the client and no data is persisted or made visible to the client.

Once the WAL is written, we ensure that the index and primary table data will become visible, even in the case of a failure.

  • If the server does not crash, we just insert the index updates to their respective tables.
  • If the server does crash, we replay the index updates with the usual WAL replay mechanism.
    • If any of the index updates fails, we fail the server, ensuring we get the WAL replay of the updates later.
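
The sequence above can be summarized in a sketch. The interfaces and method names below are hypothetical stand-ins for the coprocessor machinery, intended only to show the ordering of operations and where a failure is still safe; they are not Phoenix's actual classes.

/** Hypothetical sketch of the single write path; names are illustrative only. */
public class WritePathSketch {

    interface Wal { void appendAndSync(byte[] primaryEdit, byte[][] indexEdits) throws Exception; }
    interface TableWriter { void write(byte[] edit) throws Exception; }
    interface FailurePolicy { void handleFailure(Exception cause); }

    void write(byte[] primaryEdit, byte[][] indexEdits,
               Wal wal, TableWriter primary, TableWriter[] indexes,
               FailurePolicy policy) throws Exception {
        // 1. Append the index edits to the primary row's WAL entry and sync it.
        //    Any failure up to and including this step is simply returned to the
        //    client; nothing has been persisted or made visible.
        wal.appendAndSync(primaryEdit, indexEdits);

        // 2. After the WAL sync, the updates are guaranteed to become visible:
        //    either we write them now, or WAL replay re-applies them on recovery.
        try {
            for (int i = 0; i < indexes.length; i++) {
                indexes[i].write(indexEdits[i]); // done in parallel in practice
            }
        } catch (Exception e) {
            // 3. An index write failure triggers the failure policy (see below),
            //    which forces WAL replay by killing the server.
            policy.handleFailure(e);
            throw e;
        }

        // 4. Finally the primary table edit is applied, so the index tables are
        //    only ever a single edit ahead of the primary table.
        primary.write(primaryEdit);
    }
}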

Enforcing WAL Replay

Currently, there is a single failure policy for managing index write failures, the Kill Server On Failure Policy. If any of the index updates fails, this policy will attempt to abort the server immediately. If the abort fails, we call System.exit on the JVM, forcing the server to die. By killing the server, we ensure that the WAL will be replayed on recovery, replaying the index updates to their appropriate tables.

WARNING: indexing could bring down your entire cluster very quickly.

If the index tables are not set up correctly (Phoenix ensures that they are), this failure policy can cause a cascading failure, as each region server attempts and fails to write the index update, subsequently killing itself to preserve the visibility guarantees outlined above.

We expect to support other failure policies in the future, for instance a 'Mark Index as Invalid' policy that forces the user to rebuild the index in case of failure, rather than killing the server.
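
A sketch of what the kill-server policy boils down to is shown below; the Abortable stand-in and class name are hypothetical, not the actual Phoenix or HBase types.

/** Hypothetical sketch of the kill-server-on-failure behavior described above. */
public class KillServerOnFailureSketch {

    /** Minimal stand-in for the region server's abort hook. */
    interface Abortable { void abort(String why, Throwable cause); }

    private final Abortable server;

    KillServerOnFailureSketch(Abortable server) {
        this.server = server;
    }

    void handleFailure(Throwable indexWriteFailure) {
        try {
            // Ask the region server to abort so the WAL is replayed on recovery,
            // re-applying the index updates that just failed.
            server.abort("Failed to write index update", indexWriteFailure);
        } catch (Throwable t) {
            // If even the abort fails, force the JVM down.
            System.exit(1);
        }
    }
}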

Setup

Only mutable indexing requires special configuration options on the region server to run - Phoenix ensures that they are set up correctly when you enable mutable indexing on the table; if the correct properties are not set, you will not be able to turn it on.

You will need to add the following parameters to hbase-site.xml:

<property>
  <name>hbase.regionserver.wal.codec</name>
  <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
</property>

This enables custom WAL edits to be written, ensuring proper writing/replay of the index updates. This codec supports the usual host of WALEdit options, most notably WALEdit compression.
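
As a quick sanity check (assuming hbase-site.xml is on the classpath), the property can be read back through the standard HBase configuration API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalCodecCheck {
    public static void main(String[] args) {
        // Loads hbase-default.xml and hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        System.out.println("hbase.regionserver.wal.codec = "
            + conf.get("hbase.regionserver.wal.codec"));
    }
}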

Tuning

Out of the box, indexing is pretty fast. However, to optimize for your particular environment and workload, there are several properties you can tune.

Mutable Indexing

All of the following parameters must be set in hbase-site.xml - they apply to the entire cluster and all index tables, as well as across all regions on the same server (so, for instance, a single server will not write to too many different index tables at once).

  1. index.builder.threads.max
  • Number of threads to use to build the index update from the primary table update.
  • Increasing this value overcomes the bottleneck of reading the current row state from the underlying HRegion. Setting it too high just moves the bottleneck back to the HRegion, which cannot handle too many concurrent scan requests, and adds general thread-swapping overhead.
  • Default: 10
  2. index.builder.threads.keepalivetime
  • Amount of time in seconds after which we expire threads in the builder thread pool.
  • Unused threads are released after this amount of time, and no core threads are retained (a small concern, as tables are expected to sustain a fairly constant write load); this lets us drop threads when we are not seeing the expected load.
  • Default: 60
  3. index.writer.threads.max
  • Number of threads to use when writing to the target index tables.
  • The first level of parallelization, on a per-table basis - it should roughly correspond to the number of index tables.
  • Default: 10
  4. index.writer.threads.keepalivetime
  • Amount of time in seconds after which we expire threads in the writer thread pool.
  • Unused threads are released after this amount of time, and no core threads are retained (a small concern, as tables are expected to sustain a fairly constant write load); this lets us drop threads when we are not seeing the expected load.
  • Default: 60
  5. hbase.htable.threads.max
  • Number of threads each index HTable can use for writes.
  • Increasing this allows more concurrent index updates (for instance, across batches), leading to higher overall throughput.
  • Default: 2,147,483,647
  6. hbase.htable.threads.keepalivetime
  • Amount of time in seconds after which we expire threads in the HTable's thread pool.
  • With the "direct handoff" approach, new threads are only created as necessary, so the pool can grow without bound. This could be a problem, but HTables only create as many Runnables as there are region servers; the pool therefore also scales as new region servers are added.
  • Default: 60
  7. index.tablefactory.cache.size
  • Number of index HTables we should keep in cache.
  • Increasing this number ensures that we do not need to recreate an HTable for each attempt to write to an index table. Conversely, you could see memory pressure if this value is set too high.
  • Default: 10
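
For illustration, the sketch below sets a few of these keys programmatically on an HBase Configuration, as you might for an embedded or test cluster; the values shown are just the defaults. On a real cluster the same keys belong in each region server's hbase-site.xml, as in the Setup section above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class IndexTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // The values below are the documented defaults; tune them to your own workload.
        conf.setInt("index.builder.threads.max", 10);
        conf.setLong("index.builder.threads.keepalivetime", 60);
        conf.setInt("index.writer.threads.max", 10);
        conf.setLong("index.writer.threads.keepalivetime", 60);
        conf.setInt("index.tablefactory.cache.size", 10);
        System.out.println("index.builder.threads.max = "
            + conf.getInt("index.builder.threads.max", -1));
    }
}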

Performance

We track secondary index performance via our performance framework. This is a generic test of performance based on defaults - your results will vary based on your hardware specs as well as your individual configuration.

That said, we have seen secondary indexing (both immutable and mutable) run at less than 2x the cost of the regular write path on a small (3-node) desktop-based cluster. This is actually phenomenal, considering we have to write to multiple tables as well as build the index updates.

Presentations

There have been several presentations on secondary indexing in Phoenix that take a more in-depth look at how it works (with pretty pictures!):
