
viyadb-spark

Data processing backend (indexer) for ViyaDB based on Spark.


There are two processes defined in this project:

  • Streaming process
  • Batch process

The streaming process reads events in real time, pre-aggregates them, and writes TSV files loadable into ViyaDB to deep storage. The batch process creates a historical view of the data, containing events from the previous batch plus events produced afterwards by the streaming process.

The overall flow can be presented graphically like this:

                                       +----------------+
                                       |                |
                                       |  Streaming Job |
                                       |                |
                                       +----------------+
                                               |
                                               |  writes current events
                                               v
         +------------------+         +--------+---------+
         | Previous Period  |         | Current Period   |
         | Real-Time Events |--+      | Real-Time Events |
         +------------------+  |      +------------------+
                               |
         +------------------+  |      +------------------+
         | Historical       |  |      | Historical       |
         | Events           |  |      | Events           |
         +------------------+  |      +------------------+      ...
            |                  |                   ^
 -----------|------------------|-------------------|----------------------------->
            |                  |                   |                Timeline
            |                  v                   |
            |              +-------------+         |
            |              |             |         |  unions previous period events
            +------------> |  Batch Job  |---------+  with all the historical events
                           |             |            that existed before
                           +-------------+

Features

Real-time Process

The real-time process is responsible for the following (see the sketch after this list):

  • Read data from a source and parse it (for now only Kafka support is provided as part of the code, but it can be easily extended)
  • Aggregate events by a configured time window
  • Generate data loadable by ViyaDB (TSV format)
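
To illustrate the pre-aggregation idea, here is a minimal sketch in batch-style Spark SQL standing in for the streaming logic (this is not the project's actual code: the object name, column names, the 5-minute window, and the output path are all hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pre-aggregation-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical raw events: (event_time, country, revenue)
    val events = Seq(
      ("2019-01-01 00:00:03", "US", 1.5),
      ("2019-01-01 00:00:07", "US", 2.0),
      ("2019-01-01 00:00:09", "IL", 0.5)
    ).toDF("event_time", "country", "revenue")

    // Pre-aggregate: group by a time window plus the dimension columns and
    // sum the metrics, collapsing raw events into far fewer rows.
    val aggregated = events
      .withColumn("ts", to_timestamp($"event_time"))
      .groupBy(window($"ts", "5 minutes"), $"country")
      .agg(sum($"revenue").as("revenue"))

    // ViyaDB loads TSV, so the result is written tab-separated to deep storage.
    aggregated
      .select($"window.start".as("period"), $"country", $"revenue")
      .write
      .option("sep", "\t")
      .csv("s3://my-bucket/events/realtime/dt=2019-01-01/")
  }
}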

Batch Process

The batch process does the following (see the sketch after this list):

  • Reads events that were generated by the real-time process
  • Optionally, cleans irrelevant events out of the dataset
  • Aggregates the dataset
  • Partitions the data into parts of equal size (in terms of aggregated row count), and writes these partitions back to historical storage
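
A sketch of the size-based partitioning step follows (again, not the actual implementation: the paths and the target rows-per-partition figure are made up, and the aggregation step itself is omitted for brevity):

import org.apache.spark.sql.SparkSession

object BatchPartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-partitioning-sketch").getOrCreate()

    // Rows produced by the real-time process for the previous period,
    // plus the historical events that existed before it.
    val realtime = spark.read.option("sep", "\t")
      .csv("s3://my-bucket/events/realtime/dt=2019-01-01/")
    val historical = spark.read.option("sep", "\t")
      .csv("s3://my-bucket/events/historical/")
    val combined = realtime.union(historical)

    // Derive a partition count from the row count so every output
    // partition holds roughly the same number of aggregated rows.
    val targetRowsPerPartition = 1000000L
    val numPartitions =
      math.max(1, math.ceil(combined.count().toDouble / targetRowsPerPartition).toInt)

    // Repartition and write the new historical view back to deep storage.
    combined.repartition(numPartitions)
      .write.option("sep", "\t")
      .csv("s3://my-bucket/events/historical-new/")
  }
}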

Prerequisites

Consul is used for storing configuration as well as for synchronizing the different components a ViyaDB cluster consists of. For running either the real-time or the batch process, the following configuration keys must be present in Consul:

  • <consul prefix>/tables/<table name>/config - Table configuration
  • <consul prefix>/indexers/<indexer ID>/config - Indexer configuration
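
For example, assuming a table called events and an indexer ID of main (both hypothetical), with the JSON documents saved locally as table.json and indexer.json, the keys can be created with the Consul CLI:

consul kv put viyadb/tables/events/config @table.json
consul kv put viyadb/indexers/main/config @indexer.json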

Building

mvn package

Running

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.1.0-uberjar.jar \
    --consul-host "<consul host>" --consul-prefix "viyadb" \
    --indexer-id "<indexer ID>"

To run the streaming job, use com.github.viyadb.spark.streaming.Job for jobClass; to run the batch job, use com.github.viyadb.spark.batch.Job.
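
For example, a streaming job pointed at a local Consul agent might be launched like this (the host and indexer ID are placeholders):

spark-submit --class com.github.viyadb.spark.streaming.Job \
    target/viyadb-spark_2.11-0.1.0-uberjar.jar \
    --consul-host "localhost" --consul-prefix "viyadb" --indexer-id "main"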

To see all available options, run:

spark-submit --class <jobClass> target/viyadb-spark_2.11-0.1.0-uberjar.jar --help