SNAP 2710 - Updating code to use structured streaming #17
base: master
Conversation
ad analytics. - Upgraded the producer to Kafka 0.10 - Removed the async producer class, as the producer is asynchronous by default in the new Kafka version - Cleaned up code which is not used anymore
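A minimal sketch of what the producer side looks like with the Kafka 0.10 client, where `send()` is already asynchronous, so no separate async wrapper class is needed. The topic name, serializers, and record payloads are illustrative assumptions, not the exact code in this commit.

```
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object AdImpressionProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // send() is non-blocking: it buffers the record and returns a Future,
    // so a separate AsyncProducer-style wrapper class is no longer needed.
    (1 to 10).foreach { i =>
      producer.send(new ProducerRecord[String, String]("adImpressionsTopic", s"key-$i", s"impression-$i"))
    }

    producer.flush()
    producer.close()
  }
}
```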
to disable optimization rules related to constraint propagation. Cherry-picked from e011004bedca47be998a0c14fe22a6f9bb5090cd and resolved merge conflicts.
deserialization. Instead, creating a new instance of Deserializer per partition. - Also explicitly providing a new instance of `SpecificData` while creating the `SpecificDatumReader`. Without this, the `SpecificDatumReader` internally uses a singleton instance of `SpecificData`, which maintains a cache of loaded classes. This can lead to a `ClassCastException`, as the DatumReader ends up using cached classes that were loaded by a different classloader.
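A minimal sketch of the pattern described above, assuming an Avro `SpecificRecord` class (the `AdImpressionLog` name here is hypothetical): the `SpecificDatumReader` is given its own `SpecificData` instance bound to the current classloader instead of the shared singleton, so its class cache cannot mix classes loaded by different classloaders.

```
import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.{SpecificData, SpecificDatumReader}

// `AdImpressionLog` stands in for the generated Avro SpecificRecord class.
class AdImpressionDeserializer {
  // One SpecificData per deserializer (i.e. per partition), bound to the current
  // classloader, instead of the process-wide SpecificData.get() singleton whose
  // class cache may hold classes loaded by a different classloader.
  private val specificData = new SpecificData(Thread.currentThread().getContextClassLoader)

  private val reader = new SpecificDatumReader[AdImpressionLog](
    AdImpressionLog.getClassSchema(), AdImpressionLog.getClassSchema(), specificData)

  def deserialize(bytes: Array[Byte]): AdImpressionLog = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[AdImpressionLog], decoder)
  }
}
```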
instead of a custom sink implementation.
- cleanup - correction in readme - bumping up snappy version
- Enabling streaming, transactions and interactive analytics in a single unifying system rather than stitching different solutions—and
- Delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of synopses as well as the full dataset. SnappyData implements this by deeply integrating an in-memory database into Apache Spark.
[SnappyData](https://github.com/SnappyDataInc/snappydata) aims to deliver real time operational analytics at interactive
speeds with commodity infrastructure and far less complexity than today.
"far less complexity than today" - looks a little vague. Can we be more specific as to what complexities are involved and how are we getting rid of those?
README.md (Outdated)
incorporate more complex analytics, rather than using map-reduce).
- Demonstrate storing the pre-aggregated logs into the SnappyData columnar store with high efficiency. While the store
itself provides a rich set of features like hybrid row+column store, eager replication, WAN replicas, HA, choice of memory-only, HDFS, native disk persistence, eviction, etc we only work with a column table in this simple example.
- Run OLAP queries from any SQL client both on the full data set as well as sampled data (showcasing sub-second
Should a link to the AQP documentation be provided here?
These are just formatting changes. AQP is not mentioned in the current master either, although further down we have explained the AQP capabilities separately in the same README file.
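For reference, a minimal sketch (in Scala) of what the exact and sampled (AQP) queries mentioned in the README excerpt above might look like when run through a `SnappySession`. The table name `aggrAdImpressions`, the column names, and the error/confidence values are illustrative assumptions; the `with error ... confidence ...` clause is SnappyData's documented AQP syntax.

```
import org.apache.spark.sql.{SnappySession, SparkSession}

object AqpQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AqpQuerySketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // Exact OLAP query against the full column table (names are illustrative).
    snappy.sql(
      "select count(*) as adCount, geo from aggrAdImpressions " +
      "group by geo order by adCount desc limit 20").show()

    // Approximate query answered from a sample table, with an error bound and
    // confidence level, to get sub-second responses on large data sets.
    snappy.sql(
      "select count(*) as adCount, geo from aggrAdImpressions " +
      "group by geo order by adCount desc limit 20 with error 0.2 confidence 0.95").show()
  }
}
```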
are then aggregated using [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) into the SnappyData Store. External clients connect to the same cluster using JDBC/ODBC and run arbitrary OLAP queries.
As AdServers can feed logs from many websites and given that each AdImpression log message represents a single Ad viewed
by a user, one can expect thousands of messages every second. It is crucial that ingestion logic keeps up with the
stream. To accomplish this, SnappyData collocates the store partitions with partitions created by Spark Streaming.
It would be better to briefly mention how the collocation happens.
It won't be easy to explain collocation briefly. I will check whether there is a link in the documentation that we can provide here.
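To make the ingestion path described above more concrete, here is a minimal sketch of a structured streaming query that reads AdImpression messages from Kafka, aggregates them, and writes to a SnappyData column table. The Kafka topic, the JSON payload (the real example uses Avro), the table name, and in particular the `snappysink` format and its options are assumptions for illustration, not the exact code in this PR.

```
import org.apache.spark.sql.{SnappySession, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object AdImpressionIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AdImpressionIngestSketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)
    import snappy.implicits._

    // Assumed payload layout; the real example ships Avro-encoded AdImpressionLog records.
    val schema = new StructType()
      .add("publisher", StringType)
      .add("geo", StringType)
      .add("bid", DoubleType)

    // Read the raw AdImpression stream from Kafka (broker and topic are illustrative).
    val impressions = snappy.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "adImpressionsTopic")
      .load()
      .select(from_json($"value".cast("string"), schema).as("imp"), $"timestamp")
      .select($"imp.*", $"timestamp")

    // Aggregate impressions per publisher and geo over 1-second windows.
    val aggregated = impressions
      .groupBy(window($"timestamp", "1 second"), $"publisher", $"geo")
      .agg(count("*").as("impressions"), avg("bid").as("avgBid"))

    // Write the aggregates into a SnappyData column table; the sink format name and
    // options below are assumptions about SnappyData's structured streaming sink.
    aggregated.writeStream
      .format("snappysink")
      .option("tableName", "aggrAdImpressions")
      .option("checkpointLocation", "/tmp/adImpressionsCheckpoint")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}
```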
README.md (Outdated)
Start generating and publishing logs to Kafka from the `/snappy-poc/` folder
```
./gradlew generateAdImpressions
```
You can see the Spark streaming processing batches of data once every second in the [Spark console](http://localhost:4040/streaming/). It is important that our stream processing keeps up with the input rate. So, we note that the 'Scheduling Delay' doesn't keep increasing and 'Processing time' remains less than a second.
You can monitor the streaming query processing on the [Structured Streaming UI](http://localhost:5050/structuredstreaming/). It is
Should the link title be "SnappyData Structured Streaming UI" as it's something we have implemented?
- updating all doc references of `snappy-poc` to `snappy-examples`
streaming aggregation state is cleaned up immediately. This also means delayed events won't be adjusted against the aggregation state. We are not handling delayed events in this use case anyway, because we use a column table without key columns, so the sink will always use the insert operation. To handle delayed events we would need to use putInto, which would be very expensive. SNAP-3285 is logged to handle delayed events in a more optimized manner.
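A minimal sketch of how a watermark bounds streaming aggregation state in Spark Structured Streaming. The zero-delay watermark, the rate source, and the column names below are assumptions chosen to illustrate the behaviour described above (state is dropped as soon as a window closes, so late events are ignored rather than merged back in), not the exact code in this commit.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    // The built-in "rate" test source stands in for the Kafka AdImpression stream;
    // it produces `timestamp` and `value` columns.
    val impressions = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()
      .withColumn("publisher", concat(lit("pub"), ($"value" % 5).cast("string")))

    // A watermark of 0 seconds lets Spark drop the aggregation state for a window as
    // soon as event time passes it, so events arriving later than the window are
    // ignored and the emitted rows stay insert-only (no putInto-style updates).
    val aggregated = impressions
      .withWatermark("timestamp", "0 seconds")
      .groupBy(window($"timestamp", "1 second"), $"publisher")
      .count()

    aggregated.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```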
The following major changes are done:
(We will have to re-run benchmarks with the structured streaming code and do such a comparison in the future.)