
SNAP 2710 - Updating code to use structured streaming #17

Open · wants to merge 30 commits into base: master
Conversation

vatsalmevada (Contributor) commented:
The following major changes are made:

  • updated the code to use Structured Streaming (a minimal sketch of the new ingestion style follows below)
  • removed the modules related to performance comparison with other products
    (the benchmarks will have to be re-run with the Structured Streaming code and the comparison redone in the future)
  • cleaned up unused code
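For orientation, here is a minimal sketch of the Structured Streaming ingestion style this PR moves to. It is illustrative only: the topic name, broker address, and object name are placeholders, not the actual code in this PR.

```
import org.apache.spark.sql.SparkSession

object StructuredIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AdAnalytics-StructuredStreaming")
      .master("local[*]")
      .getOrCreate()

    // Read the raw ad-impression stream from Kafka; the topic and
    // broker address are placeholders.
    val impressions = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "adImpressionsTopic")
      .load()

    // Echo to the console just to show the query lifecycle; the real
    // pipeline aggregates and writes to the SnappyData store instead.
    val query = impressions.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```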

Vatsal Mevada added 17 commits October 17, 2019 19:20
ad analytics.

- Upgraded the producer to Kafka 10.

- Removed the async producer class since, with the new version of
  Kafka, the producer is asynchronous by default (see the sketch below).

- Cleaned up code that is no longer used.
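A minimal sketch of the behavior this commit relies on: with the unified Java client, `KafkaProducer.send()` is asynchronous by default, returning a `Future` and accepting an optional callback. The topic name and serializer choices below are assumptions for illustration.

```
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

val producer = new KafkaProducer[String, Array[Byte]](props)

// send() only enqueues the record and returns immediately; delivery
// happens on the producer's background I/O thread, so no hand-rolled
// async wrapper class is needed anymore.
producer.send(
  new ProducerRecord[String, Array[Byte]]("adImpressionsTopic", "someKey", "someValue".getBytes),
  new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
      if (exception != null) exception.printStackTrace()
  })

producer.close()
```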
to disable optimization rules related to constraint propagation.

Cherry-picked from e011004bedca47be998a0c14fe22a6f9bb5090cd and resolved
merge conflicts.
- Creating a new instance of the Deserializer per partition instead of
  sharing a single instance for deserialization.
- Also explicitly providing a new instance of `SpecificData` while
  creating the `SpecificDatumReader`. Without this, the
  `SpecificDatumReader` internally uses a singleton instance of
  `SpecificData` which maintains a cache of loaded classes. This can
  lead to a `ClassCastException`, as the DatumReader ends up using
  cached classes that were loaded by a different classloader (see the
  sketch below).
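A sketch of the fix described in this commit, assuming a generated Avro class named `AdImpressionLog` (the actual class name in the project may differ):

```
import org.apache.avro.specific.{SpecificData, SpecificDatumReader}

// `AdImpressionLog` stands in for the Avro-generated record class.
val schema = AdImpressionLog.getClassSchema

// A fresh SpecificData bound to the current context classloader; its
// class cache is private to this reader instead of the process-wide
// singleton returned by SpecificData.get().
val specificData = new SpecificData(Thread.currentThread().getContextClassLoader)

// One reader per partition, each with its own SpecificData, so cached
// classes loaded by another classloader can no longer leak in and
// trigger a ClassCastException.
val reader = new SpecificDatumReader[AdImpressionLog](schema, schema, specificData)
```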
- Enabling streaming, transactions and interactive analytics in a single unifying system rather than stitching different solutions—and
- Delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of synopses as well as the full dataset. SnappyData implements this by deeply integrating an in-memory database into Apache Spark.
[SnappyData](https://github.com/SnappyDataInc/snappydata) aims to deliver real time operational analytics at interactive
speeds with commodity infrastructure and far less complexity than today.


"far less complexity than today" - looks a little vague. Can we be more specific as to what complexities are involved and how are we getting rid of those?

README.md (outdated excerpt):
incorporate more complex analytics, rather than using map-reduce).
- Demonstrate storing the pre-aggregated logs into the SnappyData columnar store with high efficiency. While the store
itself provides a rich set of features like hybrid row+column store, eager replication, WAN replicas, HA, choice of memory-only, HDFS, native disk persistence, eviction, etc., we only work with a column table in this simple example.
- Run OLAP queries from any SQL client both on the full data set as well as sampled data (showcasing sub-second


Should a link to the AQP documentation be provided here?

vatsalmevada (Contributor, Author) replied:


These are just formatting changes; AQP is not mentioned in the current master either. That said, the AQP capabilities are explained separately further down in the same README file.

are then aggregated using [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) into the SnappyData Store. External clients connect to the same cluster using JDBC/ODBC and run arbitrary OLAP queries.
As AdServers can feed logs from many websites and given that each AdImpression log message represents a single Ad viewed
by a user, one can expect thousands of messages every second. It is crucial that ingestion logic keeps up with the
stream. To accomplish this, SnappyData collocates the store partitions with partitions created by Spark Streaming.
@nikhilbandi commented on Feb 24, 2020:


It would be better to briefly mention how the collocation happens.

vatsalmevada (Contributor, Author) replied:


It won't be easy to explain collocation briefly. I will check whether the documentation has a link which we can provide here.
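To make the ingestion described in the excerpt above concrete, here is a hedged sketch of a streaming aggregation writing to a SnappyData table. The sink format name `snappysink`, the option names, the table name, and the column names are assumptions for illustration; consult the SnappyData documentation for the exact API.

```
import org.apache.spark.sql.functions._

// `impressions` is the parsed AdImpression DataFrame from the earlier
// sketch; the "timestamp", "publisher", "geo", "bid", and "cookie"
// columns are assumed for illustration.
val aggregated = impressions
  .groupBy(window(col("timestamp"), "1 second"), col("publisher"), col("geo"))
  .agg(
    avg("bid").as("avg_bid"),
    count(lit(1)).as("imps"),
    approx_count_distinct("cookie").as("uniques"))

val query = aggregated.writeStream
  .format("snappysink")                      // assumed sink name
  .option("tableName", "aggrAdImpressions")  // assumed option and table names
  .option("checkpointLocation", "/tmp/aggr-sketch-checkpoint")
  .outputMode("update")
  .start()
```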

README.md (outdated excerpt):

Start generating and publishing logs to Kafka from the `/snappy-poc/` folder
```
./gradlew generateAdImpressions
```

You can see Spark Streaming processing batches of data once every second in the [Spark console](http://localhost:4040/streaming/). It is important that our stream processing keeps up with the input rate, so we check that the 'Scheduling Delay' does not keep increasing and the 'Processing Time' remains under a second.
You can monitor the streaming query processing on the [Structured Streaming UI](http://localhost:5050/structuredstreaming/). It is


Should the link title be "SnappyData Structured Streaming UI" as it's something we have implemented?

Vatsal Mevada added 7 commits February 24, 2020 16:33
- updating all doc references of `snappy-poc` to `snappy-examples`
streaming aggregation state is cleaned up immediately.
This also means delayed events won't be folded into the aggregation
state. We are not handling delayed events in this use case anyway:
since we are using a column table without key columns, the sink will
always use the insert operation. Handling delayed events would require
using putInto, which would be very expensive anyway (see the sketch
below).

SNAP-3285 is logged to handle delayed events in a more optimized manner.
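A minimal sketch of the trade-off this commit describes, assuming the `impressions` DataFrame from the earlier sketches: a zero-delay watermark lets Spark drop aggregation state as soon as a window closes, at the cost of discarding late events.

```
import org.apache.spark.sql.functions.{col, window}

// A zero-delay watermark approximates "state cleaned up immediately":
// once a 1-second window falls behind the watermark, its aggregation
// state is dropped and any event arriving later is simply discarded.
// Because the sink inserts into a column table without key columns,
// late rows could only be merged via putInto, which is expensive.
val eagerlyCleaned = impressions
  .withWatermark("timestamp", "0 seconds")
  .groupBy(window(col("timestamp"), "1 second"), col("publisher"))
  .count()
```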