SNAP 2710 - Updating code to use structured streaming #17
base: master
Conversation
ad analytics. - Upgraded the producer to Kafka 0.10 - Removed the async producer class, as the producer is asynchronous by default in the new Kafka version - Cleaned up code which is not used anymore
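A minimal sketch of what the producer side looks like with the Kafka 0.10 client, where `send()` is already asynchronous, so no separate async wrapper class is needed. The topic name, serializers, and record payloads are illustrative assumptions, not the exact code in this commit.

```
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object AdImpressionProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // send() is non-blocking: it buffers the record and returns a Future,
    // so a separate AsyncProducer-style wrapper class is no longer needed.
    (1 to 10).foreach { i =>
      producer.send(new ProducerRecord[String, String]("adImpressionsTopic", s"key-$i", s"impression-$i"))
    }

    producer.flush()
    producer.close()
  }
}
```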
to disable optimization rules related to constraint propagation. Cherry-picked from e011004bedca47be998a0c14fe22a6f9bb5090cd and resolved merge conflicts.
deserialization. Instead, creating a new instance of Deserializer per partition. - Also explicitly providing a new instance of `SpecificData` while creating the `SpecificDatumReader`. Without this, the `SpecificDatumReader` internally uses a singleton instance of `SpecificData`, which maintains a cache of loaded classes. This can lead to a `ClassCastException`, as the DatumReader ends up using cached classes that were loaded by a different classloader.
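A minimal sketch of the pattern described above, assuming an Avro `SpecificRecord` class (the `AdImpressionLog` name here is hypothetical): the `SpecificDatumReader` is given its own `SpecificData` instance bound to the current classloader instead of the shared singleton, so its class cache cannot mix classes loaded by different classloaders.

```
import org.apache.avro.io.DecoderFactory
import org.apache.avro.specific.{SpecificData, SpecificDatumReader}

// `AdImpressionLog` stands in for the generated Avro SpecificRecord class.
class AdImpressionDeserializer {
  // One SpecificData per deserializer (i.e. per partition), bound to the current
  // classloader, instead of the process-wide SpecificData.get() singleton whose
  // class cache may hold classes loaded by a different classloader.
  private val specificData = new SpecificData(Thread.currentThread().getContextClassLoader)

  private val reader = new SpecificDatumReader[AdImpressionLog](
    AdImpressionLog.getClassSchema(), AdImpressionLog.getClassSchema(), specificData)

  def deserialize(bytes: Array[Byte]): AdImpressionLog = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null.asInstanceOf[AdImpressionLog], decoder)
  }
}
```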
instead of a custom sink implementation.
- cleanup - correction in readme - bumping up snappy version
- Enabling streaming, transactions and interactive analytics in a single unifying system rather than stitching different solutions—and
- Delivering true interactive speeds via a state-of-the-art approximate query engine that leverages a multitude of synopses as well as the full dataset. SnappyData implements this by deeply integrating an in-memory database into Apache Spark.
[SnappyData](https://github.com/SnappyDataInc/snappydata) aims to deliver real time operational analytics at interactive
speeds with commodity infrastructure and far less complexity than today.
"far less complexity than today" - looks a little vague. Can we be more specific as to what complexities are involved and how are we getting rid of those?
README.md (Outdated)
incorporate more complex analytics, rather than using map-reduce).
- Demonstrate storing the pre-aggregated logs into the SnappyData columnar store with high efficiency. While the store
itself provides a rich set of features like hybrid row+column store, eager replication, WAN replicas, HA, choice of memory-only, HDFS, native disk persistence, eviction, etc we only work with a column table in this simple example.
- Run OLAP queries from any SQL client both on the full data set as well as sampled data (showcasing sub-second
Should a link to the AQP documentation be provided here?
These are just formatting changes. AQP is not mentioned in the current master either, although further down we have explained the AQP capabilities separately in the same README file.
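For reference, a minimal sketch (in Scala) of what the exact and sampled (AQP) queries mentioned in the README excerpt above might look like when run through a `SnappySession`. The table name `aggrAdImpressions`, the column names, and the error/confidence values are illustrative assumptions; the `with error ... confidence ...` clause is SnappyData's documented AQP syntax.

```
import org.apache.spark.sql.{SnappySession, SparkSession}

object AqpQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AqpQuerySketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)

    // Exact OLAP query against the full column table (names are illustrative).
    snappy.sql(
      "select count(*) as adCount, geo from aggrAdImpressions " +
      "group by geo order by adCount desc limit 20").show()

    // Approximate query answered from a sample table, with an error bound and
    // confidence level, to get sub-second responses on large data sets.
    snappy.sql(
      "select count(*) as adCount, geo from aggrAdImpressions " +
      "group by geo order by adCount desc limit 20 with error 0.2 confidence 0.95").show()
  }
}
```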
are then aggregated using [Spark Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) into the SnappyData Store. External clients connect to the same cluster using JDBC/ODBC and run arbitrary OLAP queries.
As AdServers can feed logs from many websites and given that each AdImpression log message represents a single Ad viewed
by a user, one can expect thousands of messages every second. It is crucial that ingestion logic keeps up with the
stream. To accomplish this, SnappyData collocates the store partitions with partitions created by Spark Streaming.
It would be better to briefly mention how the collocation happens.
It won't be easy to explain collocation briefly. I will check whether there is a link in the documentation that we can provide here.
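To make the ingestion path described above more concrete, here is a minimal sketch of a structured streaming query that reads AdImpression messages from Kafka, aggregates them, and writes to a SnappyData column table. The Kafka topic, the JSON payload (the real example uses Avro), the table name, and in particular the `snappysink` format and its options are assumptions for illustration, not the exact code in this PR.

```
import org.apache.spark.sql.{SnappySession, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object AdImpressionIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("AdImpressionIngestSketch").getOrCreate()
    val snappy = new SnappySession(spark.sparkContext)
    import snappy.implicits._

    // Assumed payload layout; the real example ships Avro-encoded AdImpressionLog records.
    val schema = new StructType()
      .add("publisher", StringType)
      .add("geo", StringType)
      .add("bid", DoubleType)

    // Read the raw AdImpression stream from Kafka (broker and topic are illustrative).
    val impressions = snappy.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "adImpressionsTopic")
      .load()
      .select(from_json($"value".cast("string"), schema).as("imp"), $"timestamp")
      .select($"imp.*", $"timestamp")

    // Aggregate impressions per publisher and geo over 1-second windows.
    val aggregated = impressions
      .groupBy(window($"timestamp", "1 second"), $"publisher", $"geo")
      .agg(count("*").as("impressions"), avg("bid").as("avgBid"))

    // Write the aggregates into a SnappyData column table; the sink format name and
    // options below are assumptions about SnappyData's structured streaming sink.
    aggregated.writeStream
      .format("snappysink")
      .option("tableName", "aggrAdImpressions")
      .option("checkpointLocation", "/tmp/adImpressionsCheckpoint")
      .outputMode("update")
      .start()
      .awaitTermination()
  }
}
```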
README.md (Outdated)
Start generating and publishing logs to Kafka from the `/snappy-poc/` folder
```
./gradlew generateAdImpressions
```
You can see the Spark streaming processing batches of data once every second in the [Spark console](http://localhost:4040/streaming/). It is important that our stream processing keeps up with the input rate. So, we note that the 'Scheduling Delay' doesn't keep increasing and 'Processing time' remains less than a second.
You can monitor the streaming query processing on the [Structured Streaming UI](http://localhost:5050/structuredstreaming/). It is
Should the link title be "SnappyData Structured Streaming UI" as it's something we have implemented?
- updating all doc references of `snappy-poc` to `snappy-examples`
streaming aggregation state is cleaned up immediately. This also means delayed events won't be adjusted against the aggregation state. We are not handling delayed events in this use case anyway, because we use a column table without key columns, so the sink will always use the insert operation. To handle delayed events we would need to use putInto, which would be very expensive. SNAP-3285 is logged to handle delayed events in a more optimized manner.
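A minimal sketch of how a watermark bounds streaming aggregation state in Spark Structured Streaming. The zero-delay watermark, the rate source, and the column names below are assumptions chosen to illustrate the behaviour described above (state is dropped as soon as a window closes, so late events are ignored rather than merged back in), not the exact code in this commit.

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WatermarkSketch").getOrCreate()
    import spark.implicits._

    // The built-in "rate" test source stands in for the Kafka AdImpression stream;
    // it produces `timestamp` and `value` columns.
    val impressions = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()
      .withColumn("publisher", concat(lit("pub"), ($"value" % 5).cast("string")))

    // A watermark of 0 seconds lets Spark drop the aggregation state for a window as
    // soon as event time passes it, so events arriving later than the window are
    // ignored and the emitted rows stay insert-only (no putInto-style updates).
    val aggregated = impressions
      .withWatermark("timestamp", "0 seconds")
      .groupBy(window($"timestamp", "1 second"), $"publisher")
      .count()

    aggregated.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```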
The following major changes are done:
(We will have to re-run benchmarks with the structured streaming code and do such a comparison in the future.)