-> Demo | -> Slides |
---|
Figure 1. Final dashboard on Tableau. Blue components show recent records, green components show historical records, and purple components show the difference between the two. When a department is selected, only that department's data is rendered on the heat maps, the historical line plot, and the recent average count.
This daily review dashboard shows 311 complaint resolution performance for government agencies. As new complaint records become available via the API, a data transformation process runs daily to extract the new data and refresh the dashboard, which describes how each agency responsible for resolving the reported issues is performing compared with its historical record.
Figure 2. Data flow diagram. Historical data was processed once using EMR. For the daily update, CloudWatch triggers the whole process, and Lambda functions drive the data flows.
- Cost-efficient. Because the data is processed only once a day, Lambda functions (pay only for what you use) help keep the cost down.
- Handles API unavailability. Because the data source API goes down for maintenance monthly, the pipeline should 1) not crash when the API is unavailable and 2) process all unprocessed data once the API is back.
- Extensible and scalable. With Kinesis as the ingestion layer, the pipeline can accept more data producers (e.g. additional APIs) and more data consumers. Lambda functions, Kinesis, S3, and Redshift all scale easily to accommodate future data growth.
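
The API-unavailability requirement above can be illustrated with a checkpoint that only advances after a day's data has been fetched and published. This is a minimal sketch, not the project's actual Lambda code: `ApiUnavailable`, `run_daily_update`, `fetch_day`, and `publish` are hypothetical names, and in the real pipeline the checkpoint would have to be persisted somewhere between runs.

```python
from datetime import date, timedelta
from typing import Callable, List


class ApiUnavailable(Exception):
    """Raised by the (hypothetical) fetcher when the source API is down."""


def run_daily_update(
    last_processed: date,
    today: date,
    fetch_day: Callable[[date], List[dict]],
    publish: Callable[[List[dict]], None],
) -> date:
    """Process every unprocessed day up to `today` and return the new checkpoint."""
    checkpoint = last_processed
    day = last_processed + timedelta(days=1)
    while day <= today:
        try:
            records = fetch_day(day)
        except ApiUnavailable:
            # Stop without crashing; the next run resumes from `checkpoint`.
            break
        publish(records)
        checkpoint = day  # advance only after the day was handled
        day += timedelta(days=1)
    return checkpoint
```

If a run stops early because of an outage, the returned checkpoint still points at the last fully processed day, so the next scheduled run picks up every missed day before handling the current one.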
- Python3
- AWS CLI
The 311 complaints dataset is available on NYC Open Data and can be accessed via the Socrata API.
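
As a rough illustration of what a request to the Socrata API might look like, the sketch below builds a SoQL query URL for records created after a given timestamp. The dataset id `erm2-nwe9`, the `created_date` field, and the helper name `build_query_url` are assumptions for illustration and are not taken from this project's code. The `$where`, `$order`, `$limit`, and `$offset` parameters are standard SoQL query parameters.

```python
from urllib.parse import urlencode

# Assumed endpoint: the dataset id is an assumption, not taken from this project.
BASE_URL = "https://data.cityofnewyork.us/resource/erm2-nwe9.json"


def build_query_url(since_iso: str, limit: int = 1000, offset: int = 0) -> str:
    """Build a SoQL query URL for records created after `since_iso`."""
    params = {
        # `created_date` is an assumed field name for the complaint timestamp.
        "$where": f"created_date > '{since_iso}'",
        "$order": "created_date",  # stable ordering so paging is consistent
        "$limit": limit,
        "$offset": offset,
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("2020-01-01T00:00:00", limit=500))
```

Paging through a large day of records would then be a matter of calling the helper repeatedly with an increasing `offset` until a response comes back empty.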
bash ./src/cloudwatch/run_cloudwatch_trigger.sh