This project showcases the exceptional capabilities of Apache Flink, known for its high performance, scalability, and advanced stream processing features, which is a core component of the signalRoom technology stack. Staying true to our mission of empowering developers by sharing practical knowledge, we demonstrate the use of Flink through real-world examples based on the blog series "Building Apache Flink Applications in Java."
Flink App | Description |
---|---|
DataGeneratorApp | This Flink App generates realistic flight data for the fictional airlines Sunset Air and Sky One Airlines, seamlessly integrating modern data streaming technologies. Flight events are published to dedicated Kafka topics (airline.sunset and airline.skyone), enabling real-time processing and analysis. Simultaneously, the synthetic data flows into the apache_kickstarter.airlines.sunset_airline and apache_kickstarter.airlines.skyone_airline Apache Iceberg Tables, respectively, stored in an AWS S3 bucket, providing a robust, scalable foundation for historical data analytics. The entire solution is implemented in Java, overcoming limitations in PyFlink, which currently lacks a Python-based Apache Flink data generator source. This design ensures a powerful, flexible data pipeline, ideal for both real-time and batch use cases. |
AvroDataGeneratorApp | WARNING: STILL IN DEVELOPMENT (#463). This Flink App generates realistic flight data for the fictional airlines Sunset Air and Sky One Airlines, seamlessly integrating modern data streaming technologies. Flight events are serialized as Avro (rather than the JSON used by the DataGeneratorApp) and are published to dedicated Kafka topics (airline.sunset_avro and airline.skyone_avro), enabling real-time processing and analysis. Simultaneously, the synthetic data flows into the apache_kickstarter.airlines.sunset_airline and apache_kickstarter.airlines.skyone_airline Apache Iceberg Tables, respectively, stored in an AWS S3 bucket, providing a robust, scalable foundation for historical data analytics. The entire solution is implemented in Java, overcoming limitations in PyFlink, which currently lacks a Python-based Apache Flink data generator source. This design ensures a powerful, flexible data pipeline, ideal for both real-time and batch use cases. |
FlightImporterApp | Imports flight data from the airline.sunset and airline.skyone Kafka topics and standardizes it into a unified airline.flight Kafka topic and the apache_kickstarter.airlines.flight Apache Iceberg Table. Implemented in both Java and Python. |
FlyerStatsApp | Processes data from the airline.flight Kafka topic to aggregate flyer statistics into the airline.flyer_stats Kafka topic and the apache_kickstarter.airlines.flyer_stats Apache Iceberg Table. Implemented in both Java and Python. |
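To make the table above concrete, here is a minimal, hedged sketch of the pattern the DataGeneratorApp follows: Flink's DataGeneratorSource emits synthetic flight events, and a KafkaSink publishes them to an airline topic. The class name, payload fields, and broker address are illustrative placeholders, not the repo's actual code.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlightDataGeneratorSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Produce one synthetic "flight event" per sequence number (illustrative JSON payload).
        GeneratorFunction<Long, String> flightEvent =
            index -> String.format("{\"airline\":\"SkyOne\",\"flightNumber\":\"SKY1%04d\"}", index % 10000);

        // Effectively unbounded stream, throttled to 10 events per second.
        DataGeneratorSource<String> source = new DataGeneratorSource<>(
            flightEvent, Long.MAX_VALUE, RateLimiterStrategy.perSecond(10), Types.STRING);

        DataStream<String> flights =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "flight_data_generator");

        // Publish the events to the airline.skyone topic (broker address is a placeholder).
        KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
            .setBootstrapServers("<BOOTSTRAP_SERVERS>")
            .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("airline.skyone")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
            .build();

        flights.sinkTo(kafkaSink);
        env.execute("DataGeneratorApp (sketch)");
    }
}
```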
Created by Wade Waldron, Staff Software Practice Lead at Confluent Inc., these applications are designed to handle enterprise-scale use cases and demonstrate Flink's potential for real-world applications. Each application starts by securely fetching the Kafka Cluster and Schema Registry Cluster API keys via AWS Secrets Manager and retrieving the essential Kafka client configuration properties from the AWS Systems Manager Parameter Store.
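As a rough illustration of that startup step, the sketch below uses the AWS SDK for Java v2 to pull an API-key secret from AWS Secrets Manager and Kafka client properties from the Systems Manager Parameter Store. The class name, secret name, and parameter path are placeholders, not the repo's actual identifiers.

```java
import java.util.Properties;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.ssm.SsmClient;
import software.amazon.awssdk.services.ssm.model.GetParametersByPathRequest;

public class ConfigLoaderSketch {

    /** Returns the secret string (e.g., a JSON document holding the Kafka API key/secret). */
    public static String fetchSecret(String secretName) {
        try (SecretsManagerClient client = SecretsManagerClient.create()) {
            return client.getSecretValue(
                GetSecretValueRequest.builder().secretId(secretName).build()).secretString();
        }
    }

    /** Loads every parameter under the given Parameter Store path into java.util.Properties. */
    public static Properties fetchKafkaClientProperties(String parameterPath) {
        Properties props = new Properties();
        try (SsmClient client = SsmClient.create()) {
            // The paginator transparently walks all result pages.
            client.getParametersByPathPaginator(GetParametersByPathRequest.builder()
                    .path(parameterPath)
                    .recursive(true)
                    .withDecryption(true)
                    .build())
                .parameters()
                // Strip the path prefix so keys look like "bootstrap.servers", "security.protocol", etc.
                .forEach(p -> props.put(p.name().substring(parameterPath.length() + 1), p.value()));
        }
        return props;
    }
}
```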
Beyond simply transforming and enriching data, we stream it into Kafka topics and Apache Iceberg Tables, designed for high-performance, large-scale analytics with support for ACID transactions, offering highly scalable, real-time data processing and durable storage.
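For the Iceberg side of that pipeline, here is a hedged sketch that appends a DataStream of RowData to an Iceberg table using the Iceberg Flink connector's FlinkSink. The catalog type, warehouse location, and table identifier are simplified placeholders rather than the project's exact configuration.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.CatalogLoader;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

import java.util.Map;

public class IcebergSinkSketch {

    /** Appends the given stream to an Iceberg table backed by an S3 warehouse (placeholder paths). */
    public static void appendToIceberg(DataStream<RowData> flights) {
        // A Hadoop-type catalog rooted at the S3 "warehouse" folder; a Glue or REST catalog is wired up similarly.
        CatalogLoader catalogLoader = CatalogLoader.hadoop(
            "apache_kickstarter",
            new org.apache.hadoop.conf.Configuration(),
            Map.of("warehouse", "s3a://<YOUR_BUCKET>/warehouse"));

        TableLoader tableLoader = TableLoader.fromCatalog(
            catalogLoader, TableIdentifier.of("airlines", "flight"));

        // Stream rows into the Iceberg table; Iceberg commits a new snapshot on each successful Flink checkpoint.
        FlinkSink.forRowData(flights)
            .tableLoader(tableLoader)
            .append();
    }
}
```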
Our journey with Flink doesn’t end with Java; Python-based Flink applications also shine in this project, leveraging Flink SQL, Table API, and DataFrame API, each offering unique advantages: Flink SQL for declarative queries, Table API for a mix of SQL-like syntax and programmatic control, and DataFrame API for efficient data manipulation similar to pandas or R. This provides a seamless, flexible approach to solving complex streaming challenges, ultimately unlocking an efficient and scalable architecture for modern data needs.
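To make the Flink SQL flavor concrete, here is a hedged Java Table API sketch (the Python Table API mirrors it closely); the table schema and connector options are simplified placeholders, not the project's actual DDL.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Declarative source definition over the unified flight topic (placeholder options).
        tEnv.executeSql(
            "CREATE TABLE flight (" +
            "  email_address STRING," +
            "  departure_airport_code STRING," +
            "  arrival_airport_code STRING" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'airline.flight'," +
            "  'properties.bootstrap.servers' = '<BOOTSTRAP_SERVERS>'," +
            "  'format' = 'json'," +
            "  'scan.startup.mode' = 'earliest-offset'" +
            ")");

        // A declarative aggregation, the kind of query FlyerStatsApp expresses programmatically;
        // print() emits a continuously updating changelog for a streaming query.
        tEnv.executeSql(
            "SELECT email_address, COUNT(*) AS total_flights " +
            "FROM flight GROUP BY email_address")
            .print();
    }
}
```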
Explore the repo, run the Docker containers (compatible with both Mac M chips and non-Mac M chips), and see how Apache Flink can be utilized to build robust, enterprise-grade streaming applications.
Table of Contents
- 1.0 Are these examples better described as Flink Jobs or Flink Applications, and why does the distinction matter?
- 2.0 Let's get started!
- 3.0 Hands-On Kickoff: Practical Examples for Rapid Learning
- 4.0 Resources
- 5.0 Important Note(s)
1.0 Are these examples better described as Flink Jobs or Flink Applications, and why does the distinction matter?
"What's in a name? That which we call a rose by any other name would smell just as sweet."
-- William Shakespeare
Did you know Flink jobs are often called Flink applications? It’s not just a fancy name upgrade—there’s a good reason behind it! Unlike a single task or computation, a Flink job is like a mini software ecosystem. Calling it an 'application' captures its true essence: a complex, integrated workflow that solves big data challenges, just like any full-fledged software app.
By using 'Flink applications,' we're emphasizing their depth and sophistication. It’s not just about crunching numbers; it’s about building something comprehensive, dynamic, and powerful—just like any great app you use every day. (Curious for more details? Check out the rationale here!)
Ready to supercharge your data streaming skills? As of October 4, 2024, Apache Flink on Confluent Cloud brings incredible possibilities and a few challenges! Currently, it doesn't support the DataStream API, and its Table API is limited to creating Java- or Python-based Flink applications (click here for a list of what is supported). Don't worry, though—we've got you covered!
In this repo, you'll find example code where we use the Java DataStream API to create a custom source beyond Flink's typical capabilities. Imagine making external calls to AWS Secrets Manager to securely retrieve Kafka Cluster API Key secrets, or pulling Kafka consumer and producer client configuration properties from the AWS Systems Manager Parameter Store—all directly integrated into Flink!
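To ground that idea, here is a hedged sketch of wiring externally fetched client properties into a KafkaSource with the DataStream API. The helper class (the hypothetical ConfigLoaderSketch shown earlier), the parameter path, and the downstream handling are illustrative only, not the repo's actual implementation.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Properties;

public class FlightImporterSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical helper (see the earlier ConfigLoaderSketch) that pulls consumer
        // properties from the AWS Systems Manager Parameter Store (placeholder path).
        Properties consumerProps =
            ConfigLoaderSketch.fetchKafkaClientProperties("/kickstarter/consumer_kafka_client");

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setProperties(consumerProps)                       // bootstrap servers, SASL settings, etc.
            .setTopics("airline.skyone", "airline.sunset")
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        DataStream<String> flights =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "airline_flights");

        flights.print();                                        // stand-in for the real transformations and sinks
        env.execute("FlightImporterApp (sketch)");
    }
}
```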
Prefer Python? We've got you there too. With the Table API, you can use a User-Defined Table Function to do the same. Plus, we don't just sink data into Kafka topics; this repo showcases how to store it in Apache Iceberg tables, unlocking even more power for your data.
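For context on that Table API route, below is a hedged Java analogue of such a user-defined table function, emitting Kafka client properties as rows (the repo implements this in Python; all names here are illustrative). Once registered with createTemporarySystemFunction, it can be queried from SQL via LATERAL TABLE.

```java
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.FunctionHint;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;

import java.util.Map;

// Emits one row per Kafka client property, e.g. sourced from AWS Systems Manager Parameter Store.
@FunctionHint(output = @DataTypeHint("ROW<property_key STRING, property_value STRING>"))
public class KafkaPropertiesUdtf extends TableFunction<Row> {

    public void eval() {
        // Placeholder lookup; a real implementation would call Secrets Manager / Parameter Store here.
        Map<String, String> properties = Map.of(
            "bootstrap.servers", "<BOOTSTRAP_SERVERS>",
            "security.protocol", "SASL_SSL");

        properties.forEach((key, value) -> collect(Row.of(key, value)));
    }
}
```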
To help you start quickly, the repo comes with Docker containers for Mac M chip and non-Mac M chip machines, letting you run an Apache Flink 1.19.1 cluster locally. For the Kafka and Schema Registry resources, we're fully leveraging Confluent Cloud. Dive in and explore—the future of data streaming is here!
These are the steps:

- Take care of the cloud and local environment prerequisites listed below:
You need to have the following cloud accounts:
- AWS Account with SSO configured
- Confluent Cloud Account
- Docker Account
- GitHub Account with OIDC configured for AWS
- Snowflake Account
- Terraform Cloud Account
You need to have the following installed on your local machine:
- Clone the repo:

  git clone https://github.com/j3-signalroom/apache_flink-kickstarter.git
- Set up your Terraform Cloud environment locally or leverage GitHub Actions to create the complete setup effortlessly. Here's what you can expect:

  - A Confluent Cloud environment featuring a Kafka Cluster, fully equipped with pre-configured example Kafka topics—ready to power your data streaming needs.
  - AWS Secrets Manager securely storing API Key Secrets for the Kafka Cluster, along with AWS Systems Manager Parameter Store containing Kafka Consumer and Producer properties for easy integration.
  - An AWS S3 bucket with a dedicated warehouse folder, serving as the landing zone for Apache Iceberg Tables populated by two Python-based Flink apps, bringing your data streaming architecture to life.

- Run Apache Flink locally on your Mac, or use the provided Docker containers from the project to launch Apache Flink and Apache Iceberg seamlessly on your machine.

With these steps, you'll have everything set up to run enterprise-grade data streaming applications in no time!
Install the Terraform CLI on your local machine, and make sure you have an HCP Terraform account to run the Terraform configuration. Learn how to set up Terraform Cloud for local use by clicking here.
scripts/run-terraform-locally.sh <create | delete> --profile=<SSO_PROFILE_NAME> \
                                 --confluent-api-key=<CONFLUENT_API_KEY> \
                                 --confluent-api-secret=<CONFLUENT_API_SECRET> \
                                 --snowflake-warehouse=<SNOWFLAKE_WAREHOUSE> \
                                 --service-account-user=<SERVICE_ACCOUNT_USER> \
                                 --day-count=<DAY_COUNT> \
                                 --auto-offset-reset=<earliest | latest> \
                                 --number-of-api-keys-to-retain=<NUMBER_OF_API_KEYS_TO_RETAIN>
Argument placeholder | Replace with |
---|---|
<SSO_PROFILE_NAME> | your AWS SSO profile name for the AWS infrastructure that hosts your AWS Secrets Manager. |
<CONFLUENT_API_KEY> | your organization's Confluent Cloud API Key (also referred to as the Cloud API ID). |
<CONFLUENT_API_SECRET> | your organization's Confluent Cloud API Secret. |
<SNOWFLAKE_WAREHOUSE> | the Snowflake warehouse (or "virtual warehouse") you choose to run the resources in Snowflake. |
<SERVICE_ACCOUNT_USER> | the Snowflake service account user who is to be assigned the RSA key pairs for its authentication. |
<DAY_COUNT> | the number of days after which the API Key is rotated. |
<AUTO_OFFSET_RESET> | use earliest when you want to read from the first event in a Kafka topic; otherwise, specify latest. |
<NUMBER_OF_API_KEYS_TO_RETAIN> | the number of API keys to create and retain. |
To learn more about this script, click here.
In order to run the Terraform configuration from GitHub, the Terraform Cloud API token and Confluent Cloud API Key are required as GitHub Secret variables. Learn how to get the Terraform Cloud API token and Confluent Cloud API key here.
Follow these steps:
a. Deploy the Repository: Ensure that you have cloned or forked the repository to your GitHub account.
b. Set Required Secrets and Variables: Before running any of the GitHub workflows provided in the repository, you must define at least the AWS_DEV_ACCOUNT_ID variable (which should contain your AWS Account ID for your development environment). To do this:
- Go to the Settings of your cloned or forked repository in GitHub.
- Navigate to Secrets and Variables > Actions.
- Add the AWS_DEV_ACCOUNT_ID and any other required variables or secrets.
c. Navigate to the Actions Page:
- From the cloned or forked repository on GitHub, click on the Actions tab.
d. Select and Run the Deploy Workflow:
- Find the Deploy workflow link on the left side of the Actions page and click on it.
- On the Deploy workflow page, click the Run workflow button.
- A workflow dialog box will appear. Fill in the necessary details and click Run workflow to initiate the Terraform deployment process.
By following these steps, you will run the Terraform configuration directly from GitHub, leveraging GitHub Actions for automation and deployment.
Apache Flink's Core is Dataflow Programming
What is Apache Flink? — Architecture