title | description | keywords | services | documentationcenter | author | manager | editor | ms.assetid | ms.service | ms.custom | ms.devlang | ms.topic | ms.tgt_pltfrm | ms.workload | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Apache Spark streaming with Kafka - Azure HDInsight \| Microsoft Docs | Learn how to use Apache Spark to stream data into or out of Apache Kafka using DStreams. In this example, you stream data using a Jupyter notebook from Spark on HDInsight. | kafka example,kafka zookeeper,spark streaming kafka,spark streaming kafka example | hdinsight | | Blackmist | jhubbard | cgronlun | dd8f53c1-bdee-4921-b683-3be4c46c2039 | hdinsight | hdinsightactive | | article | na | big-data | 06/13/2017 | larryfr |
Learn how to use Apache Spark to stream data into or out of Apache Kafka on HDInsight using DStreams. This example uses a Jupyter notebook that runs on the Spark cluster.
Note
The steps in this document create an Azure resource group that contains both a Spark on HDInsight and a Kafka on HDInsight cluster. These clusters are both located within an Azure Virtual Network, which allows the Spark cluster to directly communicate with the Kafka cluster.
When you are done with the steps in this document, remember to delete the clusters to avoid excess charges.
Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. Anything that talks to Kafka must be in the same Azure virtual network as the nodes in the Kafka cluster. For this example, both the Kafka and Spark clusters are located in an Azure virtual network. The following diagram shows how communication flows between the clusters:
Note
Though Kafka itself is limited to communication within the virtual network, other services on the cluster such as SSH and Ambari can be accessed over the internet. For more information on the public ports available with HDInsight, see Ports and URIs used by HDInsight.
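As a rough illustration of which endpoints are reachable from outside the virtual network, the following Python sketch derives the public Ambari and SSH addresses from a cluster name. It assumes the standard HDInsight naming pattern (`CLUSTERNAME.azurehdinsight.net` for HTTPS and `CLUSTERNAME-ssh.azurehdinsight.net` for SSH). The function name `public_endpoints` is hypothetical, and `kafka-hdi` is only the example cluster name used in this document.

```python
def public_endpoints(cluster_name):
    """Return the publicly reachable endpoints for an HDInsight cluster.

    Assumes the standard HDInsight DNS naming pattern. Kafka brokers are
    deliberately absent: they are reachable only inside the virtual network.
    """
    return {
        # Ambari web UI and REST API, served over HTTPS (port 443).
        "ambari": "https://{0}.azurehdinsight.net".format(cluster_name),
        # SSH host name for the cluster head node (port 22).
        "ssh": "{0}-ssh.azurehdinsight.net".format(cluster_name),
    }


endpoints = public_endpoints("kafka-hdi")
print(endpoints["ambari"])
print(endpoints["ssh"])
```

The dictionary intentionally has no Kafka entry, which mirrors the restriction described above: a client outside the virtual network has no broker address to connect to.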
While you can create an Azure virtual network, Kafka, and Spark clusters manually, it's easier to use an Azure Resource Manager template. Use the following steps to deploy an Azure virtual network, Kafka, and Spark clusters to your Azure subscription.
-
Use the following button to sign in to Azure and open the template in the Azure portal.
The Azure Resource Manager template is located at https://hditutorialdata.blob.core.windows.net/armtemplates/create-linux-based-kafka-spark-cluster-in-vnet-v2.1.json.
[!WARNING] To guarantee availability of Kafka on HDInsight, your cluster must contain at least three worker nodes. This template creates a Kafka cluster that contains three worker nodes.
This template creates an HDInsight 3.6 cluster for both Kafka and Spark.
-
Use the following information to populate the entries on the Custom deployment blade:
-
Resource group: Create a group or select an existing one. This group contains the HDInsight cluster.
-
Location: Select a location geographically close to you.
-
Base Cluster Name: This value is used as the base name for the Spark and Kafka clusters. For example, entering hdi creates a Spark cluster named spark-hdi and a Kafka cluster named kafka-hdi.
-
Cluster Login User Name: The admin user name for the Spark and Kafka clusters.
-
Cluster Login Password: The admin user password for the Spark and Kafka clusters.
-
SSH User Name: The SSH user to create for the Spark and Kafka clusters.
-
SSH Password: The password for the SSH user for the Spark and Kafka clusters.
-
Read the Terms and Conditions, and then select I agree to the terms and conditions stated above.
-
Finally, check Pin to dashboard and then select Purchase. It takes about 20 minutes to create the clusters.
Once the resources have been created, you are redirected to a blade for the resource group that contains the clusters and web dashboard.
Important
Notice that the names of the HDInsight clusters are spark-BASENAME and kafka-BASENAME, where BASENAME is the name you provided to the template. You use these names in later steps when connecting to the clusters.
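If you prefer to keep the deployment values in a file rather than typing them into the portal, the following Python sketch shows one way to derive the cluster names from the base name and to assemble an Azure Resource Manager parameters document. The parameter keys (`baseClusterName`, `clusterLoginUserName`, and so on) are assumptions modeled on the portal field names, so check them against the `parameters` section of the template before use. The function names and placeholder values are hypothetical.

```python
import json


def cluster_names(base_name):
    """Derive the cluster names the template creates from the base name."""
    return {"spark": "spark-" + base_name, "kafka": "kafka-" + base_name}


def build_parameters(base_name, login_user, login_password, ssh_user, ssh_password):
    """Build a Resource Manager parameters document as a dictionary.

    The keys below are assumed names; verify them against the template.
    """
    values = {
        "baseClusterName": base_name,
        "clusterLoginUserName": login_user,
        "clusterLoginPassword": login_password,
        "sshUserName": ssh_user,
        "sshPassword": ssh_password,
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        # Each parameter is wrapped in an object with a "value" member.
        "parameters": {key: {"value": value} for key, value in values.items()},
    }


names = cluster_names("hdi")
print(names["spark"], names["kafka"])

# Placeholder credentials only; never commit real passwords to a file.
document = build_parameters("hdi", "admin", "<login-password>", "sshuser", "<ssh-password>")
print(json.dumps(document, indent=2))
```

Keeping the name derivation in one small function makes it harder to mistype `spark-BASENAME` or `kafka-BASENAME` when you connect to the clusters later.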
The code for the example described in this document is available at https://github.com/Azure-Samples/hdinsight-spark-scala-kafka.
Follow the steps in the README.md file to complete this example.
[!INCLUDE delete-cluster-warning]
Because the steps in this document create both clusters in the same Azure resource group, you can delete the resource group in the Azure portal. Deleting the group removes all resources created by following this document, including the Azure Virtual Network and the storage account used by the clusters.
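As an alternative to the portal, the resource group can also be deleted from a script. The following Python sketch builds and optionally runs an `az group delete` command. It assumes the Azure CLI is installed and that you have already signed in with `az login`. The function names and the resource group name `my-kafka-spark-rg` are hypothetical.

```python
import subprocess


def delete_group_command(resource_group, wait=True):
    """Build the Azure CLI command that deletes a resource group."""
    # --yes skips the interactive confirmation prompt.
    command = ["az", "group", "delete", "--name", resource_group, "--yes"]
    if not wait:
        # --no-wait returns immediately instead of blocking until deletion finishes.
        command.append("--no-wait")
    return command


def delete_group(resource_group, wait=True):
    """Run the delete command; raises CalledProcessError if the CLI fails."""
    subprocess.check_call(delete_group_command(resource_group, wait))


# Print the command for review before running it; deletion cannot be undone.
print(" ".join(delete_group_command("my-kafka-spark-rg")))
```

Only call `delete_group` once you have confirmed that the resource group contains nothing you want to keep, because the deletion removes both clusters, the virtual network, and the storage account.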
In this example, you learned how to use Spark to read from and write to Kafka. Use the following links to discover other ways to work with Kafka: