title	description	keywords	services	documentationcenter	author	manager	editor	ms.assetid	ms.service	ms.custom	ms.devlang	ms.topic	ms.tgt_pltfrm	ms.workload	ms.date	ms.author
Apache Spark streaming with Kafka - Azure HDInsight \| Microsoft Docs	Learn how to use Apache Spark to stream data into or out of Apache Kafka using DStreams. In this example, you stream data using a Jupyter notebook from Spark on HDInsight.	kafka example,kafka zookeeper,spark streaming kafka,spark streaming kafka example	hdinsight		Blackmist	jhubbard	cgronlun	dd8f53c1-bdee-4921-b683-3be4c46c2039	hdinsight	hdinsightactive		article	na	big-data	06/13/2017	larryfr

Apache Spark streaming (DStream) example with Kafka (preview) on HDInsight

Learn how to use Apache Spark to stream data into or out of Apache Kafka on HDInsight using DStreams. This example uses a Jupyter notebook that runs on the Spark cluster.

Note

The steps in this document create an Azure resource group that contains both a Spark on HDInsight and a Kafka on HDInsight cluster. These clusters are both located within an Azure Virtual Network, which allows the Spark cluster to directly communicate with the Kafka cluster.

When you are done with the steps in this document, remember to delete the clusters to avoid excess charges.

Create the clusters

Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet. Anything that talks to Kafka must be in the same Azure virtual network as the nodes in the Kafka cluster. For this example, both the Kafka and Spark clusters are located in an Azure virtual network. The following diagram shows how communication flows between the clusters:

Note

Though Kafka itself is limited to communication within the virtual network, other services on the cluster such as SSH and Ambari can be accessed over the internet. For more information on the public ports available with HDInsight, see Ports and URIs used by HDInsight.
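Because the brokers are only reachable from inside the virtual network, a quick TCP check run from a machine in that network (for example, a Spark head node) can confirm connectivity before you start streaming. The following is a minimal sketch, not part of the original example. It assumes the brokers listen on port 9092, which is the Kafka default; confirm the port for your cluster. The `can_reach` helper name is hypothetical.

```python
import socket

# Assumption: 9092 is the default Kafka broker port. Confirm the value
# that your cluster actually uses before relying on it.
KAFKA_BROKER_PORT = 9092


def can_reach(host, port=KAFKA_BROKER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the context manager closes it again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Call `can_reach` with the internal host name of a Kafka worker node. A result of `False` from a machine outside the virtual network is expected, since the brokers are not exposed publicly.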

While you can create an Azure virtual network, Kafka, and Spark clusters manually, it's easier to use an Azure Resource Manager template. Use the following steps to deploy an Azure virtual network, Kafka, and Spark clusters to your Azure subscription.

Use the following button to sign in to Azure and open the template in the Azure portal.

The Azure Resource Manager template is located at https://hditutorialdata.blob.core.windows.net/armtemplates/create-linux-based-kafka-spark-cluster-in-vnet-v2.1.json.

Warning

To guarantee availability of Kafka on HDInsight, your cluster must contain at least three worker nodes. This template creates a Kafka cluster that contains three worker nodes.

This template creates an HDInsight 3.6 cluster for both Kafka and Spark.
Use the following information to populate the entries on the Custom deployment blade:
- Resource group: Create a group or select an existing one. This group contains the HDInsight clusters.
- Location: Select a location geographically close to you.
- Base Cluster Name: This value is used as the base name for the Spark and Kafka clusters. For example, entering hdi creates a Spark cluster named spark-hdi and a Kafka cluster named kafka-hdi.
- Cluster Login User Name: The admin user name for the Spark and Kafka clusters.
- Cluster Login Password: The admin user password for the Spark and Kafka clusters.
- SSH User Name: The SSH user to create for the Spark and Kafka clusters.
- SSH Password: The password for the SSH user for the Spark and Kafka clusters.
Read the Terms and Conditions, and then select I agree to the terms and conditions stated above.
Finally, check Pin to dashboard and then select Purchase. It takes about 20 minutes to create the clusters.
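If you prefer to deploy the template from a script instead of the portal, the same values can be collected in a Resource Manager parameters file. The following sketch only builds that file's structure. The parameter keys (`baseClusterName`, `clusterLoginUserName`, and so on) are assumptions derived from the portal labels above; check the `parameters` section of the template for the actual names. The resource group and location are supplied to the deployment itself and are not included here.

```python
import json

# Schema identifier for Resource Manager deployment parameter files.
PARAMETERS_SCHEMA = (
    "https://schema.management.azure.com/schemas/2015-01-01/"
    "deploymentParameters.json#"
)


def build_parameters(base_name, login_user, login_password, ssh_user, ssh_password):
    """Return a dict shaped like a Resource Manager parameters file."""
    # Assumption: these keys mirror the portal labels. The template may
    # use different names, so verify them before deploying.
    values = {
        "baseClusterName": base_name,
        "clusterLoginUserName": login_user,
        "clusterLoginPassword": login_password,
        "sshUserName": ssh_user,
        "sshPassword": ssh_password,
    }
    return {
        "$schema": PARAMETERS_SCHEMA,
        "contentVersion": "1.0.0.0",
        # Each parameter is wrapped in an object with a "value" field.
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


def to_json(parameters):
    """Serialize the parameters dict so it can be saved to a file."""
    return json.dumps(parameters, indent=2)
```

Avoid committing the generated file to source control, because it contains the cluster and SSH passwords in plain text.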

Once the resources have been created, you are redirected to a blade for the resource group that contains the clusters and web dashboard.

Important

Notice that the names of the HDInsight clusters are spark-BASENAME and kafka-BASENAME, where BASENAME is the name you provided to the template. You use these names in later steps when connecting to the clusters.
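The naming rule above can be captured in a small helper so that later steps do not repeat the names by hand. This is an illustrative sketch. The `spark-` and `kafka-` prefixes come from this document; the SSH and web endpoint patterns are the usual HDInsight conventions and are stated here as assumptions, so confirm them in the Azure portal for your clusters. All function names are hypothetical.

```python
def cluster_names(base_name):
    """Return the cluster names that the template derives from the base name."""
    return {
        "spark": f"spark-{base_name}",
        "kafka": f"kafka-{base_name}",
    }


def ssh_host(cluster_name):
    """Return the public SSH endpoint for a cluster.

    Assumption: the standard HDInsight pattern CLUSTERNAME-ssh.azurehdinsight.net.
    """
    return f"{cluster_name}-ssh.azurehdinsight.net"


def web_url(cluster_name):
    """Return the cluster's web endpoint (Ambari).

    Assumption: the standard HDInsight pattern https://CLUSTERNAME.azurehdinsight.net.
    """
    return f"https://{cluster_name}.azurehdinsight.net"
```

For example, with a base name of hdi, `cluster_names("hdi")["kafka"]` gives kafka-hdi, and `ssh_host("kafka-hdi")` gives the host to use when connecting to the Kafka cluster over SSH.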

Use the notebooks

The code for the example described in this document is available at https://github.com/Azure-Samples/hdinsight-spark-scala-kafka.

Follow the steps in the README.md file to complete this example.

Delete the cluster

[!INCLUDE delete-cluster-warning]

Since the steps in this document create both clusters in the same Azure resource group, you can delete the resource group in the Azure portal. Deleting the group removes all resources created by following this document, including the Azure virtual network and the storage account used by the clusters.
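As an alternative to the portal, the resource group can be deleted with the Azure CLI. The sketch below only builds the `az group delete` command. It assumes the Azure CLI is installed and that you have already signed in with `az login`. The `delete_group_command` helper and the resource group name in the usage note are hypothetical.

```python
import shlex


def delete_group_command(resource_group, wait=True):
    """Build the Azure CLI command that deletes a resource group."""
    # --yes skips the interactive confirmation prompt.
    command = ["az", "group", "delete", "--name", resource_group, "--yes"]
    if not wait:
        # --no-wait returns immediately instead of blocking until the
        # deletion finishes.
        command.append("--no-wait")
    return command


def as_shell_text(command):
    """Render the command as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

To run the command from Python, pass the list to `subprocess.run(command, check=True)`. For a group named my-kafka-spark-rg, `as_shell_text(delete_group_command("my-kafka-spark-rg"))` produces a line you can paste into a terminal. Deletion is irreversible, so double-check the group name first.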

Next steps

In this example, you learned how to use Spark to read and write to Kafka. Use the following links to discover other ways to work with Kafka:

Get started with Apache Kafka on HDInsight
Use MirrorMaker to create a replica of Kafka on HDInsight
Use Apache Storm with Kafka on HDInsight
