Zeppelin and Spark SQL on HDInsight

Interactive analysis has become a major part of data science. Two tools have become very popular: Jupyter and Zeppelin.
This article will show you how to provision a Spark cluster and run analysis on it with the help of Zeppelin.

What is HDInsight?

Simply put, HDInsight is Microsoft’s version of Hadoop running on Azure. Working together with Hortonworks, Microsoft added support for Windows Server to Hadoop. HDInsight has different kinds of clusters: the normal HDInsight cluster, the Storm cluster and the Spark cluster. Lately a new cluster version running on Linux was published as a preview. HDInsight tries to make Big Data more affordable. You can start a cluster when you need it and shut it down as soon as the work is done. To get this working Microsoft added support for Azure Blob Storage as the storage engine for Hadoop (you can easily identify Hadoop clusters using Azure Blob Storage by looking for wasb in the cluster configuration).

What is Spark SQL?

During the last two years Spark gained a lot of attention from several companies. The creators of Spark identified disk storage as one of the reasons why Hadoop is not as fast as it could be. With Spark most of the data is kept in memory. A nice thing about the Spark API is its lazy evaluation. To understand lazy evaluation in Spark you have to know the two kinds of operations in Spark: transformations and actions. An RDD (Resilient Distributed Dataset) is only evaluated when you call an action on it. Until then the RDD just records the transformations called on it as a Directed Acyclic Graph. With that graph Spark can recalculate an RDD if a node crashes before an operation is finished.
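As a minimal sketch of the difference between transformations and actions, the following lines can be run in a Zeppelin note or in the Spark shell. The sketch assumes the predefined SparkContext `sc`; the numbers are made up for illustration.

```scala
// Assumption: `sc` is the SparkContext that Zeppelin and the Spark shell predefine.
val numbers = sc.parallelize(1 to 10)    // creates an RDD, nothing is computed yet

// Transformations only describe the work. No job runs here.
val evens  = numbers.filter(_ % 2 == 0)
val scaled = evens.map(_ * 10)

// The recorded lineage (the graph of transformations) can be inspected.
println(scaled.toDebugString)

// collect() is an action: only now does Spark run the filter and the map.
val result = scaled.collect()
println(result.mkString(", "))           // prints 20, 40, 60, 80, 100
```

If a node is lost before `collect()` finishes, Spark can replay the filter and the map from this lineage instead of restoring a copy of the data.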
Spark itself consists of several components. The underlying APIs are part of Spark Core, which contains abstractions such as RDDs. On top of Spark Core you have Spark Streaming, GraphX, MLlib and Spark SQL. With Spark Streaming, as the name implies, you can create streaming applications in a micro-batch fashion by defining a batch interval. GraphX is the component for working with graphs in Spark. The third component is MLlib, which is used for machine learning inside Spark. The component we will use from this point on is Spark SQL, a SQL abstraction that makes it easier to work with different kinds of datasets.
But how did Spark gain so much traction? In most cases you do not use one component by itself. You combine them. Imagine a streaming application which has to score a model on the fly with reference data. In that case you would use Spark Streaming as the initial application type, use Spark SQL to get the reference data and score the model via MLlib.

What is Zeppelin?

Zeppelin has notebooks. Notebooks? Notebooks are a nice way to interactively explore a dataset by evaluating one operation after another. Things you do in the interactive Spark shell can be done in a graphical way with nice visualizations. But the key thing is that you can share notebooks. Have you ever got to the point where you thought “Why can’t I remember the commands I typed into the shell, so I could share them with one of my co-workers?”
That is the use case for notebooks. Notebooks are not tied to Spark. You can use them with regular Python, Scala and anything else you can get a kernel for. The kernel is the language binding between the notebook and the engine that executes the code. If you love F#, there is also a kernel for it. Today there are two popular notebook tools. The first one was already mentioned, Zeppelin. The second one is called Jupyter and grew out of IPython.

Create a Spark cluster on Azure

Let’s start our journey by going to the Azure portal. Click on NEW at the top-left corner.

hdinsight-zeppelin-01

Go to Data + Analytics and select HDInsight.

hdinsight-zeppelin-02

We will create a new Spark cluster, which is currently in preview, running on Windows Server.

hdinsight-zeppelin-03

To access our cluster later we need to provide some credentials. As a hint: don’t name your cluster administrator admin. If you do, bad things can happen.

hdinsight-zeppelin-04

The next thing we need to define is our data source. If you are just starting to play with HDInsight I recommend creating a new storage account. This storage account contains everything needed for our cluster as well as some sample data, which we will use later instead of spending a long time looking for an appropriate dataset.

hdinsight-zeppelin-05

We also need to define the size of our cluster and the size of the worker nodes and the head node. To make things simple I go with the default sizes.

hdinsight-zeppelin-06

The last thing we need to do is hit Create and the cluster creation will start.

hdinsight-zeppelin-07

This could take a while, so get yourself a coffee or go shopping.

Open Zeppelin

After the cluster is successfully created we need to access Zeppelin. On the cluster page in the portal go to the Quick Links and click on Cluster Dashboards.

hdinsight-zeppelin-08

The one of interest to us is the Zeppelin Notebook.

hdinsight-zeppelin-09

After authenticating yourself you should see something similar to the following.

hdinsight-zeppelin-10

In Zeppelin you can create a note to get your analysis started. To do so, click on Notebook at the top and choose Create new note.

hdinsight-zeppelin-11

Give the notebook a name…

hdinsight-zeppelin-12

…and you should see a nice blank notebook.

hdinsight-zeppelin-13

To see which interpreter bindings are active go to the settings at the top-right.

hdinsight-zeppelin-14

As you can see, we only have two bindings, one for Spark and one for Markdown, but that should be enough for our use case. These bindings enable some magic under the hood. For example, with %md you can write Markdown in your notebook to add some documentation to it.

Load a dataset

Alright, now that everything is in place we can get started. Can we? Are we missing something? Of course, we need some data to play around with. As mentioned before, with the creation of our cluster a new storage account was created and this storage account contains … some datasets. So let’s go to our storage account in the portal.

hdinsight-zeppelin-15

Choose the account which was created during the cluster creation. On the page for the blob service you should see one container. Click on it and you should see a new blade opening with a lot of folders in it.

hdinsight-zeppelin-16

In the folder HdiSamples you will find five different datasets which can be used for different use cases. I’d like to go with the dataset containing Twitter trends. When you click on the appropriate folder you will find a simple text file containing our data.

hdinsight-zeppelin-17

If you want, you can download the file to see what the data looks like. The important thing is that you make a note of the storage location of the file. We will need it very soon. Let’s go back to our notebook.

Simple analysis

Back in our notebook we want to load our dataset and do some analysis on it with Spark SQL. To use Spark SQL we need a SQLContext. Luckily this is already built into Zeppelin, so we don’t need to define one. If you had a look at the dataset you will have noticed that it is JSON. That will save us a lot of time, because we don’t have to parse it ourselves. Let Spark do the work for us.

val tweets = sqlContext.jsonFile("wasb:///HdiSamples/TwitterTrendsSampleData/tweets.txt")

If you aren’t familiar with Scala: we use the sqlContext from Zeppelin and call a function that loads the JSON file from our storage account and stores the result in tweets. wasb is the URI scheme HDInsight uses for Azure Blob Storage; it can also be used from other Hadoop clusters that include the Azure storage connector. If you click run (or press Shift + Enter) you should see that a DataFrame with the data was created for us.

hdinsight-zeppelin-18
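The short form wasb:/// used above points at the default container of the cluster’s storage account. To read from another container or account, the scheme also has a fully qualified form. Here is a small sketch of how such a URI is put together. The helper wasbUri and the container and account names are hypothetical and only serve as illustration; they are not part of any Spark or HDInsight API.

```scala
// Hypothetical helper: builds a fully qualified wasb URI of the form
//   wasb://<container>@<account>.blob.core.windows.net/<path>
def wasbUri(container: String, account: String, path: String): String = {
  val cleanPath = path.stripPrefix("/")   // avoid a double slash after the host
  s"wasb://${container}@${account}.blob.core.windows.net/${cleanPath}"
}

// "mycontainer" and "myaccount" are placeholders, replace them with your own names.
val uri = wasbUri("mycontainer", "myaccount", "/HdiSamples/TwitterTrendsSampleData/tweets.txt")
println(uri)
// prints wasb://mycontainer@myaccount.blob.core.windows.net/HdiSamples/TwitterTrendsSampleData/tweets.txt
```

The resulting string can be passed to jsonFile in the same way as the short form.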

A DataFrame is a new abstraction in Spark similar to the one used in pandas.

Wouldn’t it be cool to see the actual schema of our data? That’s easy.

tweets.printSchema()

The output should look similar to the following one.

hdinsight-zeppelin-19
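printSchema is meant for reading. If you want to work with the schema in code, a DataFrame also exposes its column names and types. A short sketch, assuming the tweets DataFrame from above; which columns show up depends on the dataset, so no output is shown here.

```scala
// Assumption: `tweets` is the DataFrame loaded earlier in this note.
val topLevelColumns = tweets.columns      // Array[String] with the top-level column names
println(topLevelColumns.mkString(", "))

// dtypes pairs each column name with its type as a string.
tweets.dtypes.take(5).foreach { case (name, dataType) =>
  println(s"$name: $dataType")
}
```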

That’s a lot of magic jsonFile has done for us. Let’s now register the data as a table called tweets.

tweets.registerTempTable("tweets")

With that we can access the data from Spark SQL. To get the text of five tweets we would use the following code snippet.

sqlContext.sql("select text from tweets limit 5").collect().foreach(println)

The parameter to the sql function should be easy to read, so why do we need the rest? The collect function is an action that starts the actual computation (remember the difference between transformations and actions in Spark) and returns the resulting rows to the notebook. The foreach function with the println function passed to it lets us print the results in a readable fashion.

hdinsight-zeppelin-21
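The same query can also be written without a SQL string by using the DataFrame functions directly. This is a sketch of an alternative, not a replacement for the snippet above. It assumes the tweets DataFrame loaded earlier and only uses the text column.

```scala
// Assumption: `tweets` is the DataFrame loaded earlier in this note.
// select and limit are transformations, so nothing is computed on this line.
val firstTexts = tweets.select("text").limit(5)

// collect() is the action that runs the job and returns the rows.
val rows = firstTexts.collect()
rows.foreach(row => println(row.getString(0)))

// show() is another action; it prints the rows as a small text table.
firstTexts.show()
```

Using getString(0) prints the plain tweet text instead of the bracketed form you get when printing a whole row.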

With that you have a starting point to play around with the data. If you are curious you should have a look at the official Zeppelin documentation and the Spark SQL documentation.
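As one possible next step, here is a sketch of two more queries against the registered table. It assumes the temporary table tweets from above and only touches the text column. The search term Azure is an arbitrary example.

```scala
// Assumption: the temporary table `tweets` was registered earlier in this note.
// Count all tweets in the sample.
val total = sqlContext.sql("select count(*) from tweets").collect()(0).getLong(0)
println(s"Tweets in the sample: $total")

// Filter by a keyword. "Azure" is just an example search term.
val matching = sqlContext.sql("select text from tweets where text like '%Azure%'")
println(s"Tweets mentioning Azure: ${matching.count()}")

// take(3) is an action that fetches at most three rows.
matching.take(3).foreach(println)
```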

A last tip: HDInsight can get very expensive, because a running cluster is billed even when it sits idle. If you don’t need your cluster, shut it down.

Summary

In this article you got a short introduction to HDInsight, Spark, Spark SQL and finally Zeppelin. We covered a lot of ground … and there is plenty of it left. If you want to learn Spark or just play around with it, set up a cluster and use Zeppelin to fire off your queries.

Jan (@Horizon_Net)
