Recently I had the opportunity to give a talk with the topic Big Data Latest and Greatest. Sounds like a marketing session? No, it is not … and I hope you will see it the same way. This blog post tries to extend this session.
What was the big idea behind this session? The Hadoop ecosystem is increasing every month and it is hard to keep in step with the latest versions and products.
That is just an excerpt of a whole bunch of technologies.
In this post we will discuss the following pieces:
Instead of digging deep into the technologies I will try to compare some of them and try to picture some use cases.
A short disclaimer: I don’t call myself an expert in all of these technologies (who would do such a thing?). Where possible I try to give you hints where you can find more resources.
Apache Drill just went Beta a few weeks ago. It is an ad-hoc query engine for Hadoop with its own query language, called DrQL, based on Google’s Dremel. So what is the difference between Dremel and Drill?
|–||Distributed File System||NoSQL||Interactive analysis||Batch processing|
It was originally developed by MapR.
As said before, Drills query language is called DrQL, but you can plug in other language and programming models as you like. Drill itself has its focus on complex, nested, semi-structured data, such as JSON, Apache Avro or Parquet. That said, it is possible to query unknown schemas and, if you like, to use the automatic schema recognition. Drill itself several databases: HDFS, HBase, MongoDB and more.
So, why do we need another query language. Drill was built for extreme scalability on top of massive parallel computation. In some scenarios more than 10.000 servers were used to analyze Petabytes of data in just a few seconds. How does that work? Drill uses a distributed engine.
Alright, sounds cool, but give me more insights! Why do we exactly need another query language?
A picture says more than thousand words!
Traditional technologies, such as MapReduce, Hive and Pig are used for data mining and major ETL processes, which can take from a few minutes to hours. That is too slow for interactive queries or reporting. That is the case for Drill!
How does Drill work?
I will try to give you a high level overview about the architecture. You will find more information on this site.
A Drillbit is the core of Apache Drill. The Drillbit is responsible for the query execution and the scheduling.
The flow of a Drill query typically involves the following steps:
- The Drill client issues a query. A Drill client is a JDBC, ODBC, command line interface or a REST API. Any Drillbit in the cluster can accept queries from the clients. There is no master-slave concept.
- The Drillbit then parses the query, optimizes it, and generates a distributed query plan that is optimized for fast and efficient execution.
- The Drillbit that accepts the query becomes the driving Drillbit node for the request. It gets a list of available Drillbit nodes in the cluster from ZooKeeper. The driving Drillbit determines the appropriate nodes to execute various query plan fragments to maximize data locality.
- The Drillbit schedules the execution of query fragments on individual nodes according to the execution plan.
- The individual nodes finish their execution and return data to the driving Drillbit.
- The driving Drillbit streams results back to the client.
Lets get a little bit deeper into a Drillbits module.
Whoa, what is all that? We start at the top.
RCP end point: Drill exposes a low overhead protobuf-based RPC protocol to communicate with the clients. Additionally, a C++ and Java API layers are also available for the client applications to interact with Drill. Clients can communicate to a specific Drillbit directly or go through a ZooKeeper quorum to discover the available Drillbits before submitting queries. It is recommended that the clients always go through ZooKeeper to shield clients from the intricacies of cluster management, such as the addition or removal of nodes.
SQL parser: Drill uses Optiq, the open source framework, to parse incoming queries. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query.
Optimizer: Drill uses various standard database optimizations such as rule based/cost based, as well as data locality and other optimization rules exposed by the storage engine to re-write and split the query. The output of the optimizer is a distributed physical query plan that represents the most efficient and fastest way to execute the query across different nodes in the cluster.
Execution engine: Drill provides a massive parallel programming (MPP) execution engine built to perform distributed query processing across the various nodes in the cluster.
Storage plugin interfaces: Drill serves as a query layer on top of several data sources. Storage plugins in Drill represent the abstractions that Drill uses to interact with the data sources. Storage plugins provide Drill with the following information:
– Metadata available in the source
– Interfaces for Drill to read from and write to data sources
– Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries on a specific data source
Distributed cache: Drill uses a distributed cache to manage metadata (not the data) and configuration information across various nodes. Sample metadata information that is stored in the cache includes query plan fragments, intermediate state of the query execution, and statistics. Drill uses Infinispan as its cache technology.
To close that section about Drill, here is how Drill looks like for a developer:
On the left you see the source document, at the top the DrQL query and at the right the output coming from the source and the query.
Alright, next one!
Tajo is a distributed data warehouse based on top of HDFS. The development started in 2010 and since 2013 it is an Apache Incubator project. Some key facts about Tajo:
- It is relational and distributed
- It supports SQL standards
- It can be seen as an alternative to Hive and Pig
Tajo is best suited for ETL processes of big data sets, such as HDFS and other data sources, online aggregations and scalable ad-hoc queries.
No, of course not. With Tajo you can use existing SQL queries.
Tajo vs Drill
What’s the difference between Tajo and Drill?
- Drill uses an aggregation query based on a full-table scan
- Tajo uses the advantages of MapReduce and parallel databases
Drill is based on a nested data model
Tajo is based on a relational data model
Drill tries to have queries with low latency
- Tajo tries to scale the execution
… and, as said before, Tajo can use common SQL queries.
Instead of rewriting the points mentioned on the project site you can read them by yourself.
At last, a short discussion about the architecture of Tajo.
Tajo has two main components, a Tajo master and its workers. The Tajo master is a dedicated server for providing client service and coordinating query masters. For each query, one query master and many task runners work together. A task runner (worker) includes a local query engine that executes a directed acyclic graph of physical operators. To manage the resources of a large cluster Tajo uses Hadoop Yarn.
Alright, lets move on.
Originally developed by UC Berkeley in 2010, Spark is one of the most discussed technologies in the Hadoop ecosystem. It is used by Groupon, IBM, NASA and Yahoo.
It can be easy described as in-memory MapReduce. Temporary results are stored in the main memory. On top of the Spark core several components were developed, such as support for structural data processing, machine learning, graph- and stream processing. Today developers can use APIs written in Scala, Java and Python.
Spark is based on top of the Hadoop storage system and is 100% compatible with HDFS, HBase and several other Hadoop storage systems.
Ok, what are use cases for Spark?
To make it short: operational dashboards and fraud detection. But these are just two use cases.
Why there is a need for Spark? Is MapReduce not good enough?
Best answer: It depends. It depends on what you would like to do. So lets compare Spark and MapReduce shortly.
Hadoop is too slow and a cluster needs a long warm up phase, which delays the job execution. It is best suited as a batch system for not-time-critical analysis. On the other side, Spark is used as an online system to do time-critical analysis. That said, Hadoop is not a real-time system, Spark can be used as one.
I already said that Spark has some components built on top of its core.
- Spark Core: The cornerstone for all other functionality and is used for the in-memory execution.
- Spark SQL: A engine for Hive data which provides a faster execution up to factor 100 on existing systems.
- Spark Streaming: This component is used for real-time execution of new data, such as from HDFS, Flume, Kafka or Twitter.
- MLlib: MLlib is a scalable library for machine learning.
Alright, I think you are ready for an example. As in most other articles I will use the word count example.
file = spark.textFile(“hdfs://…”)
counts = file.flatMap(lambda line: line.split(“”)) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
Seems simple, right? Now the same in another language.
val file = spark.textFile(“hdfs://…”)
val counts = file.flatMap(line => line.split(“”))
.map(word => (word, 1))
Ok, we have seen a whole bunch of technologies up to this point. Lets finish it with a look at Storm.
The companies behind the development of Storm will give you an idea what Storm is about before you read what it is all about: Twitter and BackType. It is a distributed real-time computation system based on data streams. Storm was written in Clojure and Java and was publicly released in 2011. Who uses Storm today? As you can guess one company using Storm is Twitter. Additionally Yahoo, Groupon and Spotify use components based on Storm.
The basic idea behind Storm is to simplify working with queues and workers. Before I dig deeper: Storm is not in a competition with Hadoop or MapReduce.
Storm can be used for filtering data, meaning to filter out data before it is persisted. It also can be used for data aggregation with a quick aggregation of several data sources into one central data store. Another use case in the continuous processing. You can continuously query messages, process them and keep your data stores up to date.
Enough theory! Do you have some concrete examples?
Sure! Here are some of them:
- Fraud detection
- Detection of hot topics in social networks
- Creation of individual targeted commercials
Streams and Topologies
Before we compare Storm and Spark it is time to discuss the components of Storm. Streams are the main abstraction in Storm. A stream is a sequence of tuples. Tuples are a data structure in Storm, representing a list of values, whereby the values can be of any type.
A topology is a graph, where every node is either a Spout or a Bolt (more on these in a minute).
Spout and Bolt
As seen in the picture above Storm a topology has two main components: Spouts and Bolts.
A Spout is a component which is used as the source for loading streams into a topology. During the load phase it reads tuples from external sources.
A Bolt is responsible for the processing of a message. It can filter, aggregate, connection to databases, etc. A Bolt can do simple stream transformations. For more complex transformations you can chain Bolts. It can process several streams from Spouts and other Bolts.
Workers, Executors and Tasks
A worker processes a subset of a topology. It executes one or more components on one ore more executors. An executor is a created thread of a worker and executes tasks. A task executes the actual data processing of the Spouts and the Bolts.
As before, we will use the word count example to show how Storm works.
In the picture above a file is used as the data source (it could also be Tweets). The File Spout creates the tuples out of the file. This Spout send the data of the file line by line to a Splitter Bolt, which splits the lines into words. These words are processed by a Counter Bolt. In some variations you can also have a Report Bolt after the Counter Bolt.
Storm vs Spark
Spark is best used for …
- … speeding up batch analysis
- … iterative machine learning
- … interactive query and graph processing
Storm is best used for …
- … event processing
- … a workflow-like processing of data
Additionally, Spark brings its own components for machine learning, etc.
But here comes the funny part!
Storm is a stream processing framework, which also can do micro batches.
Spark is a batch processing framework, which also can do micro batches.
Ok, what is the difference?
Give more differences?
Spark is …
- … better suited for data parallel computation.
- … suited for batch processing of incoming updates in user-defined intervals.
Storm is …
- … better suited for task parallel computation.
- … suited for the individual processing of incoming events.
Lets finish this comparison with some words about fault tolerance.
Spark processes data once … and just once. Storm at the other side processes data at least once. With Storm you have to decide if idempotency is important or not, if duplicates are changing the result.
Before this article ends, let me try to classify the different technologies.
I hope you had fun reading this one and as said before … I hope that this article was not a marketing one.