Apache Spark is a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages on a variety of architectures. Spark’s speed, simplicity, and broad support for existing development environments and storage systems make it increasingly popular with a wide range of developers and relatively accessible to those learning to work with it for the first time.
In this blog post, I'm going to go out on a limb and make a connection between Spark and Legos. Legos are a product of The LEGO Group, designed to be used by a range of consumers in a variety of sets. Legos' fun, simplicity, and broad compatibility with existing construction toys and building systems make them increasingly popular with a wide range of artists and designers, and relatively accessible to those learning to work with them for the first time. See the similarities? (Learn more about using Spark in the ebook Getting Started with Apache Spark: From Inception to Production.)
Lego Batman vs. Lego Indiana Jones: Using Different Programming Languages
Let's take it a step further. Spark's capabilities can all be accessed and controlled through a rich API. Just as Lego has incorporated franchises such as Harry Potter, the Avengers, Indiana Jones, and Batman, Spark supports multiple existing programming languages, including Java, Python, Scala, SQL, and R. And just as each Lego set includes an instruction manual, there are extensive examples and tutorials for Spark. For tutorials with code samples in Java, Python, and Scala, check out the Apache Spark project website. Spark SQL, an Apache Spark module, offers native support for SQL and simplifies the process of querying data stored in Spark's own Resilient Distributed Dataset (RDD) model. Support for R is more recent; the SparkR package first appeared in June 2015 in release 1.4 of Apache Spark.
Spark gained a new DataFrames API in 2015. DataFrames offer a higher-level, tabular abstraction on top of RDDs: data is organized into named columns, much like a database table, and queries against it can be optimized by Spark's execution engine.
Deployment and Storage Options, or Do Your Legos Need a Second Basement?
You can set up a small Lego playset on the side of your desk. But if you feel the need to own the life-sized pirate ship, giant magic castle, and $350 deluxe edition superhero set, you're going to need more room. Similarly, Spark can run either standalone or as part of a cluster. It's easy to download and install Spark on a laptop or virtual machine, but that is not likely to be sufficient for production workloads operating at scale. In these circumstances, Spark will normally run on an existing big data cluster. These clusters are often used for Hadoop jobs, too, and will usually be managed by Hadoop's YARN resource manager; see Running Spark on YARN for more details. Spark can also run just as easily on Amazon Web Services' Elastic Compute Cloud (EC2) or on clusters controlled by Apache Mesos.
Regarding storage, Spark can integrate with a range of commercial or open source third-party data storage systems, including the Hadoop Distributed File System (HDFS), Apache Cassandra, Apache HBase, and Amazon S3.
Developers are most likely to choose the data storage system they are already using elsewhere in their workflow.
Building the Spark Stack
The Spark project stack currently comprises Spark Core and four libraries (Spark SQL, Spark Streaming, MLlib for machine learning, and GraphX for graph processing) that are optimized to address the requirements of four different use cases. Individual applications will typically require Spark Core and at least one of these libraries. Spark's flexibility and power become most apparent in applications that combine two or more of these libraries on top of Spark Core.
The Building Blocks
Resilient Distributed Datasets (RDDs)
The Resilient Distributed Dataset is a concept at the heart of Spark. It is designed to support in-memory data storage, distributed across a cluster in a manner that is demonstrably both fault-tolerant and efficient. Fault-tolerance is achieved, in part, by tracking the lineage of transformations applied to coarse-grained sets of data. Efficiency is achieved through parallelization of processing across multiple nodes in the cluster, and minimization of data replication between those nodes. Once data is loaded into an RDD, two basic types of operation can be carried out: transformations, which create a new RDD by applying a function to the existing data (for example, map or filter), and actions, which compute and return a result to the driver program (for example, count or collect).
The original RDD remains unchanged throughout. The chain of transformations from RDD1 to RDDn is logged, and can be repeated in the event of data loss or the failure of a cluster node.
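The lineage idea can be sketched in plain Python. This is an illustration of the concept, not Spark's actual implementation: each derived dataset records the transformations that produced it, so its contents can always be rebuilt by replaying that log against the unchanged source data.

```python
# Conceptual sketch of RDD lineage (plain Python, not Spark's API).
# Each derived dataset records the functions that produced it, so a lost
# result can be rebuilt by replaying the chain from the source data.

class LineageDataset:
    def __init__(self, source, transformations=None):
        self.source = source                           # original, immutable data
        self.transformations = transformations or []   # ordered log of steps

    def transform(self, fn):
        # Return a NEW dataset; the original is never modified.
        return LineageDataset(self.source, self.transformations + [fn])

    def compute(self):
        # Replay the logged transformations against the source data.
        data = self.source
        for fn in self.transformations:
            data = [fn(x) for x in data]
        return data

rdd1 = LineageDataset([1, 2, 3, 4])
rdd2 = rdd1.transform(lambda x: x * 10)
rdd3 = rdd2.transform(lambda x: x + 1)

print(rdd3.compute())   # [11, 21, 31, 41]
print(rdd1.source)      # [1, 2, 3, 4] -- the original is unchanged
```

If a node holding the result of `rdd3` were lost, calling `compute()` again would rebuild it from `rdd1`, which is the essence of lineage-based fault tolerance.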
Transformations are said to be lazily evaluated, meaning that they are not executed until a subsequent action has a need for the result. This will normally improve performance, as it can avoid processing data unnecessarily. It can also, in certain circumstances, introduce processing bottlenecks that cause applications to stall while waiting for a processing action to conclude.
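Lazy evaluation itself is easy to demonstrate with plain Python generators, which behave much like Spark transformations in this respect: the pipeline describes work without performing it, and nothing runs until a terminal "action" consumes the result. (This is an analogy, not Spark code.)

```python
# Illustration of lazy evaluation using Python generators (not Spark code).
# The generator expressions below describe work without performing it;
# computation happens only when an "action" (here, sum) drives the pipeline.

log = []

def expensive_square(x):
    log.append(x)      # record that real work actually happened
    return x * x

numbers = range(1, 6)
squares = (expensive_square(x) for x in numbers)   # "transformation": lazy
evens = (s for s in squares if s % 2 == 0)         # another lazy step

assert log == []        # nothing has been computed yet

result = sum(evens)     # the "action" finally triggers execution
assert result == 20     # squares of 2 and 4: 4 + 16
assert log == [1, 2, 3, 4, 5]   # every element was processed exactly once
```

The same property explains the bottleneck risk mentioned above: all of the deferred work lands on the single `sum` call, so an application can appear idle and then stall at the action that forces evaluation.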
These RDDs remain in memory where possible. This greatly increases the performance of the cluster, particularly in use cases with a requirement for iterative queries or processes.
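The payoff of keeping data in memory for iterative workloads can be illustrated with a simple cache in plain Python: repeated queries over the same dataset pay the expensive load only once. This is a sketch of the general idea, not Spark's actual caching mechanism.

```python
# Sketch of why in-memory caching helps iterative workloads (not Spark's
# actual cache). The expensive load runs once; later queries reuse it.

load_count = 0

def load_dataset():
    global load_count
    load_count += 1            # stands in for slow disk or network I/O
    return list(range(1_000))

_cache = None

def cached_dataset():
    global _cache
    if _cache is None:
        _cache = load_dataset()   # first access pays the full cost
    return _cache                 # later accesses are served from memory

# Three "iterative queries" over the same data:
total = sum(cached_dataset())
maximum = max(cached_dataset())
even_count = sum(1 for x in cached_dataset() if x % 2 == 0)

assert load_count == 1   # the data was loaded only once
```

In Spark, an analogous effect is achieved by persisting an RDD in memory across the cluster, so that iterative algorithms (such as many machine learning workloads) avoid re-reading source data on every pass.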
Why Lego and Spark Have the Same Idea
There is not yet an official Spark-themed Lego set, but I think it would work very well. For example, compare the different programming languages Spark supports to real-world franchises that have been turned into Lego. Lego can express the same basic building idea through everything from Teenage Mutant Ninja Turtles to the Lord of the Rings. Similarly, Spark can be utilized through multiple languages ranging from Python to R. Is Java really that different from Jurassic World? Could Scala be the tech version of Spiderman?
Furthermore, exactly like the way your Legos can go from one desk to taking up your entire basement, Spark can be scaled from one small machine to part of a cluster. Regarding the building blocks of Spark, using the different modules to create a Spark stack is exactly like stacking Lego bricks on top of a base. Think of how easy it is to build different ideas on the same platform—both in Lego and in Spark. In fact, some people will tell you that using Spark is even more fun than building with Legos!
In this blog post, I made a connection between Spark and Legos, and reviewed Spark deployment and storage options, the building blocks of the Spark stack, and Resilient Distributed Datasets.
If you have any additional questions about using Spark, please ask them in the comments section below.