Beginning last year, Duke OIT’s Mark McCahill started working with Apache Spark as a resource for students in the Statistical Computing and Computation (STA663) class taught by Cliburn Chan and Janice McCarthy of the Department of Biostatistics and Bioinformatics. This initial foray yielded good results, both proving the utility of the tools for teaching and pointing out the complexity and limitations of Spark’s underlying setup. His work has continued and is now in a true pilot phase, with students in the Data+ program of the Information Initiative at Duke (iiD) as the primary testers.

This evolution of Apache Spark at Duke adds to the list of high-performance computing options available to Duke researchers.

“Spark distributes the computing work across a number of servers/CPUs, but the real ingenuity comes in how Spark can optimize and organize computational tasks,” McCahill said. “Say you want to run a computation on a subset of a large data collection. Some applications will run the compute operation on the entire dataset first and then pull out the subset of interest. That’s really inefficient, because you’re computing and then throwing away work that wasn’t of interest to begin with. Spark has optimizers that can evaluate what you want to accomplish, then optimize code to minimize useless work so things are done more efficiently.”
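
To make that concrete, here is a minimal PySpark sketch of the filter-then-aggregate pattern McCahill describes; the file path and column names (measurements.parquet, site, value) are illustrative placeholders, not part of Duke’s setup.

    # Minimal sketch: nothing runs until an action is called, so Spark's
    # Catalyst optimizer can push the filter ahead of the aggregation.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("subset-demo").getOrCreate()

    # Build a plan; no data is read yet.
    df = spark.read.parquet("hdfs:///data/measurements.parquet")
    subset_avg = df.filter(F.col("site") == "Durham").agg(F.avg("value"))

    # Only now does Spark read and aggregate the rows of interest; work on
    # the rest of the dataset is never done.
    subset_avg.show()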

McCahill’s setup also uses Jupyter as an interactive front end to Spark. Jupyter creates a rich web-browser-based interface that can serve many purposes, such as drawing graphs and data visualizations, and it gives users a graphical environment they are already familiar with. “Jupyter is one option, and a very effective one, but other front ends also make users’ experiences friendlier,” McCahill said. “RStudio and Shiny are others. With a bit of study of the Livy API (which provides remote access to a Spark cluster), you can build your own interfaces or even create a machine-to-machine interface.”
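
As a rough illustration of that machine-to-machine route, the sketch below drives a remote Spark cluster through Livy’s REST interface using Python’s requests library; the Livy endpoint URL is a placeholder and error handling is omitted.

    # Hedged sketch of remote access via the Livy REST API (placeholder URL).
    import json
    import requests

    livy = "http://livy.example.edu:8998"
    headers = {"Content-Type": "application/json"}

    # Start an interactive PySpark session on the cluster.
    session = requests.post(livy + "/sessions",
                            data=json.dumps({"kind": "pyspark"}),
                            headers=headers).json()

    # Submit a statement to that session; poll /sessions/{id}/statements/{n}
    # afterward to collect its output once the state is "available".
    requests.post("{}/sessions/{}/statements".format(livy, session["id"]),
                  data=json.dumps({"code": "sc.parallelize(range(100)).sum()"}),
                  headers=headers)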

McCahill used Research Toolkits’ RAPID virtual machines for his initial setup for Chan and McCarthy’s statistics course. The pilot running through this summer uses a mix of RAPID machines and computers that typically see less use during the summer months. “It goes to show that RAPID machines aren’t just for smaller computational work,” McCahill explained. “RAPID made it easy to experiment with different cluster configurations, because it was quick to set up and tear down VMs, then use Spark to tie a collection of VMs together into a Spark cluster.”
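
From the user’s side, tying a collection of VMs into a Spark cluster comes down to pointing a driver at the cluster’s master. The snippet below is a sketch under assumed names: the master hostname is a placeholder, and the actual configuration of the RAPID-based cluster may differ.

    # Sketch: connect to a standalone Spark master running on one of the VMs.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://spark-master.example.edu:7077")  # placeholder host
             .appName("rapid-cluster-test")
             .getOrCreate())

    # Every VM running a Spark worker registered with that master contributes
    # its cores and memory to this job.
    print(spark.sparkContext.parallelize(range(1000)).sum())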

The next step? As the summer months wane, McCahill will shift to creating a Spark instance that can be applied to research computing. The lessons of the summer pilot will come in for a smooth landing and set up a fall takeoff for a new computing option for Duke researchers.

About Apache Spark

Apache Spark (https://spark.apache.org) is an open-source project inspired by Google’s MapReduce and the Apache Hadoop project. Although people sometimes misunderstand Spark as a replacement for Hadoop, that’s not the case. Hadoop is concerned with managing, storing, and indexing large data; Spark focuses on speeding up the analysis of large data and isn’t primarily a tool for data storage. The two systems can work separately, but they work especially well together. Because Spark performs its computation in RAM, it’s blazingly fast, up to 100 times faster than MapReduce, and it’s well suited to working with data stored in the Hadoop Distributed File System (HDFS).
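
A short, hypothetical example of that division of labor: HDFS holds the files, and Spark reads them and caches them in memory for repeated analysis. The HDFS path below is made up.

    # Sketch: Hadoop (HDFS) stores the data, Spark analyzes it in memory.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

    logs = spark.read.text("hdfs:///datasets/weblogs/*.log")  # placeholder path
    logs.cache()  # keep the dataset in RAM after the first read

    print(logs.count())                                       # reads HDFS, fills the cache
    print(logs.filter(logs.value.contains("ERROR")).count())  # served from memory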

Speed isn’t the only advantage. Spark lets developers and analysts use a broader range of programming languages than Hadoop does; Python, Scala, and Java are all supported. Spark can also handle SQL queries and can function as a “column-oriented database.”
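
As a quick example of the SQL side (with made-up data), a DataFrame can be registered as a view and queried with ordinary SQL, and writing it to Parquet stores it column by column, the kind of columnar layout behind that “column-oriented” behavior.

    # Sketch of Spark SQL with a small, made-up table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)], ["name", "age"])
    people.createOrReplaceTempView("people")

    # Standard SQL against the registered view.
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    # Parquet stores the data column by column, supporting columnar analytics.
    people.write.mode("overwrite").parquet("/tmp/people.parquet")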

Spark started out in UC Berkeley’s AMPLab and was the subject of Matei Zaharia’s 2014 PhD dissertation, “An Architecture for Fast and General Data Processing on Large Clusters.”