Run your R (SparklyR) workloads at scale with Spark-on-Kubernetes

Tutorial: How to build the right Docker image, start your Spark session, and run at scale!

Jean Yves
Towards Data Science


R is a programming language for statistical computing that is widely used by statisticians and data scientists. Running R applications on a single machine was sufficient for a long time, but it becomes a limiting factor as data volumes grow and analyses get more advanced.

That’s why the R community developed sparklyr, which scales data engineering, data science, and machine learning with Apache Spark. It supports the main Apache Spark use cases (batch, streaming, SQL, ML, and graph processing) and integrates with well-known R packages such as dplyr, DBI, and broom. More information can be found on sparklyr.ai.

How SparklyR builds on top of Spark (Source: sparklyr.ai, Reposted under Apache License 2.0 authorizing commercial use)

The problem is that the integration between sparklyr and Apache Spark is brittle: it’s hard to get the right mix of libraries and environment setup. One of our customers tried to get this to work on EMR and described it as “a nightmare”. In contrast, by building their own Docker image and running it on our Spark-on-Kubernetes platform, they were able to make their sparklyr setup work reliably.

So let’s see how to get your sparklyr applications running at scale using Spark-on-Kubernetes! All of the code for this tutorial is available in this GitHub repository.

Requirements

You must configure a Docker image. This is the most difficult part, but we did it for you! The following Dockerfile uses one of our published images as a base (see this blog post and our dockerhub repository for more details on these images).

Code by Author. Full Public Repo.
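The full Dockerfile lives in the public repo linked above. The sketch below only illustrates its general shape: the base-image tag, the package list, and the file path are placeholder assumptions, so check the repo and our dockerhub page for the exact values.

```dockerfile
# Base image: one of the published Data Mechanics Spark images
# (the tag below is a placeholder; pick a real one from dockerhub)
FROM datamechanics/spark:3.2.1-latest

USER root

# Install R itself
RUN apt-get update && \
    apt-get install -y --no-install-recommends r-base r-base-dev && \
    rm -rf /var/lib/apt/lists/*

# Install littler (and docopt) to get the install2.r helper script,
# then install sparklyr plus the R packages your application needs.
# This is the section to tune for your own dependencies.
RUN Rscript -e "install.packages(c('littler', 'docopt'), repos = 'https://cloud.r-project.org')" && \
    ln -s /usr/local/lib/R/site-library/littler/examples/install2.r /usr/local/bin/install2.r
RUN install2.r --error sparklyr tidyverse arrow

# Copy the application code into the image
COPY RExamples.R /opt/application/RExamples.R
```

Once the image is built, push it to a registry your Kubernetes cluster can pull from (docker build followed by docker push).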

You can tune the list of packages in the RUN install2.r section. The tidyverse includes many well-known packages such as dplyr and ggplot2. Once your image is built and available in your registry, it contains all your dependencies and takes only a few seconds to load when you run your applications.

Develop your SparklyR application

We will show you a few code samples to start with. You can find more examples in the sparklyr GitHub repo. There are two critical topics:

  • Creating the Spark Session
  • Understanding whether your R object is an interface to a Spark DataFrame or to a local R dataset.

Experienced sparklyr developers can look at the Spark session creation and then skip directly to the “Run your Spark applications at scale” section.

Create the Spark Session

Code by Author. Full Public Repo.
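The session-creation code from the repo boils down to a spark_connect call. Here is a minimal sketch; the master URL, app name, and resource settings are illustrative assumptions, and on the Data Mechanics platform most of this configuration is injected at submission time.

```r
library(sparklyr)

# Spark configuration: when the application is submitted through the
# platform (or the Spark operator), most cluster settings are injected
# at submission time; the values below are illustrative only.
conf <- spark_config()
conf$spark.executor.instances <- 2
conf$spark.executor.memory    <- "4g"

# Connect to Spark. Inside a Kubernetes pod the API server is usually
# reachable at kubernetes.default.svc; SPARK_HOME is expected to be set
# by the base Docker image.
sc <- spark_connect(
  master   = "k8s://https://kubernetes.default.svc:443",
  config   = conf,
  app_name = "sparklyr-example"
)
```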

How to manipulate your R Objects and Spark Dataframes

There’s no better way to learn than going through a code example (see below)! Here are the main things to pay attention to:

  • The sparklyr copy_to function returns a reference to the generated Spark DataFrame as a tbl_spark. The returned object will act as a dplyr-compatible interface to the underlying Spark table (see docs).
  • You can apply an R function to a Spark DataFrame by using spark_apply.
  • You can cache (or uncache) Spark DataFrames explicitly by calling tbl_cache (or tbl_uncache).
  • You can query Spark tables with SQL using dbGetQuery.
  • You can read (or write) Parquet tables with spark_read_parquet (or spark_write_parquet).
  • You can plot your R objects using the ggplot2 package.
  • Don’t forget to close your Spark session at the end of your application code with spark_disconnect. You can still run R code after this line, but you won’t be able to run any distributed commands with Spark.
Code by Author. Full Public Repo.
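The full example is in the repo. The sketch below, which reuses the sc connection from the previous step and the built-in iris dataset, illustrates each of the points above; the table name, paths, and plot file are illustrative.

```r
library(sparklyr)
library(dplyr)
library(DBI)
library(ggplot2)

# copy_to returns a tbl_spark: a dplyr-compatible reference to a Spark table
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and run on the cluster
summary_tbl <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE))

# spark_apply runs an arbitrary R function on each partition
# (R is available on the executors since they run the same Docker image)
scaled_tbl <- spark_apply(iris_tbl, function(df) {
  df$Petal_Length <- df$Petal_Length * 10
  df
})

# Explicitly cache / uncache the Spark table
tbl_cache(sc, "iris")
tbl_uncache(sc, "iris")

# Query Spark tables with plain SQL through the DBI interface
top_rows <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")

# Write and read Parquet
spark_write_parquet(iris_tbl, "/tmp/iris_parquet", mode = "overwrite")
iris_parquet <- spark_read_parquet(sc, "iris_parquet", "/tmp/iris_parquet")

# Collect results back to local R and plot them with ggplot2 (on the driver)
summary_df <- collect(summary_tbl)
p <- ggplot(summary_df, aes(x = Species, y = mean_petal_length)) + geom_col()
ggsave("petal_length_by_species.png", p)

# Close the Spark session at the end of the application
spark_disconnect(sc)
```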

Run your Spark applications at scale

You must first define a Data Mechanics configuration through a template or a configOverride. For open-source Spark-on-Kubernetes users, this configuration is easy to adapt, particularly if you use the open-source Spark-on-Kubernetes operator project.

Example Data Mechanics configuration (JSON). Source: Author.
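The Data Mechanics JSON payload itself is in the repo. For open-source users, a roughly equivalent SparkApplication manifest for the Spark-on-Kubernetes operator could look like the sketch below; the image name, namespace, service account, file path, and resource values are placeholders.

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: sparklyr-example
  namespace: spark
spec:
  type: R
  mode: cluster
  # The image built earlier, pushed to a registry the cluster can pull from
  image: "<your-registry>/spark-sparklyr:latest"
  imagePullPolicy: IfNotPresent
  # The R script copied into the image at build time (local path inside the image)
  mainApplicationFile: "local:///opt/application/RExamples.R"
  sparkVersion: "3.2.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "2g"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 2
    memory: "4g"
```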

The code to be executed is in the file RExamples.R, which was copied into the Docker image at build time; this is why mainApplicationFile points to a local path inside the image.

You can then monitor and optimize your Spark applications with Delight, an open-source monitoring UI for Spark that works on top of any Spark platform (commercial or open-source, in the cloud or on-premise, etc.).

Screenshot from the Delight UI. Source: Author.

Special thanks go to our customers running SparklyR workloads on the Data Mechanics platform for sharing their tricks and setup. We hope this tutorial will help you be successful with Spark and R!

Originally published at https://www.datamechanics.co.


Co-Founder @ Data Mechanics, the Cloud-Native Spark Platform. Senior Product Manager @ Spot.io, building Ocean for Spark. Former software engineer @ Databricks.