The Story of a Migration from EMR to Spark on Kubernetes

Jean Yves
Towards Data Science
5 min read · Apr 27, 2021


In this article, the co-founder of Lingk tells the story of their migration from EMR to the Spark-on-Kubernetes platform managed by Data Mechanics: their goals, the architecture of the solution, the challenges they had to address, and the results they obtained.

Goals of this migration

Lingk.io is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. In a few clicks from their visual interface, their customers can load, deduplicate, and enrich data from dozens of sources.

Under the hood, Lingk used AWS EMR (Elastic MapReduce) to power their product. But they were facing a few issues:

  • EMR required too much infrastructure management from their DevOps team, which had limited Spark experience: picking the right cluster instance types, memory settings, Spark configurations, and so on.
  • Their total AWS costs were high. They suspected that EMR’s autoscaling policies were inefficient and that a lot of compute resources were wasted.
  • Spark apps took 40 seconds to start on average, a long wait for Lingk’s end users, particularly when building a new data pipeline or integration.
  • The core Spark application was stuck at an earlier version because upgrading Spark to 3.0+ caused unexplained performance regressions.

They decided to migrate to Spark on Kubernetes, with the help of Data Mechanics. They had 3 goals with this migration:

  1. Reduce their total infrastructure costs
  2. Streamline their data team operational work
  3. Improve the end-user experience of the Lingk platform

* Data Mechanics is a cloud-native Spark platform, available on AWS, GCP, and Azure. Read more about their service and how it builds upon the open-source Spark-on-Kubernetes project.

Target architecture after the migration

Lingk’s new data platform is built upon a long-running managed Kubernetes cluster (EKS), deployed inside their AWS account. Data Mechanics manages this cluster, from initial provisioning to maintenance over time (keeping it up to date with the latest Spark and Kubernetes releases), and automatically scales it on demand based on the load.

Lingk +Data Mechanics Architecture. Image by Author.

Additional integrations, such as Jupyter notebooks and Airflow, are also possible, but in this case Lingk simply triggers Spark jobs through a REST API exposed by the Data Mechanics gateway. Lingk’s team no longer needs to manage EMR clusters; they just submit Dockerized Spark apps through the API and enjoy a serverless experience.
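
For illustration, here is a minimal sketch of what such a REST submission could look like from Python. The gateway URL, endpoint path, authentication header, and payload fields are placeholders invented for this sketch, not the actual Data Mechanics API.

```python
import requests

# Hypothetical gateway URL and API token (placeholders, not the real Data Mechanics API).
GATEWAY_URL = "https://datamechanics.example.com/api/apps"
API_TOKEN = "REDACTED"

# Describe the Spark application: the Docker image carries the Spark
# distribution and all dependencies, so the payload stays small.
payload = {
    "appName": "lingk-dedup-pipeline",                          # hypothetical app name
    "image": "myregistry.example.com/lingk/spark-app:1.4.2",    # hypothetical image
    "mainApplicationFile": "local:///opt/app/main.py",
    "arguments": ["--job", "deduplicate", "--date", "2021-04-27"],
}

# Submit the job through the gateway; the platform schedules it on Kubernetes.
response = requests.post(
    GATEWAY_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print("Submitted application:", response.json())
```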

The team has control over the Docker images used by Spark, which brings 3 additional benefits:

  1. Applications start more quickly — as all dependencies are baked in the Docker image.
  2. The CI/CD flow is simpler — A Docker image is built automatically when a PR is merged.
  3. The Docker image includes the Spark distribution itself (there is no global Spark version), which means all applications can efficiently run on the same cluster, and it was easy to gradually upgrade to Spark 3.0.

It was easy to get started with the Docker-based development workflow and to package the required dependencies, using one of the public Spark-on-Kubernetes Docker images provided by Data Mechanics as a base.

Challenges addressed during the migration

The main technical challenge of the migration was to stop using HDFS for intermediate storage and use S3 instead, as HDFS is hard to set up and maintain on Kubernetes. This required some application code changes, but the final architecture is more robust: compute resources are now fully separated from storage resources, allowing the cluster to scale down almost entirely when it’s unused.
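
As a rough illustration, here is a minimal PySpark sketch of writing intermediate results to S3 rather than HDFS. The bucket names, paths, and column name are placeholders, and it assumes the S3A connector (hadoop-aws) is baked into the Docker image and that credentials come from the environment (for example, a pod or instance IAM role).

```python
from pyspark.sql import SparkSession

# Assumes hadoop-aws and the AWS SDK are baked into the Docker image,
# with credentials resolved by the default provider chain (e.g. an IAM role).
spark = (
    SparkSession.builder
    .appName("lingk-intermediate-storage-on-s3")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

# Placeholder source bucket and schema.
df = spark.read.json("s3a://example-lingk-raw/source/")

# Intermediate results go to S3 rather than HDFS, so compute nodes stay stateless
# and the cluster can scale down almost entirely between runs.
df.dropDuplicates(["record_id"]).write.mode("overwrite").parquet(
    "s3a://example-lingk-intermediate/dedup/"
)
```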

A few performance optimizations were also critical:

  1. Tuning container sizes to maximize bin-packing of containers on instances. Small containers were used for most of their applications (which are short and handle small data volumes), and bigger containers for the tail of longer applications. These settings are automatically tuned by the Data Mechanics platform (a configuration sketch illustrating points 1 to 3 follows this list).
  2. Tuning the default number of partitions to guarantee optimal parallelism within Spark, as many Spark jobs suffered from too many small partitions (visible in the Spark UI, where the average task duration was under 50ms).
  3. Enabling dynamic allocation (which works for Spark on Kubernetes since Spark 3.0), to speed up long-running pipelines (by a factor of 5x for the 99th-percentile longest apps!) by letting them request more executors.
  4. Enabling a small amount of over-provisioning, to make sure the cluster always has spare capacity for apps to get started.

Example node pool configuration illustrating some of the optimizations we implemented. Image by Author.
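
For concreteness, here is a minimal sketch of how points 1 to 3 can translate into per-application Spark configuration. The values are illustrative only: on the Data Mechanics platform container sizing is tuned automatically, and over-provisioning (point 4) is handled at the cluster and node-pool level rather than in the application’s configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lingk-pipeline-tuned")
    # 1. Container sizing: small executors bin-pack well on the node pool
    #    used for the many short, small-data applications (values illustrative).
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "3g")
    # 2. Default parallelism: fewer, larger partitions avoid the sub-50ms tasks
    #    seen in the Spark UI (the right value depends on data volume).
    .config("spark.sql.shuffle.partitions", "64")
    # 3. Dynamic allocation on Kubernetes (Spark 3.0+): shuffle tracking replaces
    #    the external shuffle service, letting long pipelines request more executors.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)
```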

These optimizations are detailed in the technical guide: How to Set up, Manage & Monitor Spark on Kubernetes (with code examples).

Results achieved by the migration

Results obtained in the EMR => Spark-on-K8S Migration. Image by Author.

The migration from EMR to Spark-on-Kubernetes was a big win:

  • In terms of end-user experience, the Spark application startup time was halved, and the average app duration decreased by 40%.
  • In terms of costs, AWS costs were reduced by over 65%, and the total cost of ownership for Lingk (including the Data Mechanics management fee) by 33%. These savings came from the cost-efficiency of running all Spark applications on a single Kubernetes cluster and from the performance optimizations described in the previous section.

Lingk was also able to upgrade Spark to 3.0 (gradually, thanks to the dockerization of their Spark distribution), which enabled new user-facing features in their platform.

“Leveraging Data Mechanics Spark expertise and platform decreases cost while letting us sleep well at night and achieve the plans we dream about.”
Dale McCrory, Co-Founder & Chief Product Officer at Lingk.

Originally published on the Data Mechanics Blog.
