Delight is a free, hosted, cross-platform monitoring dashboard for Apache Spark with memory and CPU metrics that will hopefully delight you!

A year ago, we released a widely shared blog post on TowardsDataScience: We’re building a better Spark UI!

Today, after a lot of engineering work, we’re proud to finally release Delight, our free, hosted, cross-platform monitoring dashboard for Apache Spark.

It works on top of Databricks, EMR, Dataproc, HDInsight, HDP/CDP, and open-source platforms too (anything spark-submit, or using the spark-on-kubernetes operator, or using Apache Livy).
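For instance, on a plain spark-submit setup, Delight hooks in as a SparkListener shipped as a Maven package. The coordinates and configuration keys below follow the project's README at the time of writing, so treat them as illustrative and check the Delight GitHub page for current values:

```shell
# Attach the Delight listener to any spark-submit invocation
# (package coordinates, token key, and app jar name are illustrative)
spark-submit \
  --repositories https://oss.sonatype.org/content/repositories/snapshots \
  --packages co.datamechanics:delight_2.12:latest-SNAPSHOT \
  --conf spark.delight.accessToken.secret=<your-access-token> \
  --conf spark.extraListeners=co.datamechanics.delight.DelightListener \
  my-app.jar
```

The same two `--conf` flags can be set in `spark-defaults.conf` or through your platform's Spark configuration mechanism, which is what makes the approach cross-platform.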

Delight helps you understand and improve the performance of your Spark applications. It provides:

  • Spark-centric CPU & memory metrics that we hope will delight you!
  • The Spark UI — so you don’t need to run…

If you’re looking for a high-level introduction about Spark on Kubernetes, check out The Pros And Cons of Running Spark on Kubernetes, and if you’re looking for a deeper technical dive, then read our guide Setting Up, Managing & Monitoring Spark on Kubernetes.

Data Mechanics is a managed Spark platform deployed on a Kubernetes cluster inside our customers’ cloud account, available on AWS, GCP, and Azure. So our entire company is built on top of Spark on Kubernetes, and we are often asked how we’re different from simply running Spark on Kubernetes open-source.

The short answer is that our platform…

Get started and work with all the common data sources supported by Spark.

Our optimized Docker images for Apache Spark are now freely available on our DockerHub repository, whether you’re a Data Mechanics customer or not.

This is the result of a lot of work from our engineering team:

  • We built a fleet of Docker Images combining various versions of Spark, Python, Scala, Java, Hadoop, and all the popular data connectors.
  • We automatically tested them across various workloads, to ensure the included dependencies work together — in other words, to save you from dependency hell 😄
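As a sketch of how you would use one of these images — the exact tag below is illustrative; check the DockerHub repository for the published list of Spark/Hadoop/Python combinations:

```shell
# Pull an image bundling a specific Spark / Hadoop / Java / Scala / Python combination
# (tag is illustrative; see the datamechanics/spark repository on DockerHub)
docker pull datamechanics/spark:3.1.1-hadoop-3.2.0-java-8-scala-2.12-python-3.8-latest

# Typically you would then extend it with your own dependencies, e.g. in a Dockerfile:
#   FROM datamechanics/spark:<tag>
#   RUN pip install <your-packages>
```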

Our philosophy is to provide high quality Docker images that come “with batteries included”, meaning you will be…

In this article, the co-founder of Lingk tells the story of their migration from EMR to the Spark-on-Kubernetes platform managed by Data Mechanics: their goals, the architecture of the solution & challenges they had to address, and the results they obtained.

Goals of this migration

Lingk is a data loading, data pipelines, and integration platform built on top of Apache Spark, serving commercial customers, with expertise in the education sector. In a few clicks from their visual interface, their customers can load, deduplicate, and enrich data from dozens of sources.

Under the hood, Lingk used AWS EMR (ElasticMapReduce) to power their product. …

With the Apache Spark 3.1 release in March 2021, the Spark on Kubernetes project is now officially declared production-ready and Generally Available. This is the achievement of 3 years of fast-growing community contributions and adoption of the project — since initial support for Spark-on-Kubernetes was added in Spark 2.3 (February 2018). In this article, we will go over the main features of Spark 3.1, with a special focus on the improvements to Spark-on-Kubernetes.

Related Resources:

Recent developments with Spark 3.0, Spark-on-Kubernetes going GA, PySpark usability improvements, and more.

Source: Unsplash. Not a photo of the actual conference, which was entirely online this year.

November 20th, 2020: I just attended the first edition of the Data + AI Summit — the new name of the Spark Summit conference organized twice a year by Databricks. This was the European edition, meaning the talks took place in a European-friendly time zone. In reality it drew participants from everywhere, as the conference was virtual (and free) because of the pandemic.

The conference featured 125 pre-recorded video talks aired at a specific time (as if they were live), with the speakers available to answer questions in real time on a “Live Chat”. …

Introducing Delight, an open-source project providing a free & cross-platform monitoring dashboard for Apache Spark

For “live” Spark applications, accessing the Spark UI — the open-source monitoring interface of Apache Spark — is easy, as it is served directly by the Spark driver on port 4040.
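On Spark-on-Kubernetes, for example, you can reach that port by forwarding it from the driver pod to your machine while the application runs (the pod name below is illustrative):

```shell
# Forward the driver's UI port to localhost while the application is running
# (replace my-spark-driver-pod with your actual driver pod name)
kubectl port-forward my-spark-driver-pod 4040:4040

# Then browse the live Spark UI at:
#   http://localhost:4040
```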

Spark UI Screenshot. Image by Author.

But once an application has completed, accessing the Spark UI requires setting up a Spark History Server, which takes a bit of work. In this article, we’re going to show you how to access the Spark UI for completed Spark applications with very little effort by leveraging a free open-source monitoring project called Data Mechanics Delight (see the github page).

Update(April 2021): Delight has now been officially released!

How to install our free, open-source, Spark UI & Spark History Server on top of any Spark platform

Accessing the Spark UI for live Spark applications is easy, but accessing it for terminated applications requires persisting logs to cloud storage and running a server called the Spark History Server. While some commercial Spark platforms provide this automatically (e.g. Databricks, EMR), for many Spark users (e.g. on Dataproc or Spark-on-Kubernetes) getting this requires a bit of work.
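Concretely, the manual setup involves two pieces of Spark configuration: event-log persistence on the application side, and a History Server reading from the same location. A minimal sketch (the bucket path and app jar name are illustrative):

```shell
# 1. Have each application persist its event logs to durable storage
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=s3a://my-bucket/spark-events \
  my-app.jar

# 2. Run a Spark History Server pointed at the same location
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=s3a://my-bucket/spark-events"
$SPARK_HOME/sbin/start-history-server.sh
# The History Server UI is then served on port 18080 by default
```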

Today, we’re releasing a free, hosted, partly open-sourced Spark UI and Spark History Server that work on top of any Spark platform (whether it’s on-premise or in the cloud, over Kubernetes or over YARN, using a commercial platform…

How our startup Data Mechanics (YCombinator, S19) builds on top of the open-source project

Note: If you’re looking for an introduction to Spark on Kubernetes — what it is, what its architecture looks like, why it is beneficial — start with The Pros And Cons of Running Spark on Kubernetes. For a one-liner introduction, let’s just say that Spark’s native integration with Kubernetes (instead of Hadoop YARN) generates a lot of interest from the community and is about to be declared Generally Available and Production-Ready with Spark 3.1.

I’m an ex-Databricks engineer, now co-founder of Data Mechanics, a managed Spark platform deployed on a Kubernetes cluster inside our customers’ cloud account (AWS, Azure, or GCP). …

Spark & Docker Development Iteration Cycle. Image by author.

The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere — locally, in dev / testing / prod environments, across all cloud providers, and on-premise — in a repeatable way.

The software engineering world has fully adopted Docker, and a lot of tools around Docker have changed the way we build and deploy software — testing, CI/CD, dependency management, versioning, monitoring, security. The popularity of Kubernetes as the new standard for container orchestration and infrastructure management follows from the popularity of Docker.

In the big data…

Jean Yves

Co-Founder @Data Mechanics, The Cloud-Native Spark Platform For Data Engs. We make Spark more dev-friendly & cost-effective! Former software eng @Databricks.
