If you’re looking for a high-level introduction about Spark on Kubernetes, check out The Pros And Cons of Running Spark on Kubernetes, and if you’re looking for a deeper technical dive, then read our guide Setting Up, Managing & Monitoring Spark on Kubernetes.
Data Mechanics is a managed Spark platform deployed on a Kubernetes cluster inside our customers’ cloud account, available on AWS, GCP, and Azure. So our entire company is built on top of Spark on Kubernetes, and we are often asked how we’re different from simply running Spark on Kubernetes open-source.
The short answer is that our platform implements many features which make Spark on Kubernetes much more easy-to-use and cost-effective. By taking care of the setup and the maintenance, our goal is to let you focus on as well as accelerate its adoption and save you a lot of maintenance work. Our goal is to accelerate your data engineering projects by making Spark as simple, flexible and performant as it should be. …
November 20th, 2020: I just attended the first edition of the Data + AI Summit — the new name of the Spark Summit conference organized twice a year by Databricks. This was the European edition, meaning the talks took place at a European-friendly time zone. In reality it drew participants from everywhere, as the conference was virtual (and free) because of the epidemic.
But once an application has completed, accessing the Spark UI requires setting up a Spark History Server, which takes a bit of work. In this article, we’re going to show you how to access the Spark UI for completed Spark applications with very little effort by leveraging a free open-source monitoring project called Data Mechanics Delight (see the github page).
Accessing the Spark UI for live Spark applications is easy, but having access to it for terminated applications requires persisting logs to a cloud storage and running a server called the Spark History Server. While some commercial Spark platform provides this automatically (e.g. Databricks, EMR) ; for many Spark users (e.g. on Dataproc, Spark-on-Kubernetes) getting this requires a bit of work.
Today, we’re releasing a free, hosted, partly open sourced, Spark UI and Spark History Server that work on top of any Spark platform (whether it’s on-premise or in the cloud, over Kubernetes or over YARN, using a commercial platform or not). …
Note: If you’re looking for an introduction to Spark on Kubernetes — what is it, what’s its architecture, why is it beneficial — start with The Pros And Cons of Running Spark on Kubernetes. For a one-liner introduction, let’s just say that Spark native integration with Kubernetes (instead of Hadoop YARN) generates a lot of interest from the community and is about to be declared Generally Available and Production-Ready with Spark 3.1.
The benefits that come with using Docker containers are well known: they provide consistent and isolated environments so that applications can be deployed anywhere — locally, in dev / testing / prod environments, across all cloud providers, and on-premise — in a repeatable way.
The software engineering world has fully adopted Docker, and a lot of tools around Docker have changed the way we build and deploy software — testing, CI/CD, dependency management, versioning, monitoring, security. The popularity of Kubernetes as the new standard for container orchestration and infrastructure management follows from the popularity of Docker.
In the big data and Apache Spark scene, most applications still run directly on virtual machines without benefiting from containerization. The dominant cluster manager for Spark, Hadoop YARN, did not support Docker containers until recently (Hadoop 3.1 release), and even today the support for Docker remains “experimental and non-complete”. …
Earlier this year at Spark + AI Summit, we had the pleasure of presenting our session on the best practices and pitfalls of running Apache Spark on Kubernetes (K8s).
In this post we’d like to expand on that presentation and talk to you about:
If you’re already familiar with k8s and why Spark on Kubernetes might be a fit for you, feel free to skip the first couple of sections and get straight to the meat of the post! …
Our mission at Data Mechanics is to give data engineers and data scientists the ability to build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.
So, we built a serverless Spark platform, a more easy-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera, and Hortonworks.
In this video, we will give you a product tour of our platform and some of its core features:
Apache Spark is an open-sourced distributed computing framework, but it doesn’t manage the cluster of machines it runs on. You need a cluster manager (also called a scheduler) for that. The most commonly used one is Apache Hadoop YARN. Support for running Spark on Kubernetes was added with version 2.3, and Spark-on-k8s adoption has been accelerating ever since.
If you’re curious about the core notions of Spark-on-Kubernetes, the differences with Yarn as well as the benefits and drawbacks, read our previous article: The Pros And Cons of Running Spark on Kubernetes. …
The Apache Spark UI, the open source monitoring tool shipped with Apache® Spark is the main interface Spark developers use to understand their application performance. And yet, it generates a LOT of frustrations. We keep hearing it over and over, from Apache Spark beginners and experts alike:
So our team has set out to replace the Spark UI and Spark History Server with something delightful, a free, cross-platform and partially open-source tool called Data Mechanics Delight. …