Publications

High Performance Spark is now available for purchase

High Performance Spark

Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

A three-part post on using the YARN API to set optimal Spark settings:

How to Use the YARN API to Determine Resources Available for Spark Application Submission (Part 1)

At Alpine we continue to deliver new enterprise analytic features within Chorus. With Chorus 6.1 we launched the ability to deliver sophisticated auto-tuning for Spark jobs. Chorus automatically determines the settings needed to launch a Spark Application by using information on the size of the data

How to Use the YARN API to Determine Resources Available for Spark Application Submission (Part 2)

Welcome to the second part of this introduction to the YARN API. Today I will walk through the resources available for a job and how to find your queue.

How Many Total Resources Are Available for My Job?

How to Use the YARN API to Determine Resources Available for Spark Application Submission (Part 3)

Welcome to the final section of this introduction to the YARN API.

Capacity Scheduler

The capacity scheduler is pretty straightforward. It assigns each user a percentage of the parent queue that they are allowed to use. The queue object corresponding to your user will have an “absoluteCapac