Spark is a rising star in the big data world, often outperforming Hadoop MapReduce by an order of magnitude or more — and recently surpassing Hadoop in adoption as well. It is also more intuitive to use, because you don’t have to force every problem into the confines of map and reduce operations. At the same time, Spark remains compatible with the Hadoop Distributed File System (HDFS) and with Hadoop’s cluster management and deployment tools. In this workshop you will learn how to use Spark for one-off data analysis and investigations, how to build and submit Spark jobs, and how to use higher-level libraries such as Spark SQL. We will also cover the basics of the Scala programming language required for working with Spark.
Spark is one of those things you can’t learn without a lot of hands-on work: it involves new concepts, probably a new language (Scala or Python), and a slightly different mindset in which you carefully deconstruct operations so that they can be performed efficiently in a distributed manner. This is why the workshop is accompanied by multiple hands-on labs: using basic Spark transformations and actions, parsing logs and data files, compiling and submitting Spark programs, and more. Overall, expect to spend about 50% of the time building Spark programs and discussing your work with the group.
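To give a flavor of the "deconstruct operations" mindset the labs practice, here is a plain-Python sketch (not actual Spark code) of how the classic word count decomposes into the flatMap / map / reduceByKey steps you would chain on a Spark RDD. The sample input lines are invented for illustration; the step names mirror the real RDD API so you can compare:

```python
# A plain-Python sketch of the Spark word-count decomposition.
# Each step imitates one RDD transformation or action by name.
from collections import defaultdict

lines = ["spark is fast", "spark is fun"]  # stand-in for a distributed dataset

# flatMap: split each line into words (one input record -> many output records)
words = [w for line in lines for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
# (Spark reduces within each partition first, then merges across the cluster)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'spark': 2, 'is': 2, 'fast': 1, 'fun': 1}
```

The point of the decomposition is that each step is independent per record (or per key), which is what lets Spark parallelize it across a cluster; in real Spark code the same pipeline is a one-liner of chained transformations on an RDD or DataFrame.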