Course Duration
- 3 days
Course Benefits
- Understand the need for Spark in data processing
- Understand the Spark architecture and how it distributes computations to cluster nodes
- Be familiar with basic installation / setup / layout of Spark
- Use the Spark shell for interactive and ad-hoc operations
- Understand RDDs (Resilient Distributed Datasets), and data partitioning, pipelining, and computations
- Understand and use RDD operations such as map() and filter()
- Understand and use Spark SQL and the DataFrame API
- Understand DataFrame capabilities, including the Catalyst query optimizer and Tungsten memory/CPU optimizations
- Be familiar with performance issues and use DataFrames and Spark SQL for efficient computations
- Understand Spark's data caching and use it for efficient reuse of data
- Write/run standalone Spark programs with the Spark API
- Use Spark Structured Streaming to process streaming (real-time) data
- Ingest streaming data from Kafka and process it with Spark Structured Streaming
- Understand performance implications and optimizations when using Spark
Public expert-led online training from the convenience of your home, office, or anywhere with an internet connection. Guaranteed to run.
Private classes are delivered for groups at your offices or a location of your choice.
Course Outline
- Introduction to Spark
- Overview, Motivations, Spark Systems
- Spark Ecosystem
- Spark vs. Hadoop
- Acquiring and Installing Spark
- The Spark Shell, SparkContext
- RDDs and Spark Architecture
- RDD Concepts, Lifecycle, Lazy Evaluation
- RDD Partitioning and Transformations
- Working with RDDs - Creating and Transforming (map, filter, etc.)
- Spark SQL, DataFrames, and DataSets
- Overview
- SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text ...)
- Introducing DataFrames (Creation and Schema Inference)
- Supported Data Formats (JSON, Text, CSV, Parquet)
- Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
- SQL-based Queries
- Mapping and Splitting (flatMap(), explode(), and split())
- DataFrames vs. RDDs
- Shuffling Transformations and Performance
- Grouping, Reducing, Joining
- Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
- Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
- The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
- Performance Tuning
- Caching - Concepts, Storage Type, Guidelines
- Minimizing Shuffling for Increased Performance
- Using Broadcast Variables and Accumulators
- General Performance Guidelines
- Creating Standalone Applications
- Core API, SparkSession.Builder
- Configuring and Creating a SparkSession
- Building and Running Applications - sbt/build.sbt and spark-submit
- Application Lifecycle (Driver, Executors, and Tasks)
- Cluster Managers (Standalone, YARN, Mesos)
- Logging and Debugging
- Spark Streaming
- Introduction and Streaming Basics
- Streaming Introduction
- Structured Streaming (Spark 2+)
- Continuous Applications
- Table Paradigm, Result Table
- Steps for Structured Streaming
- Sources and Sinks
- Consuming Kafka Data
- Kafka Overview
- Structured Streaming - "kafka" format
- Processing the Stream
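The Kafka module can be sketched as below. This is a non-runnable illustration: the broker address and topic name are placeholders, and it requires a reachable Kafka cluster plus the spark-sql-kafka connector package on the classpath.

```python
# Sketch only: consuming a Kafka topic via the "kafka" source format.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Source: each Kafka record arrives with binary key/value columns plus
# topic/partition/offset/timestamp metadata.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .load()
)

# Processing: treat the unbounded stream as a table and query it.
counts = (
    events.select(F.col("value").cast("string").alias("line"))
    .select(F.explode(F.split("line", " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Sink: continuously emit the updated result table to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```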
Class Materials
Each student will receive a comprehensive set of materials, including course notes and all the class examples.
Experience in the following is required for this Spark class:
- Working knowledge of a programming language; no Python experience is necessary.
Instructor-led courses are offered via a live Web connection, at client sites throughout Europe, and at our Geneva Training Center.