Duration
3 days
Course Outline
Getting Started with Spark
- Download Spark
- Install Spark
- Spark Languages
- Using the Spark Shell
Introduction to Scala
- Functional Programming
- Object-Oriented Programming
- Features of Scala
- Programming with Scala
- Classes, Case Classes, and Traits
Spark Core Concepts
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
- Working with RDDs
- RDD Operations
- Key-Value Pair RDDs
- Pair RDD Operations
- Loading Data Files into Spark
- Saving Files
- Data Partitioning
Running Spark on a Cluster
- A Spark Standalone Cluster
- The Spark Standalone Web UI
- Spark on a Hadoop Cluster
- Scheduling
Parallel Programming with Spark
- RDD Partitions
- HDFS Data Locality
- Executing Parallel Operations
Writing Spark Applications
- Building a Spark Application Using SBT
- Building a Spark Application Using Maven
- IDE Setup
- Spark Applications vs. Spark Shell
- Creating the SparkContext
- Configuring Spark Properties
- Building and Running a Spark Application
- Deploying an Application on a Cluster
- Logging
Caching and Persistence
- RDD Lineage
- Caching Overview
- Distributed Persistence
Spark SQL
- SchemaRDD
- DataFrame and Dataset
- SparkSession
- SQL Operations
Spark Streaming
- Spark Streaming Overview
- Example: Streaming Word Count
- Other Streaming Operations
- Sliding Window Operations
- Developing Spark Streaming Applications
Advanced Spark Features
- Spark Performance
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
Common Performance Issues
- Concurrency Limitations
- Security Features
- Memory Usage and Garbage Collection
- Serialization
Test & Evaluation
Each lecture has a quiz made up of multiple-choice questions. The course also has a final test, likewise multiple-choice.
Your evaluation is based on your scores in the lecture quizzes and the final test.