Duration
30 hours (15 sessions of 2 hours each)
Course Outline
Introduction to Big Data and Hadoop
- What is Big Data?
- Types of Data
- Need for Big Data
- Characteristics of Big Data
- Traditional IT Analytics Approach
- Big Data Use Cases
- Handling Limitations of Big Data
- Introduction to Hadoop
- History and Milestones of Hadoop
Getting Started with Hadoop
- Introduction to VirtualBox / VMware Player
- Installing VirtualBox / VMware Player
- Setting up the Virtual Environment
- Installation of Cloudera VM
Hadoop Architecture
- Hadoop Cluster on commodity hardware
- Hadoop core services and components
- Regular file system vs. Hadoop
- HDFS layer
- HDFS Operations
MapReduce
- Introduction to MapReduce
- Hadoop MapReduce example
- Hadoop MapReduce Characteristics
- Setting up your MapReduce Environment
- Building a MapReduce Program
- MapReduce Requirements and Features
- Data Types
- MapReduce Java Programming in Eclipse
- Checking Hadoop Environment for MapReduce
YARN
- What is YARN?
- Why YARN Is Needed
- YARN Architecture
Pig
- Background
- Pig Architecture
- Data Types
- Data Loading and storage
- Data Transformation
- Pig: Syntax and Hands-On Examples Using Pig Scripts
- Hands-On Real-Time Project on Pig
Hive
- Background
- Hive Architecture
- Metastore
- Data Types
- Data Loading and storage
- Data Transformation
- Hive: Syntax and Hands-On Examples Using Hive Scripts
- User-Defined Functions
- Hands-On Real-Time Project on Hive
Sqoop
- Introduction to the data ingestion tool
- Data transfer from RDBMS to HDFS (import)
- Data transfer from HDFS to RDBMS (export)
Getting Started with Spark
- Download Spark
- Install Spark
- Spark Languages
- Using the pyspark Shell
Spark Core Concepts
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
- Working with RDDs
- RDD Operations
- Key-Value Pair RDDs
- Pair RDD Operations
- Load Data File into Spark
- Save Files
- Data Partitioning
Running Spark on a Cluster
- A Spark Standalone Cluster
- The Spark Standalone Web UI
- Spark on Hadoop Cluster
- Spark on Cloud
- Scheduling
Parallel Programming with Spark
- RDD Partitions
- HDFS Data Locality
- Executing Parallel Operations
Caching and Persistence
- RDD Lineage
- Caching Overview
- Distributed Persistence
Spark SQL
- SchemaRDD
- DataFrame and Dataset
- SparkSession
- SQL Operations
Common Performance Issues
- Concurrency Limitation
- Security Features
- Memory Usage and Garbage Collection
- Serialization
Test & Evaluation
Each lecture ends with a quiz made up of multiple-choice questions. There is also a final test, likewise based on multiple-choice questions.
Your evaluation combines your scores on the lecture quizzes and the final test.