Duration:
6 Days
Course Outline
Introduction to Big Data and Hadoop
- What is Big Data?
- Types of Data
- Need for Big Data
- Characteristics of Big Data
- Traditional IT Analytics Approach
- Big Data—Use Cases
- Handling Limitations of Big Data
- Introduction to Hadoop
- History and Milestones of Hadoop
Getting Started with Hadoop
- Virtual Box / VMware Player—Introduction
- Installing Virtual Box / VMware Player
- Setting up the Virtual Environment
- Installation of Hadoop VM
Hadoop Architecture
- Hadoop Cluster on commodity hardware
- Hadoop core services and components
- Regular file system vs. Hadoop
- HDFS Features
- HDFS operations
MapReduce
- Introduction to MapReduce
- Hadoop MapReduce example
- Hadoop MapReduce Characteristics
- Setting up your MapReduce Environment
- Building a MapReduce Program
- MapReduce Requirements and Features
- Data Types
- MapReduce Java Programming in Eclipse
- Checking Hadoop Environment for MapReduce
YARN
- What is YARN?
- Why is YARN needed?
- YARN Architecture
PIG
- Background
- Pig Architecture
- Data Types
- Data Loading and storage
- Data Transformation
- Pig: Syntax, Examples and Hands-On
- Examples using Pig Scripts
- Hands-On Real-Time Project on Pig
HIVE
- Background
- HIVE Architecture
- Metastore
- Data Types
- Data Loading and storage
- Data Transformation
- Hive: Syntax, Examples and Hands-On
- Examples using Hive Scripts
- User Defined Functions
- Hands-On Real-Time Project on Hive
SQOOP
- Introduction to data ingestion tool
- Data transfer from RDBMS into HDFS, Hive
- Data transfer from HDFS
- Other Operations
Introduction to Python
- Python Programming
- Data Types and Strings
- Flow Constructs
- Functions
- Lists and Dictionaries
- File Input and Output
- Arrays using NumPy
- Plotting using Matplotlib
- DataFrames using Pandas
- Data Analysis
Getting Started with Spark
- Download Spark
- Install Spark
- Spark Languages
- Using the Spark Shell
Spark Core Concepts
- Resilient Distributed Datasets (RDDs)
- Functional Programming with Spark
- Working with RDDs
- RDD Operations
- Key-Value Pair RDDs
- Pair RDD Operations
- Load Data File into Spark
- Save Files
- Data Partitioning
Running Spark on a Cluster
- A Spark Standalone Cluster
- The Spark Standalone Web UI
- Spark on Hadoop Cluster
- Scheduling
Parallel Programming with Spark
- RDD Partitions
- HDFS Data Locality
- Executing Parallel Operations
Caching and Persistence
- RDD Lineage
- Caching Overview
- Distributed Persistence
Spark SQL
- SchemaRDD
- DataFrame and Dataset
- SparkSession
- SQL Operations
Spark MLlib
- What is Machine Learning?
- Supervised Machine Learning
- Unsupervised Machine Learning
- Algorithms used in Machine Learning
- Data Types in MLlib
- Building Machine Learning Applications
Advanced Spark Features
- Spark Performance
- Shared Variables: Broadcast Variables
- Shared Variables: Accumulators
- Common Performance Issues
- Concurrency Limitations
- Security Features
- Memory Usage and Garbage Collection
- Serialization
Spark and the Hadoop Ecosystem
Spark vs. MapReduce Programming
Major Projects
- Project 1: Movie Recommendation
- Project 2: Self-Designed Project
Interview Questions and Quiz Discussion
Test & Evaluation
Each lecture will have a quiz consisting of multiple-choice questions. In addition, there will be a final test, also made up of multiple-choice questions.
Your evaluation will include your scores from each lecture quiz and from the final test.