About the course
This course is designed to help you become a production-ready Data Engineer on Google Cloud. It gives you an in-depth understanding of the GCP services used in big data analysis. As a Data Engineer, you build end-to-end batch and streaming data pipelines; to do so, you must design, build, operationalize, secure, and monitor data processing systems, with particular emphasis on security and compliance, scalability and efficiency, reliability and fidelity, and flexibility and portability. This course covers all of these aspects of GCP's big data services.
This course will also help you prepare for the Google Cloud Professional Data Engineer certification exam.
24 hours (12 Sessions, 2 hours each)
- Building End-to-End Batch and Streaming Data Pipelines
- Exploring Various Data Processing Services on GCP
- Monitoring GCP Services
- Security and Access Control
- Automating Workflows
- Deriving Business Insights Using Google BigQuery
This class is intended for experienced data analysts, business analysts, big data developers, and data engineers.
Test & Evaluation
Each lecture will have a quiz containing a set of multiple-choice questions. Apart from that, there will be a final test based on multiple-choice questions.
Your evaluation will include the overall scores achieved in each lecture quiz and the final test.
Knowledge of Python and SQL programming is a must for the hands-on labs. Familiarity with an ETL tool is an added advantage.
The course includes presentations, demonstrations, and hands-on labs.
Module 1: Introduction
- Introduction to Cloud Computing
- Why Google Cloud Platform
- Google Infrastructure
- Google Cloud Platform (GCP) Services
- Creating a free-tier account on GCP
- Exploring Google Cloud Console
- Using Cloud Shell
Module 2: Data Engineering Introduction
- Explore the role of a data engineer
- Analyze data engineering challenges
- Data Lakes and Data Warehouses
- Transactional Databases vs Data Warehouses
- Manage data access and governance
- Build production-ready pipelines
Module 3: Building a Data Lake using Cloud Storage
- Introduction to Data Lakes
- Building a Data Lake using Cloud Storage
- Cloud Storage Features
- Cloud Storage Classes
- Securing Cloud Storage
- Different transfer options
- Creating a Cloud Storage bucket and uploading files and folders to it
- Creating and accessing Cloud Storage buckets using the CLI
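As a taste of the lab work in this module, the upload step might look like the sketch below, which uses the `google-cloud-storage` client library. The bucket and file names are placeholders; the client import is deferred into the function so the pure URI helper runs anywhere, while the upload itself needs the library and application-default credentials.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI for an object in a bucket."""
    return f"gs://{bucket_name}/{blob_name}"

def upload_file(bucket_name, source_path, blob_name):
    """Upload a local file to Cloud Storage (requires the
    google-cloud-storage package and valid credentials)."""
    from google.cloud import storage  # deferred: only needed for the real upload
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(source_path)
    return gcs_uri(bucket_name, blob_name)

# Example with placeholder names (would run against a real project):
# upload_file("my-data-lake-bucket", "sales.csv", "raw/sales.csv")
print(gcs_uri("my-data-lake-bucket", "raw/sales.csv"))
```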
Module 4: Building a Data Warehouse using BigQuery
- Requirements of the modern data warehouse
- Introduction to BigQuery
- Loading Data into BigQuery from various Data Sources
- Different Data Formats
- Schema Design with different Data Types
- Nested and Repeated Fields for denormalized Tables
- Optimizing with Partitioning and Clustering for efficiency
- Secure data using Authorized Views
- Controlling Access to Dataset using IAM
- Creating datasets and tables
- Loading and querying data in tables
- Creating partitioned and clustered tables
- Denormalizing data using nested and repeated data types
- Optimizing queries
- Creating authorized views
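To make the schema-design topics above concrete, here is a sketch of a denormalized orders table in BigQuery's JSON schema format (a nested `RECORD` field with `REPEATED` mode for the line items) alongside the equivalent DDL with partitioning and clustering. The dataset, table, and field names are illustrative only.

```python
import json

# BigQuery JSON schema: line items are a nested, repeated record,
# so one row holds a whole order (denormalized).
orders_schema = [
    {"name": "order_id",    "type": "STRING",  "mode": "REQUIRED"},
    {"name": "order_date",  "type": "DATE",    "mode": "REQUIRED"},
    {"name": "customer_id", "type": "STRING",  "mode": "REQUIRED"},
    {"name": "line_items",  "type": "RECORD",  "mode": "REPEATED",
     "fields": [
         {"name": "sku",      "type": "STRING",  "mode": "REQUIRED"},
         {"name": "quantity", "type": "INTEGER", "mode": "REQUIRED"},
         {"name": "price",    "type": "NUMERIC", "mode": "NULLABLE"},
     ]},
]

# The same table as DDL, partitioned by date and clustered by customer
# so queries filtered on those columns scan less data.
create_table_ddl = """
CREATE TABLE mydataset.orders (
  order_id STRING NOT NULL,
  order_date DATE NOT NULL,
  customer_id STRING NOT NULL,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
)
PARTITION BY order_date
CLUSTER BY customer_id
"""

print(json.dumps(orders_schema, indent=2))
```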
Module 5: Introduction to Data Processing
- Why preprocess data
- Quality considerations for a data warehouse
- ETL to solve data quality issues
- Data Processing Tools on GCP
Module 6: Cloud Dataproc as ETL Tool
- The Hadoop ecosystem
- Cloud Dataproc as managed Hadoop Cluster
- Running Hadoop jobs (Pig, Hive, and Spark) on Dataproc
- Data Storage in GCS instead of HDFS
- Optimizing Hadoop Jobs on Dataproc
- Creating a Hadoop cluster with Dataproc from the command line and the Console
- Running a Spark job on Dataproc, reading and writing data in GCS
- Running a Spark job on Dataproc, reading and writing data in BigQuery
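Submitting the Spark lab job to an existing cluster is a one-line `gcloud` invocation; the sketch below builds it as an argument list so the flags are easy to read. The cluster name, region, and GCS path are placeholders to substitute with your own.

```python
import shlex

# Placeholder cluster/region/paths -- replace with your own values.
submit_cmd = [
    "gcloud", "dataproc", "jobs", "submit", "pyspark",
    "gs://my-bucket/jobs/wordcount.py",   # PySpark job file staged in GCS
    "--cluster=my-cluster",
    "--region=us-central1",
]

# To actually run it (requires the gcloud SDK and credentials):
# import subprocess; subprocess.run(submit_cmd, check=True)
print(shlex.join(submit_cmd))
```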
Module 7: Serverless Data Processing with Cloud Dataflow
- Apache Beam as Unified Platform
- Apache Beam Features for Batch Data
- Cloud Dataflow as Execution Environment
- Running Batch Data pipelines on Dataflow
- Dataflow Templates
- Dataflow SQL
- Create a data pipeline job using Apache Beam and run on the local machine
- Running a data pipeline job on Dataflow, reading data from GCS and writing data into BigQuery
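To build intuition for how a batch Beam pipeline composes transforms, the plain-Python stand-in below mirrors the semantics of `beam.FlatMap`, `beam.Map`, and `beam.CombinePerKey` for a word count. On Dataflow the same steps would run as a distributed pipeline; this local version exists only to show what each stage computes.

```python
from collections import defaultdict

def run_wordcount(lines):
    """Local stand-in for a Beam word-count pipeline:
    FlatMap(split) -> Map(to (word, 1)) -> CombinePerKey(sum)."""
    # FlatMap: each input line fans out into many words.
    words = [w.lower() for line in lines for w in line.split()]
    # Map: pair each word with a count of 1.
    pairs = [(w, 1) for w in words]
    # CombinePerKey(sum): aggregate the counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = run_wordcount(["to be or not to be"])
print(counts)  # -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```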
Module 8: Manage Data Pipelines with Cloud Data Fusion and Cloud Composer
- Building batch data pipelines visually with Cloud Data Fusion
- Scheduling workflows across GCP services with Cloud Composer
- The Apache Airflow environment
- DAGs and Operators
- Monitoring and logging using Stackdriver
- Create and run Data pipeline on Data Fusion
- Automating the creation and termination of a Dataproc cluster and running a Spark job using Cloud Composer
Module 9: Serverless Messaging with Cloud Pub/Sub
- Cloud Pub/Sub
- Create a topic on Pub/Sub
- Create a producer to write messages to a Pub/Sub topic
- Create a consumer to read messages from a Pub/Sub subscription
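One detail the producer/consumer labs run into is that Pub/Sub message payloads are bytes, delivered base64-encoded in push-subscription bodies. The sketch below encodes a message envelope on the producer side and decodes it on the consumer side; the sensor payload and attribute values are made up for illustration.

```python
import base64
import json

def encode_message(data: str, attributes: dict) -> dict:
    """Build a Pub/Sub-style message envelope: data is base64-encoded."""
    return {
        "data": base64.b64encode(data.encode("utf-8")).decode("ascii"),
        "attributes": attributes,
    }

def decode_message(message: dict) -> str:
    """Recover the original payload from the envelope."""
    return base64.b64decode(message["data"]).decode("utf-8")

# Hypothetical sensor reading round-tripped through an envelope:
msg = encode_message(json.dumps({"sensor": "s1", "temp": 21.5}),
                     {"origin": "demo"})
payload = json.loads(decode_message(msg))
print(payload)  # -> {'sensor': 's1', 'temp': 21.5}
```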
Module 10: Cloud Dataflow Streaming Features
- Cloud Dataflow/Apache Beam Streaming Features
- Windowing Functions
- Watermark and late data
- Building a streaming application by reading data from Pub/Sub, preprocessing it using Dataflow, and storing the results into BigQuery
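The windowing and watermark concepts can be illustrated without any cloud services. The simplified sketch below assigns timestamped events to fixed (tumbling) windows and drops events that arrive behind the watermark by more than the allowed lateness, roughly mirroring what Beam's `FixedWindows` plus an allowed-lateness setting decide; the event data and parameters are invented for the example.

```python
def window_start(ts: int, size: int) -> int:
    """Fixed (tumbling) windows: an event at ts falls in [start, start + size)."""
    return ts - ts % size

def assign(events, size, watermark, allowed_lateness=0):
    """Group (timestamp, value) events into fixed windows; discard events
    arriving behind the watermark by more than allowed_lateness
    (a simplification of Beam's late-data handling)."""
    windows, dropped = {}, []
    for ts, value in events:
        if ts < watermark - allowed_lateness:
            dropped.append((ts, value))   # too late: discarded
            continue
        windows.setdefault(window_start(ts, size), []).append(value)
    return windows, dropped

# 5-unit windows, watermark at t=2: the event at t=1 is late and dropped.
windows, dropped = assign([(3, "a"), (7, "b"), (12, "c"), (1, "d")],
                          size=5, watermark=2)
print(windows, dropped)  # -> {0: ['a'], 5: ['b'], 10: ['c']} [(1, 'd')]
```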
Module 11: Visualization Tool on GCP
- Data Studio
- Data Studio Sources
Module 12: Data Profiling Tool on GCP - Dataprep
- Cleaning and profiling data with Cloud Dataprep
- Combining multiple datasets using Cloud Dataprep
- Computing the results of formulas in Cloud Dataprep
Module 13: Datalab
- Installing and setting up Datalab
- Setting up a data science project on Datalab
- Reading data from Google Cloud Storage
Module 14: Cloud Functions
- What are event-driven microservices?
- How to trigger functions based upon events in GCP
- Cloud Functions Features
- Demo: Converting images into thumbnails in response to an event through a Cloud Function
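The thumbnail demo is a background function triggered by a Cloud Storage object-finalize event; the `event` dict carries the object's `bucket` and `name`. The sketch below shows only the handler's shape: the function and output-prefix names are our own choices, and the actual image resizing (e.g. with Pillow) is elided.

```python
def make_thumbnail(event, context):
    """Background Cloud Function fired when an object is finalized in
    a bucket; `event` carries the GCS object metadata."""
    bucket = event["bucket"]
    name = event["name"]
    if name.startswith("thumbnails/"):
        return None                # skip our own output to avoid re-triggering
    thumb_name = f"thumbnails/{name}"
    # Real code would download gs://{bucket}/{name}, resize the image
    # (e.g. with Pillow), and upload the result as thumb_name.
    print(f"would write gs://{bucket}/{thumb_name}")
    return thumb_name

# Local smoke test with a fake event:
thumb = make_thumbnail({"bucket": "my-bucket", "name": "photos/cat.jpg"}, None)
```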
Note: Every module includes two or more hands-on labs on Google Cloud Platform.
For inquiry call: 9910043510