Data Engineering on Google Cloud Platform (DEGCP) – Outline

Detailed Course Outline

Module 01 - Introduction to Data Engineering

Topics:

  • Explore the role of a data engineer
  • Analyze data engineering challenges
  • Introduction to BigQuery
  • Data lakes and data warehouses
  • Transactional databases versus data warehouses
  • Partner effectively with other data teams
  • Manage data access and governance
  • Build production-ready pipelines
  • Review Google Cloud customer case study

Objectives:

  • Understand the role of a data engineer
  • Discuss benefits of doing data engineering in the cloud
  • Discuss the challenges of data engineering practice and how building data pipelines in the cloud helps address them
  • Review and understand the purpose of a data lake versus a data warehouse, and when to use which

Activities:

  • Lab: Using BigQuery to do Analysis

Module 02 - Building a Data Lake

Topics:

  • Introduction to data lakes
  • Data storage and ETL options on Google Cloud
  • Building a data lake using Cloud Storage
  • Securing Cloud Storage
  • Storing all sorts of data types
  • Cloud SQL as a relational data lake

Objectives:

  • Understand why Cloud Storage is a great option for building a data lake on Google Cloud
  • Learn how to use Cloud SQL for a relational data lake

Activities:

  • Lab: Loading Taxi Data into Cloud SQL

Module 03 - Building a Data Warehouse

Topics:

  • The modern data warehouse
  • Introduction to BigQuery
  • Getting started with BigQuery
  • Loading data
  • Exploring schemas
  • Schema design
  • Nested and repeated fields
  • Optimizing with partitioning and clustering

Objectives:

  • Discuss requirements of a modern warehouse
  • Understand why BigQuery is the scalable data warehousing solution on Google Cloud
  • Understand core concepts of BigQuery and review options of loading data into BigQuery

Activities:

  • Lab: Loading Data into BigQuery
  • Lab: Working with JSON and Array Data in BigQuery

Module 04 - Introduction to Building Batch Data Pipelines

Topics:

  • EL, ELT, ETL
  • Quality considerations
  • How to carry out operations in BigQuery
  • Shortcomings
  • ETL to solve data quality issues

Objectives:

  • Review different methods of loading data into your data lakes and warehouses: EL, ELT, and ETL
  • Discuss data quality considerations and when to use ETL instead of EL and ELT

Module 05 - Executing Spark on Dataproc

Topics:

  • The Hadoop ecosystem
  • Run Hadoop on Dataproc
  • Cloud Storage instead of HDFS
  • Optimize Dataproc

Objectives:

  • Review the parts of the Hadoop ecosystem
  • Learn how to lift and shift your existing Hadoop workloads to the cloud using Dataproc
  • Understand considerations around using Cloud Storage instead of HDFS for storage
  • Learn how to optimize Dataproc jobs

Activities:

  • Lab: Running Apache Spark jobs on Dataproc

Module 06 - Serverless Data Processing with Dataflow

Topics:

  • Introduction to Dataflow
  • Why customers value Dataflow
  • Dataflow pipelines
  • Aggregating with GroupByKey and Combine
  • Side inputs and windows
  • Dataflow templates
  • Dataflow SQL

Objectives:

  • Understand how to decide between Dataflow and Dataproc for processing data pipelines
  • Understand the features that customers value in Dataflow
  • Discuss core concepts in Dataflow
  • Review the use of Dataflow templates and SQL

Activities:

  • Lab: A Simple Dataflow Pipeline (Python/Java)
  • Lab: MapReduce in Dataflow (Python/Java)
  • Lab: Side Inputs (Python/Java)
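
Illustration for the topic "Aggregating with GroupByKey and Combine": a minimal plain-Python sketch of the difference between the two ideas. It is not Dataflow or Apache Beam code; the function names `group_by_key` and `combine_per_key` and the sample ride records are hypothetical and exist only for this example. Grouping collects every value for a key before anything is aggregated, while combining folds each value into a running result per key.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value seen for each key (the grouping idea)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def combine_per_key(pairs, combine_fn):
    """Fold values into one running result per key (the combining idea)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            totals[key] = combine_fn(totals[key], value)
        else:
            totals[key] = value
    return totals


# Hypothetical (key, value) records: pickup zone and fare.
rides = [("airport", 30), ("downtown", 12), ("airport", 45), ("downtown", 8)]

grouped = group_by_key(rides)
summed = combine_per_key(rides, lambda a, b: a + b)

print(grouped)
print(summed)
```

Because the combining version keeps only one running value per key, it never needs to hold the full list of values, which is the usual reason to prefer a combine step when the aggregation is associative and commutative.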

Module 07 - Manage Data Pipelines with Cloud Data Fusion and Cloud Composer

Topics:

  • Building batch data pipelines visually with Cloud Data Fusion
  • Components
  • UI overview
  • Building a pipeline
  • Exploring data using Wrangler
  • Orchestrating work between Google Cloud services with Cloud Composer
  • Apache Airflow environment
  • DAGs and operators
  • Workflow scheduling
  • Monitoring and logging

Objectives:

  • Discuss how to manage your data pipelines with Data Fusion and Cloud Composer
  • Understand Data Fusion’s visual design capabilities
  • Learn how Cloud Composer can help to orchestrate the work across multiple Google Cloud services

Activities:

  • Lab: Building and Executing a Pipeline Graph in Data Fusion
  • Optional Lab: An Introduction to Cloud Composer

Module 08 - Introduction to Processing Streaming Data

Topics:

  • Processing streaming data

Objectives:

  • Explain streaming data processing
  • Describe the challenges with streaming data
  • Identify the Google Cloud products and tools that can help address streaming data challenges

Module 09 - Serverless Messaging with Pub/Sub

Topics:

  • Introduction to Pub/Sub
  • Pub/Sub push versus pull
  • Publishing with Pub/Sub code

Objectives:

  • Describe the Pub/Sub service
  • Understand how Pub/Sub works
  • Gain hands-on Pub/Sub experience with a lab that simulates real-time streaming sensor data

Activities:

  • Lab: Publish Streaming Data into Pub/Sub

Module 10 - Dataflow Streaming Features

Topics:

  • Streaming data challenges
  • Dataflow windowing

Objectives:

  • Understand the Dataflow service
  • Build a stream processing pipeline for live traffic data
  • Demonstrate how to handle late data using watermarks, triggers, and accumulation

Activities:

  • Lab: Streaming Data Pipelines

Module 11 - High-Throughput BigQuery and Bigtable Streaming Features

Topics:

  • Streaming into BigQuery and visualizing results
  • High-throughput streaming with Cloud Bigtable
  • Optimizing Cloud Bigtable performance

Objectives:

  • Learn how to perform ad hoc analysis on streaming data using BigQuery and dashboards
  • Understand how Cloud Bigtable is a low-latency solution
  • Describe how to architect for Bigtable and how to ingest data into Bigtable
  • Highlight performance considerations for the relevant services

Activities:

  • Lab: Streaming Analytics and Dashboards
  • Lab: Streaming Data Pipelines into Bigtable

Module 12 - Advanced BigQuery Functionality and Performance

Topics:

  • Analytic window functions
  • Using WITH clauses
  • GIS functions
  • Performance considerations

Objectives:

  • Review some of BigQuery’s advanced analysis capabilities
  • Discuss ways to improve query performance

Activities:

  • Lab: Optimizing your BigQuery Queries for Performance
  • Optional Lab: Partitioned Tables in BigQuery
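
Illustration for the topics "Analytic window functions" and "Using WITH clauses": a small sketch that combines a WITH clause and a window function in one query. As an assumption made for portability, it runs against an in-memory SQLite database from the Python standard library rather than BigQuery, and it needs SQLite 3.25 or later for window functions. The `trips` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (station TEXT, day INTEGER, rides INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("A", 1, 10), ("A", 2, 15), ("A", 3, 5), ("B", 1, 7), ("B", 2, 3)],
)

query = """
WITH daily AS (              -- WITH clause: a named intermediate result
    SELECT station, day, SUM(rides) AS rides
    FROM trips
    GROUP BY station, day
)
SELECT station, day, rides,
       -- Window function: running total per station, ordered by day
       SUM(rides) OVER (PARTITION BY station ORDER BY day) AS running_total
FROM daily
ORDER BY station, day
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The WITH clause names the intermediate aggregation so the outer query stays readable, and the window function adds a per-station running total without collapsing the rows the way GROUP BY would.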

Module 13 - Introduction to Analytics and AI

Topics:

  • What is AI?
  • From ad hoc data analysis to data-driven decisions
  • Options for ML models on Google Cloud

Objectives:

  • Understand the proposition that ML adds value to your data
  • Understand the relationship between ML, AI, and deep learning
  • Identify ML options on Google Cloud

Module 14 - Prebuilt ML Model APIs for Unstructured Data

Topics:

  • Unstructured data is hard
  • ML APIs for enriching data

Objectives:

  • Discuss challenges when working with unstructured data
  • Learn the applications of ready-to-use ML APIs on unstructured data

Activities:

  • Lab: Using the Natural Language API to Classify Unstructured Text

Module 15 - Big Data Analytics with Notebooks

Topics:

  • What’s a notebook?
  • BigQuery magic and ties to Pandas

Objectives:

  • Introduce Notebooks as a tool for prototyping ML solutions
  • Learn to execute BigQuery commands from Notebooks

Activities:

  • Lab: BigQuery in Jupyter Labs on AI Platform

Module 16 - Production ML Pipelines

Topics:

  • Ways to do ML on Google Cloud
  • Vertex AI Pipelines
  • AI Hub

Objectives:

  • Describe options available for building custom ML models
  • Understand the use of tools like Vertex AI Pipelines

Activities:

  • Lab: Running Pipelines on Vertex AI

Module 17 - Custom Model Building with SQL in BigQuery ML

Topics:

  • BigQuery ML for quick model building
  • Supported models

Objectives:

  • Learn how to create ML models by using SQL syntax in BigQuery
  • Demonstrate building different kinds of ML models using BigQuery ML

Activities:

  • Lab option 1: Predict Bike Trip Duration with a Regression Model in BigQuery ML
  • Lab option 2: Movie Recommendations in BigQuery ML

Module 18 - Custom Model Building with AutoML

Topics:

  • Why AutoML?
  • AutoML Vision
  • AutoML NLP
  • AutoML Tables

Objectives:

  • Explore various AutoML products used in machine learning
  • Learn to use AutoML to create powerful models without coding