Data Engineering on AWS

Last Update: October 7, 2025

Course Overview:

The Data Engineering on AWS course prepares participants for the AWS Certified Data Engineer – Associate certification. This comprehensive course builds a solid foundation in data engineering principles, tools, and best practices so that you can perform the data engineering role in the AWS Cloud. You will progress from an introduction to the role itself to a deeper dive into building solutions with AWS services.

Course Objectives:

By completing this course, you will gain knowledge and skills in core data-related AWS services, along with the ability to implement data pipelines, monitor and troubleshoot issues, and optimize cost and performance in accordance with best practices.

Expected Outcomes:
Upon successful completion of this training, participants will be able to:

  • Understand the Data Engineering Role & Ecosystem:
    • Define the responsibilities of a data engineer.
    • Identify key personas and collaboration requirements.
  • Design and Build Scalable Data Architectures:
    • Evaluate and select AWS services for data lakes, data warehouses, and streaming architectures.
    • Apply best practices for designing secure, cost-efficient, and scalable solutions.
  • Develop and Automate Data Pipelines:
    • Orchestrate and automate batch and streaming data pipelines using services like AWS Glue, EMR, Step Functions, and Kinesis (a minimal run-and-poll sketch follows this list).
    • Implement CI/CD and IaC tools (e.g., AWS SAM, CloudFormation) for pipeline automation.
  • Secure, Monitor, and Troubleshoot Pipelines:
    • Apply AWS security tools and practices to secure data solutions.
    • Use AWS monitoring and alerting services to track performance and troubleshoot issues.
  • Optimize Performance and Costs:
    • Leverage tools for performance tuning, cost analysis, and optimization (e.g., Redshift tuning, AWS Cost Explorer).
    • Automate scaling, fault-tolerance, and pipeline improvements.
  • Hands-on Labs for Real-World Implementation:
    • Practice building batch and streaming pipelines using Amazon S3, Athena, Redshift, EMR, Glue, Kinesis, and MSK.
    • Gain experience solving real analytics challenges through guided labs.
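
As a concrete taste of the pipeline automation described above, here is a minimal boto3 sketch that starts an AWS Glue job run and polls it to completion. The job name is hypothetical; it assumes a Glue job is already defined in your account.

    import time
    import boto3

    glue = boto3.client("glue")

    JOB_NAME = "nightly-transform"  # hypothetical job name

    run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]

    # Poll until the run reaches a terminal state.
    while True:
        state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(30)

    print(f"Job run {run_id} finished with state {state}")

In production, a polling loop like this would typically be replaced by an orchestrator such as AWS Step Functions or Glue workflows, both covered later in the course.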

Target Audience

This course is intended for professionals who build and manage data pipelines in the cloud, such as data engineers, cloud engineers, ETL developers, and tech leads.

Prerequisites:

We recommend that attendees of this course have:

  • Completed AWS Cloud Practitioner Essentials or an equivalent course.
  • Prior experience with AWS core services.
  • Programming experience in one of the following languages: Python, .NET, or Java.

Day 1

Module 1: Foundations – Roles and Concepts

  • Introduction
  • Data Discovery
  • AWS Data Services and Modern Data Architecture
  • Orchestration and Automation Options

Module 2: Foundations – Tools and Considerations

  • Continuous Integration and Continuous Delivery Tools
  • Infrastructure as Code Tools
  • AWS Serverless Application Model
  • Networking Considerations
  • Cost Optimization Tools

Module 3: A Data Lake Solution – Building a Data Lake Solution

  • Set Up Storage
  • Ingest Data
  • Build Data Catalog
  • Transform Data
  • Serve Data for Consumption

Lab 1: Setting up a Data Lake on AWS

  • Use Amazon S3 as the storage layer of a data lake.
  • Organize data into layers (or zones) in Amazon S3.
  • Configure an S3 event notification to invoke an AWS Lambda function (a minimal handler sketch follows this list).
  • Create an Amazon EventBridge rule to invoke the Lambda function.
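
The lab wires an S3 event notification to a Lambda function. As a minimal sketch (not the lab's provided code), a handler that processes such notifications might look like this; note that S3 delivers object keys URL-encoded:

    import json
    import urllib.parse

    def lambda_handler(event, context):
        """Log each object referenced by an S3 event notification."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # S3 URL-encodes object keys in event payloads.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            print(f"New object landed: s3://{bucket}/{key}")
        return {"statusCode": 200, "body": json.dumps("ok")}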

Day 2

Module 4: A Data Lake Solution – Optimizing and Securing a Data Lake Solution

  • Open Table Formats
  • Security Using AWS Lake Formation
  • Troubleshooting

Lab 2: Automate Data Lake Creation using AWS Lake Formation Blueprints

  • Create an AWS Glue workflow using a Lake Formation blueprint.
  • Automate the Lake Formation data lake setup process with an AWS Glue workflow (a minimal start-and-poll sketch follows this list).
  • Create a custom AWS Glue workflow.
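
Once a blueprint has generated a workflow, it can be run and monitored programmatically. A minimal boto3 sketch, with a hypothetical workflow name:

    import boto3

    glue = boto3.client("glue")

    # Hypothetical name; blueprints assign a workflow name at creation.
    WORKFLOW = "datalake-ingest-workflow"

    run_id = glue.start_workflow_run(Name=WORKFLOW)["RunId"]
    run = glue.get_workflow_run(Name=WORKFLOW, RunId=run_id)["Run"]
    print(run["Status"])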

Module 5: A Data Warehouse Solution – Building a Data Warehouse Solution

  • Designing the Data Warehouse Solution
  • Ingesting Data
  • Processing Data
  • Serving Data for Consumption

Lab 3: Setting up a Data Warehouse using Amazon Redshift Serverless

  • Create a data warehouse with Amazon Redshift Serverless.
  • Create a schema.
  • Create a table.
  • Load the table with sample data (a minimal Redshift Data API sketch follows this list).
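
A minimal sketch of these steps using the Redshift Data API against a serverless workgroup. The workgroup, database, bucket, and IAM role names are placeholders:

    import boto3

    rsd = boto3.client("redshift-data")

    WG, DB = "my-serverless-wg", "dev"  # hypothetical workgroup and database
    COPY_ROLE = "arn:aws:iam::123456789012:role/RedshiftCopyRole"  # placeholder

    statements = [
        "CREATE SCHEMA IF NOT EXISTS sales;",
        "CREATE TABLE IF NOT EXISTS sales.orders "
        "(order_id INT, amount DECIMAL(10,2), region VARCHAR(16));",
        # COPY loads sample data from S3 in parallel.
        f"COPY sales.orders FROM 's3://my-sample-bucket/orders/' "
        f"IAM_ROLE '{COPY_ROLE}' FORMAT AS CSV;",
    ]

    for sql in statements:
        resp = rsd.execute_statement(WorkgroupName=WG, Database=DB, Sql=sql)
        print(resp["Id"], "submitted")

Note that execute_statement is asynchronous; describe_statement can be used to check each statement's status.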

Module 6: A Data Warehouse Solution – Optimizing and Securing a Data Warehouse Solution

  • Monitoring and Optimizing Options
  • Orchestration Options
  • Security and Governance Options

Lab 4: Managing Access Control in Amazon Redshift

  • Create and manage users and roles.
  • Apply and manage column-level security.
  • Apply and manage row-level security.
  • Configure dynamic data masking (a sketch of the core SQL statements follows this list).
  • Review audit logs.
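
A sketch of the core SQL behind these controls, issued through the Redshift Data API and reusing the hypothetical sales.orders table from Lab 3. The exact policy expressions are illustrative:

    import boto3

    rsd = boto3.client("redshift-data")
    WG, DB = "my-serverless-wg", "dev"  # hypothetical names from Lab 3

    statements = [
        # Role-based access control.
        "CREATE ROLE analyst;",
        # Column-level security: grant access to selected columns only.
        "GRANT SELECT (order_id, region) ON sales.orders TO ROLE analyst;",
        # Row-level security: analysts see US rows only.
        "CREATE RLS POLICY us_only WITH (region VARCHAR(16)) USING (region = 'US');",
        "ATTACH RLS POLICY us_only ON sales.orders TO ROLE analyst;",
        "ALTER TABLE sales.orders ROW LEVEL SECURITY ON;",
        # Dynamic data masking: hide amounts from analysts.
        "CREATE MASKING POLICY hide_amount WITH (amount DECIMAL(10,2)) USING (NULL);",
        "ATTACH MASKING POLICY hide_amount ON sales.orders(amount) TO ROLE analyst;",
    ]

    for sql in statements:
        rsd.execute_statement(WorkgroupName=WG, Database=DB, Sql=sql)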

Day 3

Module 7: A Batch Data Pipeline Solution – Building a Batch Data Pipeline

  • Designing the Batch Data Pipeline
  • Ingesting Data

Module 8: A Batch Data Pipeline Solution – Implementing the Batch Data Pipeline

  • Processing and Transforming Data
  • Cataloging Data
  • Serving Data for Consumption

Lab 5: A Day in the Life of a Data Engineer

  • Create an AWS Glue crawler.
  • Create and run a job in AWS Glue Studio.
  • Explore permissions required to run AWS Glue crawlers and AWS Glue Studio jobs.
  • Query the AWS Glue Data Catalog using Amazon Athena (a minimal crawler-and-query sketch follows this list).
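
A minimal boto3 sketch of the crawler-and-query flow; the names, role, and S3 paths are placeholders for whatever the lab provisions:

    import boto3

    glue = boto3.client("glue")
    athena = boto3.client("athena")

    # Crawl an S3 prefix into the Data Catalog.
    glue.create_crawler(
        Name="orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        DatabaseName="lakehouse",
        Targets={"S3Targets": [{"Path": "s3://my-sample-bucket/orders/"}]},
    )
    glue.start_crawler(Name="orders-crawler")

    # After the crawler populates the catalog, query the table with Athena.
    athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM lakehouse.orders;",
        ResultConfiguration={"OutputLocation": "s3://my-sample-bucket/athena-results/"},
    )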

Module 9: A Batch Data Pipeline Solution – Optimizing, Orchestrating, and Securing Batch Data Pipelines

  • Optimizing the Batch Data Pipeline
  • Orchestrating the Batch Data Pipeline
  • Securing and Governance of the Batch Data Pipeline

Lab 6: Orchestrate data processing in Spark using AWS Step Functions

  • Use Amazon Simple Storage Service (Amazon S3) Event Notifications and AWS Lambda to automate the batch processing of data.
  • Use the Step Functions state machine language (a pared-down definition follows this list) to:
    • Create an on-demand Amazon EMR cluster.
    • Add an Apache Spark step job in Amazon EMR and create an Amazon Athena table to query the processed data.
    • Add an Amazon SNS topic to send a notification.
  • Validate a Step Functions state machine run.
  • Review an AWS Glue table and validate the processed data using Athena.
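
A pared-down sketch of such a state machine, defined in Python and registered with boto3. All ARNs, names, and the S3 script path are placeholders, and the Athena table step is omitted for brevity:

    import json
    import boto3

    definition = {
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                # .sync waits until the cluster is ready before moving on.
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "spark-batch",
                    "ReleaseLabel": "emr-7.0.0",
                    "Applications": [{"Name": "Spark"}],
                    "Instances": {
                        "MasterInstanceType": "m5.xlarge",
                        "InstanceCount": 1,
                        "KeepJobFlowAliveWhenNoSteps": True,
                    },
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                },
                "ResultPath": "$.cluster",
                "Next": "RunSparkStep",
            },
            "RunSparkStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "transform",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "s3://my-sample-bucket/jobs/transform.py"],
                        },
                    },
                },
                "Next": "Notify",
            },
            "Notify": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-events",
                    "Message": "Batch pipeline complete",
                },
                "End": True,
            },
        },
    }

    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="batch-pipeline",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/StepFunctionsEMRRole",  # placeholder
    )

A production version would also terminate the cluster (elasticmapreduce:terminateCluster) and add error handling with Retry and Catch fields.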

Day 4

Module 10: A Streaming Data Pipeline Solution – Building a Streaming Data Pipeline Solution

  • Ingesting Data from Stream Sources
  • Storing Streaming Data
  • Processing Data
  • Analyzing Data

Lab 7: Streaming Analytics with Amazon Managed Service for Apache Flink

  • Build a real-time streaming analytics pipeline in Managed Apache Flink Studio, using Apache Flink and Apache Zeppelin to ingest, enrich, and analyze clickstream data with catalog data stored in Amazon S3.
  • Perform interactive data analytics and visualize the results using Apache Zeppelin notebooks with Managed Apache Flink Studio.
  • Output the data to a Kinesis data stream for further downstream processing (a minimal consumer sketch for validating the output follows this list).
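
As a small validation aid (not part of the Flink application itself), a boto3 consumer can read back a few records from the hypothetical output stream:

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "clickstream-enriched"  # hypothetical output stream name

    # Read from the first shard, starting at the oldest available record.
    shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
        print(record["Data"])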

Module 11: A Streaming Data Pipeline Solution – Optimizing and Securing a Streaming Data Pipeline Solution

  • Optimization
  • Security and Governance

Lab 8: Introduction to Access Control with Amazon Managed Streaming for Apache Kafka

  • Publish to and consume from an MSK cluster using IAM-authenticated broker Uniform Resource Locators (URLs) with a Java demo producer and a Java demo consumer (a sketch for retrieving these URLs follows this list).
  • Learn about the IAM method to authenticate and authorize users of an MSK cluster.
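
The Java demo clients need the cluster's IAM-authenticated bootstrap brokers (port 9098, SASL_SSL with the AWS_MSK_IAM mechanism). A minimal boto3 sketch for retrieving them; the cluster ARN is a placeholder:

    import boto3

    kafka = boto3.client("kafka")

    CLUSTER_ARN = "arn:aws:kafka:us-east-1:123456789012:cluster/demo/abcd1234"  # placeholder

    brokers = kafka.get_bootstrap_brokers(ClusterArn=CLUSTER_ARN)
    # These URLs go into the Java producer/consumer bootstrap.servers property.
    print(brokers["BootstrapBrokerStringSaslIam"])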

Course Wrap-up

  • Course overview
  • AWS training courses
  • Certifications
  • Course feedback

Price From: RM5,400.00

