Practice Free Amazon DEA-C01 Exam Online Questions
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
- A . Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.
- B . Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.
- C . Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.
- D . Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.
B
Explanation:
Problem Analysis:
The company needs near real-time replication of MySQL updates to Amazon Redshift.
Minimal development effort is required for this solution.
Key Considerations:
AWS DMS provides a full load + CDC (Change Data Capture) mode for continuous replication of database changes.
DMS integrates natively with both MySQL and Redshift, simplifying setup.
Solution Analysis:
Option A: AWS Glue Job
Glue is batch-oriented and does not support near real-time replication.
Option B: DMS with Full Load + CDC
Efficiently handles initial database load and continuous updates.
Requires minimal setup and operational overhead.
Option C: AppFlow SDK
AppFlow is not designed for database replication. Custom connectors increase development effort.
Option D: DataSync
DataSync is for file synchronization and not suitable for database updates.
Final Recommendation:
Use AWS DMS in full load + CDC mode for continuous replication.
AWS Database Migration Service Documentation
Setting Up DMS with Redshift
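For illustration, a minimal boto3 sketch of creating the full load plus CDC task; the endpoint and replication instance ARNs, task name, and schema name are placeholder assumptions, not values from the scenario:

```python
import json
import boto3

dms = boto3.client("dms")

# Endpoints and replication instance are assumed to exist already (placeholder ARNs)
response = dms.create_replication_task(
    ReplicationTaskIdentifier="plm-mysql-to-redshift",                            # hypothetical task name
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",       # MySQL source (placeholder)
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",       # Redshift target (placeholder)
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",     # placeholder
    MigrationType="full-load-and-cdc",   # initial load followed by continuous change data capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-plm-schema",
            "object-locator": {"schema-name": "plm", "table-name": "%"},          # assumed schema name
            "rule-action": "include",
        }]
    }),
)

# Start the task so changes stream continuously into Amazon Redshift
dms.start_replication_task(
    ReplicationTaskArn=response["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```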
A company uses AWS Glue jobs to implement several data pipelines. The pipelines are critical to the company.
The company needs to implement a monitoring mechanism that will alert stakeholders if the pipelines fail.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Create an Amazon EventBridge rule to match AWS Glue job failure events. Configure the rule to target an AWS Lambda function to process events. Configure the function to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.
- B . Configure an Amazon CloudWatch Logs log group for the AWS Glue jobs. Create an Amazon EventBridge rule to match new log creation events in the log group. Configure the rule to target an AWS Lambda function that reads the logs and sends notifications to an Amazon Simple Notification Service (Amazon SNS) topic if AWS Glue job failure logs are present.
- C . Create an Amazon EventBridge rule to match AWS Glue job failure events. Define an Amazon CloudWatch metric based on the EventBridge rule. Set up a CloudWatch alarm based on the metric to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.
- D . Configure an Amazon CloudWatch Logs log group for the AWS Glue jobs. Create an Amazon EventBridge rule to match new log creation events in the log group. Configure the rule to send notifications to an Amazon Simple Notification Service (Amazon SNS) topic.
A
Explanation:
Creating an EventBridge rule that matches AWS Glue job failure events and triggers a Lambda function, which then publishes notifications to an Amazon SNS topic, is the most direct and operationally efficient approach. AWS Glue emits job state-change events to EventBridge natively, so no log parsing, custom metrics, or additional polling is required.
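A minimal boto3 sketch of the Option A wiring, assuming the Lambda function and SNS topic already exist; the event pattern fields for Glue job state changes are the documented ones, while the rule name and ARNs are placeholders:

```python
import json
import boto3

events = boto3.client("events")

# Match AWS Glue job runs that end in a failure state
events.put_rule(
    Name="glue-job-failure-alerts",                         # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"state": ["FAILED", "TIMEOUT", "ERROR"]},
    }),
    State="ENABLED",
)

# Target the Lambda function that formats the event and publishes to the SNS topic.
# The function's resource policy must also allow events.amazonaws.com to invoke it.
events.put_targets(
    Rule="glue-job-failure-alerts",
    Targets=[{
        "Id": "notify-stakeholders",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:glue-failure-notifier",  # placeholder
    }],
)
```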
A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.
Which solution will meet these requirements with the LEAST management overhead?
- A . Amazon Kinesis Data Streams
- B . Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster
- C . Amazon Data Firehose
- D . Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless
D
Explanation:
Problem Analysis:
The company needs to migrate both an application and an on-premises Apache Kafka server to AWS.
Incremental updates from an on-premises Oracle database are processed by Kafka.
The solution must follow a replatform migration strategy, prioritizing minimal changes and low management overhead.
Key Considerations:
Replatform Strategy: This approach keeps the application and architecture as close to the original as possible, reducing the need for refactoring.
The solution must provide a managed Kafka service to minimize operational burden. Low-overhead solutions such as serverless services are preferred.
Solution Analysis:
Option A: Kinesis Data Streams
Kinesis Data Streams is an AWS-native streaming service but is not a direct substitute for Kafka.
This option would require significant application refactoring, which does not align with the replatform strategy.
Option B: MSK Provisioned Cluster
Managed Kafka service with fully configurable clusters.
Provides the same Kafka APIs but requires cluster management (e.g., scaling, patching), increasing management overhead.
Option C: Amazon Data Firehose
Amazon Data Firehose is designed for data delivery to destinations such as Amazon S3 and Amazon Redshift rather than for hosting Kafka-style streaming workloads.
Not suitable for Kafka-based applications.
Option D: MSK Serverless
MSK Serverless eliminates the need for cluster management while maintaining compatibility with Kafka APIs.
Automatically scales based on workload, reducing operational overhead.
Ideal for replatform migrations, as it requires minimal changes to the application.
Final Recommendation:
Amazon MSK Serverless is the best solution for migrating the Kafka server and application with minimal changes and the least management overhead.
Amazon MSK Serverless Overview
Comparison of Amazon MSK and Kinesis
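As a rough sketch of what Option D involves, the boto3 call below creates an MSK Serverless cluster with IAM authentication; the cluster name, subnet IDs, and security group ID are placeholder assumptions, and the existing Kafka clients keep using standard Kafka APIs against the new bootstrap brokers:

```python
import boto3

kafka = boto3.client("kafka")

# MSK Serverless cluster: no brokers to size, patch, or scale
response = kafka.create_cluster_v2(
    ClusterName="plm-events",                                           # hypothetical cluster name
    Serverless={
        "VpcConfigs": [{
            "SubnetIds": ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"],  # placeholders
            "SecurityGroupIds": ["sg-0123456789abcdef0"],                      # placeholder
        }],
        "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},   # IAM auth is used with MSK Serverless
    },
)
print(response["ClusterArn"])
```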
An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
- A . AWS Lambda
- B . Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C . AWS Step Functions
- D . AWS Glue
B
Explanation:
The ecommerce company wants to migrate its data pipelines into the AWS Cloud without managing servers, and the solution must orchestrate Python and Bash scripts without refactoring code. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is the most suitable solution for this scenario.
Option B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
MWAA is a managed orchestration service that supports Python and Bash scripts via Directed Acyclic Graphs (DAGs) for workflows. It is a serverless, managed version of Apache Airflow, which is commonly used for orchestrating complex data workflows, making it an ideal choice for migrating existing pipelines without refactoring. It supports Python, Bash, and other scripting languages, and the company would not need to manage the underlying infrastructure.
Other options:
AWS Lambda (Option A) is more suited for event-driven workflows but would require breaking down the pipeline into individual Lambda functions, which may require refactoring.
AWS Step Functions (Option C) is good for orchestration but lacks native support for Python and Bash without using Lambda functions, and it may require code changes.
AWS Glue (Option D) is an ETL service primarily for data transformation and not suitable for
orchestrating general scripts without modification.
Reference: Amazon Managed Workflows for Apache Airflow (MWAA) Documentation
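To illustrate why no refactoring is needed, a minimal Airflow DAG that MWAA could run as-is; the DAG name, script path, schedule, and Python callable are placeholder assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_existing_python_step():
    # Call into the company's existing Python ingestion code unchanged (placeholder)
    pass


with DAG(
    dag_id="onprem_ingestion_pipeline",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="run_bash_extract",
        bash_command="bash /usr/local/airflow/dags/scripts/extract.sh",  # assumed script location
    )
    load = PythonOperator(task_id="run_python_load", python_callable=run_existing_python_step)

    extract >> load   # same ordering the on-premises orchestrator enforced
```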
A manufacturing company wants to collect data from sensors. A data engineer needs to implement a solution that ingests sensor data in near real time.
The solution must store the data to a persistent data store. The solution must store the data in nested JSON format. The company must have the ability to query from the data store with a latency of less than 10 milliseconds.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Use a self-hosted Apache Kafka cluster to capture the sensor data. Store the data in Amazon S3 for querying.
- B . Use AWS Lambda to process the sensor data. Store the data in Amazon S3 for querying.
- C . Use Amazon Kinesis Data Streams to capture the sensor data. Store the data in Amazon DynamoDB for querying.
- D . Use Amazon Simple Queue Service (Amazon SQS) to buffer incoming sensor data. Use AWS Glue to store the data in Amazon RDS for querying.
C
Explanation:
Amazon Kinesis Data Streams is a service that enables you to collect, process, and analyze streaming data in real time. You can use Kinesis Data Streams to capture sensor data from various sources, such as IoT devices, web applications, or mobile apps. You can create data streams that can scale up to handle any amount of data from thousands of producers. You can also use the Kinesis Client Library (KCL) or the Kinesis Data Streams API to write applications that process and analyze the data in the streams [1].
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. You can use DynamoDB to store the sensor data in nested JSON format, as DynamoDB supports document data types, such as lists and maps. You can also use DynamoDB to query the data with a latency of less than 10 milliseconds, as DynamoDB offers single-digit millisecond performance for any scale of data. You can use the DynamoDB API or the AWS SDKs to perform queries on the data, such as using key-value lookups, scans, or queries [2].
The solution that meets the requirements with the least operational overhead is to use Amazon Kinesis Data Streams to capture the sensor data and store the data in Amazon DynamoDB for querying.
This solution has the following advantages:
It does not require you to provision, manage, or scale any servers, clusters, or queues, as Kinesis Data Streams and DynamoDB are fully managed services that handle all the infrastructure for you. This reduces the operational complexity and cost of running your solution.
It allows you to ingest sensor data in near real time, as Kinesis Data Streams can capture data records as they are produced and deliver them to your applications within seconds. You can also use an AWS Lambda function as a stream consumer to write the records to DynamoDB automatically and continuously [3].
It allows you to store the data in nested JSON format, as DynamoDB supports document data types, such as lists and maps. You can also use DynamoDB Streams to capture changes in the data and trigger actions, such as sending notifications or updating other databases.
It allows you to query the data with a latency of less than 10 milliseconds, as DynamoDB offers single-digit millisecond performance for any scale of data. You can also use DynamoDB Accelerator (DAX) to improve the read performance by caching frequently accessed data.
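A short boto3 sketch of this recommended path, with placeholder stream, table, and key names (a composite sensor_id/ts key is assumed): the producer writes a record to Kinesis Data Streams, and a consumer such as a Lambda function stores the nested JSON document in DynamoDB, where key lookups return in single-digit milliseconds.

```python
import json
from decimal import Decimal

import boto3

kinesis = boto3.client("kinesis")
table = boto3.resource("dynamodb").Table("SensorReadings")       # placeholder table name

reading = {
    "sensor_id": "press-017",                                    # assumed partition key
    "ts": "2024-05-01T12:00:00Z",                                # assumed sort key
    "measurements": {                                            # nested JSON document
        "temp_c": Decimal("71.5"),
        "vibration": {"x": Decimal("0.02"), "y": Decimal("0.01")},
    },
}

# Producer side: near-real-time ingestion into the stream
kinesis.put_record(
    StreamName="sensor-stream",                                  # placeholder stream name
    Data=json.dumps(reading, default=str).encode("utf-8"),
    PartitionKey=reading["sensor_id"],
)

# Consumer side (for example, inside a Lambda function triggered by the stream)
table.put_item(Item=reading)

# Single-digit-millisecond key lookup
item = table.get_item(Key={"sensor_id": "press-017", "ts": "2024-05-01T12:00:00Z"})["Item"]
```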
Option A is incorrect because it suggests using a self-hosted Apache Kafka cluster to capture the sensor data and store the data in Amazon S3 for querying. This solution has the following disadvantages:
It requires you to provision, manage, and scale your own Kafka cluster, either on EC2 instances or on-premises servers. This increases the operational complexity and cost of running your solution.
It does not allow you to query the data with a latency of less than 10 milliseconds, as Amazon S3 is an object storage service that is not optimized for low-latency queries. You need to use another service, such as Amazon Athena or Amazon Redshift Spectrum, to query the data in S3, which may incur additional costs and latency.
Option B is incorrect because it suggests using AWS Lambda to process the sensor data and store the data in Amazon S3 for querying. This solution has the following disadvantages:
It does not allow you to ingest sensor data in near real time, as Lambda is a serverless compute service that runs code in response to events. You need to use another service, such as API Gateway or Kinesis Data Streams, to trigger Lambda functions with sensor data, which may add extra latency and complexity to your solution.
It does not allow you to query the data with a latency of less than 10 milliseconds, as Amazon S3 is an object storage service that is not optimized for low-latency queries. You need to use another service, such as Amazon Athena or Amazon Redshift Spectrum, to query the data in S3, which may incur additional costs and latency.
Option D is incorrect because it suggests using Amazon Simple Queue Service (Amazon SQS) to buffer incoming sensor data and use AWS Glue to store the data in Amazon RDS for querying. This solution has the following disadvantages:
It does not allow you to ingest sensor data in near real time on its own, as Amazon SQS is a message queue service that buffers messages rather than streaming them, and standard queues provide only best-effort ordering. You need to use another service, such as Lambda or EC2, to poll the messages from the queue and process them, which may add extra latency and complexity to your solution.
It does not allow you to store the data in nested JSON format, as Amazon RDS is a relational database service that supports structured data types, such as tables and columns. You need to use another service, such as AWS Glue, to transform the data from JSON to relational format, which may add extra cost and overhead to your solution.
[1]: Amazon Kinesis Data Streams – Features
[2]: Amazon DynamoDB – Features
[3]: Using AWS Lambda with Amazon Kinesis – AWS Lambda Developer Guide
[4]: Capturing Table Activity with DynamoDB Streams – Amazon DynamoDB
[5]: Amazon DynamoDB Accelerator (DAX) – Features
[6]: Amazon S3 – Features
[7]: AWS Lambda – Features
[8]: Amazon Simple Queue Service – Features
[9]: Amazon Relational Database Service – Features
[10]: Working with JSON in Amazon RDS – Amazon Relational Database Service
[11]: AWS Glue – Features
A company uses Amazon S3 to store data and Amazon QuickSight to create visualizations.
The company has an S3 bucket in an AWS account named Hub-Account. The S3 bucket is encrypted by an AWS Key Management Service (AWS KMS) key. The company’s QuickSight instance is in a separate account named BI-Account.
The company updates the S3 bucket policy to grant access to the QuickSight service role. The company wants to enable cross-account access to allow QuickSight to interact with the S3 bucket.
Which combination of steps will meet this requirement? (Select TWO.)
- A . Use the existing AWS KMS key to encrypt connections from QuickSight to the S3 bucket.
- B . Add the S3 bucket as a resource that the QuickSight service role can access.
- C . Use AWS Resource Access Manager (AWS RAM) to share the S3 bucket with the BI-Account account.
- D . Add an IAM policy to the QuickSight service role to give QuickSight access to the KMS key that encrypts the S3 bucket.
- E . Add the KMS key as a resource that the QuickSight service role can access.
D,E
Explanation:
Problem Analysis:
The company needs cross-account access to allow QuickSight in BI-Account to interact with an S3 bucket in Hub-Account.
The bucket is encrypted with an AWS KMS key.
Appropriate permissions must be set for both S3 access and KMS decryption.
Key Considerations:
QuickSight requires IAM permissions to access S3 data and decrypt files using the KMS key.
Both S3 and KMS permissions need to be properly configured across accounts.
Solution Analysis:
Option A: Use Existing KMS Key for Encryption
The KMS key encrypts the objects at rest, not the connection between QuickSight and Amazon S3; using it this way does not grant QuickSight the decryption permissions it needs.
Option B: Add S3 Bucket to QuickSight Role
Access to the S3 bucket is already granted by the updated bucket policy, so this step is not one of the missing pieces.
Option C: AWS RAM for Bucket Sharing
AWS RAM does not support sharing S3 buckets; bucket policies and IAM roles suffice for granting cross-account access.
Option D: IAM Policy for KMS Access
QuickSight’s service role in BI-Account needs explicit permissions to use the KMS key for decryption.
Option E: Add KMS Key as Resource for Role
The KMS key must explicitly list the QuickSight role as an entity that can access it.
Implementation Steps:
S3 Bucket Policy in Hub-Account: Add a policy to the S3 bucket granting the QuickSight service role access:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::<BI-Account-ID>:role/service-role/QuickSightRole" },
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::<Bucket-Name>/*"
}
]
}
KMS Key Policy in Hub-Account: Add permissions for the QuickSight role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "AWS": "arn:aws:iam::<BI-Account-ID>:role/service-role/QuickSightRole" },
"Action": [
"kms:Decrypt",
"kms:DescribeKey"
],
"Resource": "*"
}
]
}
IAM Policy for QuickSight Role in BI-Account: Attach the following policy to the QuickSight service role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"kms:Decrypt"
],
"Resource": [
"arn:aws:s3:::<Bucket-Name>/*",
"arn:aws:kms:<region>:<Hub-Account-ID>:key/<KMS-Key-ID>"
]
}
]
}
Setting Up Cross-Account S3 Access
AWS KMS Key Policy Examples
Amazon QuickSight Cross-Account Access
A data engineer maintains custom Python scripts that perform a data formatting process that many AWS Lambda functions use. When the data engineer needs to modify the Python scripts, the data engineer must manually update all the Lambda functions.
The data engineer requires a less manual way to update the Lambda functions.
Which solution will meet this requirement?
- A . Store a pointer to the custom Python scripts in the execution context object in a shared Amazon S3 bucket.
- B . Package the custom Python scripts into Lambda layers. Apply the Lambda layers to the Lambda functions.
- C . Store a pointer to the custom Python scripts in environment variables in a shared Amazon S3 bucket.
- D . Assign the same alias to each Lambda function. Call each Lambda function by specifying the function’s alias.
B
Explanation:
Lambda layers are a way to share code and dependencies across multiple Lambda functions. By packaging the custom Python scripts into Lambda layers, the data engineer can update the scripts in one place and have them automatically applied to all the Lambda functions that use the layer. This reduces the manual effort and ensures consistency across the Lambda functions. The other options are either not feasible or not efficient. Storing a pointer to the custom Python scripts in the execution context object or in environment variables would require the Lambda functions to download the scripts from Amazon S3 every time they are invoked, which would increase latency and cost. Assigning the same alias to each Lambda function would not help with updating the Python scripts,
as the alias only points to a specific version of the Lambda function code.
Reference: AWS Lambda layers
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion
and Transformation, Section 3.4: AWS Lambda
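A hedged sketch of the layer workflow; the layer name, zip path, runtime, and function names are assumptions. The zip must place the shared scripts under a python/ directory so Lambda adds them to the import path:

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish a new layer version containing the shared formatting scripts
# (layer.zip holds python/formatting_utils.py, for example)
layer = lambda_client.publish_layer_version(
    LayerName="data-formatting-scripts",                 # hypothetical layer name
    Content={"ZipFile": open("layer.zip", "rb").read()},
    CompatibleRuntimes=["python3.12"],
)

# Point every consumer function at the new layer version in one pass
for function_name in ["orders-formatter", "returns-formatter"]:   # placeholder function names
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=[layer["LayerVersionArn"]],
    )
```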
A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information.
The data engineer must identify and remove duplicate information from the legacy application data.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame drop_duplicates() function by importing the Pandas library to perform data deduplication.
- B . Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
- C . Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
- D . Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
B
Explanation:
AWS Glue is a fully managed serverless ETL service that can handle data deduplication with minimal operational overhead. AWS Glue provides a built-in ML transform called FindMatches, which can automatically identify and group similar records in a dataset. FindMatches can also generate a primary key for each group of records and remove duplicates. FindMatches does not require any coding or prior ML experience, as it can learn from a sample of labeled data provided by the user. FindMatches can also scale to handle large datasets and optimize the cost and performance of the ETL job.
Reference: AWS Glue
FindMatches ML Transform
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
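As a sketch of how Option B looks inside a Glue job, assuming a FindMatches transform has already been created and trained; the transform ID, catalog names, and output path are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglueml.transforms import FindMatches
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the legacy data registered in the Glue Data Catalog (placeholder names)
legacy = glue_context.create_dynamic_frame.from_catalog(
    database="legacy_app", table_name="customer_records"
)

# Apply the trained FindMatches ML transform; it labels records that belong to
# the same real-world entity with a shared match_id column
matched = FindMatches.apply(frame=legacy, transformId="tfm-0123456789abcdef")  # placeholder transform ID

# Keep one record per match group to complete the deduplication (simplified)
deduped = matched.toDF().dropDuplicates(["match_id"])

deduped.write.mode("overwrite").parquet("s3://example-bucket/deduplicated/")   # placeholder bucket
```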
A company generates reports from 30 tables in an Amazon Redshift data warehouse. The data source is an operational Amazon Aurora MySQL database that contains 100 tables. Currently, the company refreshes all data from Aurora to Redshift every hour, which causes delays in report generation. The company wants to replicate only the tables that the reports require and to reduce the data refresh latency.
Which combination of steps will meet these requirements with the LEAST operational overhead? (Select TWO.)
- A . Use AWS Database Migration Service (AWS DMS) to create a replication task. Select only the required tables.
- B . Create a database in Amazon Redshift that uses the integration.
- C . Create a zero-ETL integration in Amazon Aurora. Select only the required tables.
- D . Use query editor v2 in Amazon Redshift to access the data in Aurora.
- E . Create an AWS Glue job to transfer each required table. Run an AWS Glue workflow to initiate the jobs every 5 minutes.
A,C
Explanation:
Option A (AWS DMS): Lets you replicate only selected tables from Aurora to Redshift. It supports ongoing replication (CDC) and reduces unnecessary data transfers.
Option C (Zero-ETL with Aurora to Redshift): This new integration allows real-time, serverless, and low-maintenance data replication. It’s designed to reduce operational overhead drastically.
Other options introduce manual processing or inefficient query access (Option D). Glue workflows (Option E) add unnecessary complexity and lag for near-real-time needs.
“Zero-ETL integration between Amazon Aurora and Redshift enables real-time analytics without building complex pipelines.”
Reference: AWS Blog – Introducing Amazon Aurora zero-ETL integration with Redshift
“With AWS DMS, you can replicate only selected tables from your database and use ongoing replication to reduce refresh latency.”
Reference: AWS DMS Documentation – AWS DMS Use Cases
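For Option A, the table mapping below sketches how a DMS task replicates only the report tables; the schema and table names are placeholders, with one selection rule per required table or a wildcard pattern where the names share a prefix:

```python
import json

# Passed as the TableMappings argument of create_replication_task
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-orders",
            "object-locator": {"schema-name": "appdb", "table-name": "orders"},    # placeholder names
            "rule-action": "include",
        },
        {
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "include-report-tables",
            "object-locator": {"schema-name": "appdb", "table-name": "report_%"},  # wildcard example
            "rule-action": "include",
        },
    ]
}

print(json.dumps(table_mappings, indent=2))
```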
A transportation company wants to track vehicle movements by capturing geolocation records. The records are 10 bytes in size. The company receives up to 10,000 records every second. Data transmission delays of a few minutes are acceptable because of unreliable network conditions.
The transportation company wants to use Amazon Kinesis Data Streams to ingest the geolocation data. The company needs a reliable mechanism to send data to Kinesis Data Streams. The company needs to maximize the throughput efficiency of the Kinesis shards.
Which solution will meet these requirements in the MOST operationally efficient way?
- A . Kinesis Agent
- B . Kinesis Producer Library (KPL)
- C . Amazon Data Firehose
- D . Kinesis SDK
B
Explanation:
Problem Analysis:
The company ingests geolocation records (10 bytes each) at 10,000 records per second into Kinesis Data Streams.
Data transmission delays are acceptable, but the solution must maximize throughput efficiency.
Key Considerations:
The Kinesis Producer Library (KPL) batches records and uses aggregation to optimize shard throughput.
Efficiently handles high-throughput scenarios with minimal operational overhead.
Solution Analysis:
Option A: Kinesis Agent
Designed for file-based ingestion; not optimized for geolocation records.
Option B: KPL
Aggregates records into larger payloads, significantly improving shard throughput.
Suitable for applications generating small, high-frequency records.
Option C: Amazon Data Firehose
Firehose delivers data to destinations such as Amazon S3 or Amazon Redshift; it is not a producer mechanism for writing records into Kinesis Data Streams.
Option D: Kinesis SDK
The SDK lacks advanced features like aggregation, resulting in lower throughput efficiency.
Final Recommendation:
Use Kinesis Producer Library (KPL) for its built-in aggregation and batching capabilities.
Kinesis Producer Library (KPL) Overview
Best Practices for Amazon Kinesis
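A quick back-of-the-envelope check of why aggregation matters, using the documented per-shard limits of 1 MB/s and 1,000 records per second; the aggregate packing ratio below is an assumption for illustration:

```python
import math

RECORDS_PER_SECOND = 10_000
RECORD_SIZE_BYTES = 10

SHARD_BYTES_PER_SECOND = 1_000_000   # 1 MB/s ingest limit per shard
SHARD_RECORDS_PER_SECOND = 1_000     # 1,000 records/s limit per shard

# Without aggregation the record-count limit dominates: 10 shards for only ~100 KB/s of data
shards_without_agg = max(
    math.ceil(RECORDS_PER_SECOND / SHARD_RECORDS_PER_SECOND),
    math.ceil(RECORDS_PER_SECOND * RECORD_SIZE_BYTES / SHARD_BYTES_PER_SECOND),
)

# With KPL aggregation, e.g. ~1,000 geolocation records packed into one ~10 KB Kinesis record (assumption)
aggregated_records_per_second = math.ceil(RECORDS_PER_SECOND / 1_000)
shards_with_agg = max(
    math.ceil(aggregated_records_per_second / SHARD_RECORDS_PER_SECOND),
    math.ceil(RECORDS_PER_SECOND * RECORD_SIZE_BYTES / SHARD_BYTES_PER_SECOND),
)

print(shards_without_agg, shards_with_agg)   # 10 shards without aggregation vs 1 shard with it
```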
