Practice Free Amazon DEA-C01 Exam Online Questions
A company needs to collect logs for an Amazon RDS for MySQL database and make the logs available for audits. The logs must track each user that modifies data in the database or makes changes to the database instance.
Which solution will meet these requirements?
- A . Enable Amazon CloudWatch Logs. Create metric filters to monitor database changes and instance-level changes. Configure automated notification systems to send near real-time alerts for suspicious database operations.
- B . Configure an Amazon EventBridge rule to monitor database activity. Create an AWS Lambda function to process EventBridge events and store them in Amazon OpenSearch Service.
- C . Configure AWS CloudTrail to log API calls. Use Amazon CloudWatch Logs for basic monitoring. Use IAM policies to control access to the logs. Set up scheduled reporting for log audits.
- D . Enable and configure native Amazon RDS database audit logging. Enable Amazon CloudWatch Logs. Configure metric filters and alarms. Configure AWS CloudTrail audit logging.
D
Explanation:
The correct answer is D: Enable native Amazon RDS database audit logging with CloudWatch and CloudTrail integration.
Native RDS database audit logging captures SQL-level activities, including which user is modifying data, which is essential for meeting audit requirements.
Amazon CloudWatch Logs can be used to centralize and store logs.
CloudTrail logs API calls related to instance-level changes, ensuring all modifications are traceable.
This approach provides a comprehensive audit trail and supports compliance with security governance policies.
“Enable and configure native Amazon RDS database audit logging. Enable Amazon CloudWatch Logs. Configure metric filters and alarms. Configure AWS CloudTrail audit logging.”
Reference: DEA-C01 Official Study Guide, Chapter 7: Data Security and Governance. Verified on AWS Docs:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_LogAccess.Concepts.MySQL.html
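The configuration in option D can be sketched with boto3 request parameters. This is a minimal sketch, not the study guide's procedure: it assumes the instance already uses a custom option group, and the names `audit-og` and `orders-db` are hypothetical. The helpers only build the parameter dictionaries, so they run without AWS credentials; the commented lines show where the real calls would go.

```python
def audit_option_group_request(option_group_name, events="CONNECT,QUERY_DDL,QUERY_DML"):
    """Parameters for rds.modify_option_group that add the MariaDB audit plugin,
    which RDS for MySQL uses for native database audit logging."""
    return {
        "OptionGroupName": option_group_name,
        "OptionsToInclude": [
            {
                "OptionName": "MARIADB_AUDIT_PLUGIN",
                "OptionSettings": [
                    # Which activity to record: logins plus DDL and DML statements.
                    {"Name": "SERVER_AUDIT_EVENTS", "Value": events},
                ],
            }
        ],
        "ApplyImmediately": True,
    }


def audit_log_export_request(db_instance_id):
    """Parameters for rds.modify_db_instance that publish the audit log
    to Amazon CloudWatch Logs."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "CloudwatchLogsExportConfiguration": {"EnableLogTypes": ["audit"]},
        "ApplyImmediately": True,
    }


# Hypothetical resource names, for illustration only.
option_req = audit_option_group_request("audit-og")
export_req = audit_log_export_request("orders-db")

# With credentials configured, the calls would look like:
#   import boto3
#   rds = boto3.client("rds")
#   rds.modify_option_group(**option_req)
#   rds.modify_db_instance(**export_req)
```

Instance-level changes (for example, a `ModifyDBInstance` call) are recorded separately by CloudTrail, which is why the answer pairs the two services.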
A company runs an AWS Glue workflow every day to process time series data from an Amazon S3 bucket. The workflow loads the data into an Amazon Redshift Serverless table. The company observes that some of the jobs in the workflow occasionally fail.
A data engineer must receive a notification when the Redshift table does not contain the most recent data.
Which solution will meet this requirement in the MOST operationally efficient way?
- A . Configure an Amazon EventBridge Scheduler to run an Amazon Macie job to scan the Redshift table for data freshness. Configure Macie to notify an Amazon Simple Notification Service (Amazon SNS) topic when an AWS Glue job fails.
- B . Schedule an AWS Glue Data Quality job to check the freshness of the data. Create an Amazon EventBridge rule to notify an Amazon Simple Notification Service (Amazon SNS) topic when a data quality rule fails.
- C . Load AWS Glue job logs to an Amazon S3 bucket. Configure an Amazon CloudWatch alarm to send a notification when the job logs in the S3 bucket contain Job.State=FAILED.
- D . Create an Amazon CloudWatch dashboard that displays a metric named Failed AWS Glue Jobs that counts AWS Glue job failures during the previous day. Set a CloudWatch alarm to send a notification when the metric value exceeds zero.
B
Explanation:
Option B is the most operationally efficient because it checks the business requirement directly: whether the target table contains the most recent data, not merely whether a job failed. Monitoring only failures (Options C and D) can produce false positives (a job failure might not impact freshness) and false negatives (a job can succeed but still load stale or incomplete data). The study material emphasizes implementing data quality validation as part of the ETL process so data can be verified before or as it is stored, rather than relying only on pipeline execution status.
Using a data quality rule focused on freshness (for example, validating that a “max event timestamp” or “latest partition date” meets today’s expected value) lets the pipeline detect stale loads even when the workflow runs. Then, an EventBridge rule can route failures of that data quality check to SNS for immediate notification, keeping operations serverless and centralized. Macie (Option A) is designed for sensitive-data discovery/classification, not operational “freshness” checks on Redshift tables, so it adds unnecessary services and effort compared to a Glue-native data quality validation approach.
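The freshness check described above can be illustrated with a small sketch. The column name `event_ts` and the 24-hour threshold are assumptions, not values from the question. The DQDL string shows the shape of a `DataFreshness` rule; the `is_fresh` function mirrors the same logic in plain Python so it can be run locally. The EventBridge pattern reflects my understanding of the event that Glue Data Quality emits and should be verified against the current AWS documentation before use.

```python
from datetime import datetime, timedelta, timezone

# DQDL ruleset text for AWS Glue Data Quality; `event_ts` is a hypothetical column.
FRESHNESS_RULESET = 'Rules = [ DataFreshness "event_ts" <= 24 hours ]'

# Assumed EventBridge pattern for failed data quality evaluations
# (verify the source and detail-type values against the AWS docs).
FAILED_DQ_EVENT_PATTERN = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluation Results Available"],
    "detail": {"state": ["FAILED"]},
}


def is_fresh(latest_event_ts, now, max_age=timedelta(hours=24)):
    """Return True when the newest row is no older than max_age."""
    return now - latest_event_ts <= max_age


now = datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)

# The daily load ran: the newest row is 8 hours old.
ok_load = is_fresh(datetime(2024, 5, 2, 1, 0, tzinfo=timezone.utc), now)

# The daily load was missed: the newest row is 34 hours old.
missed_load = is_fresh(datetime(2024, 4, 30, 23, 0, tzinfo=timezone.utc), now)
```

The second case is the one that job-status monitoring can miss: every job may report success while the newest row in the table is still more than a day old.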
A data engineer configures a large number of AWS Glue jobs that all start up around the same time. All the jobs run for less than 1 hour in the same subnet of the same VPC. All the AWS Glue jobs run on a G.1X worker type.
Some of the jobs occasionally fail with the following error: “The specified subnet does not have enough free addresses to satisfy the request.”
What is the likely root cause of the error?
- A . There are not enough IP addresses in the subnet.
- B . The G.1X worker type cannot access the subnet.
- C . AWS Glue does not have the correct IAM permissions to add additional IP addresses to the subnet.
- D . There are not enough IP addresses in the VPC.
A
Explanation:
Option A is correct because when many AWS Glue jobs start at the same time in the same subnet, Glue needs to create network interfaces in that subnet. If the subnet does not have enough free private IP addresses available, jobs can fail with exactly this type of error. AWS service documentation for VPC-based managed data services consistently notes that if there is no available free IP address in a specified subnet, the service cannot create or add the required ENIs and the workload can fail or degrade.
This is a subnet-level exhaustion issue, not a VPC-wide issue. A VPC can still have free addresses in other subnets while the specific subnet chosen for the jobs is out of usable addresses. That makes D incorrect.
Option B is incorrect because G.1X is a valid Glue worker type and does not inherently prevent subnet access.
Option C is also incorrect because the error is about address availability, not IAM authorization. In practice, this problem is usually resolved by using a larger subnet or spreading workloads across additional subnets so enough IPs are available when multiple Glue jobs launch concurrently. This aligns with the exam’s focus on networking capacity planning for managed data services.
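The capacity planning mentioned above comes down to simple arithmetic. The sketch below assumes one private IP address per Glue worker (a simplification; the real count can differ by one or two per job) and uses hypothetical job counts and CIDR ranges. AWS reserves five addresses in every subnet, which the calculation accounts for.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Number of addresses in the subnet that workloads can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def addresses_needed(concurrent_jobs, workers_per_job):
    """Assumption: each worker attaches one network interface with one private IP."""
    return concurrent_jobs * workers_per_job


# Hypothetical workload: 30 jobs that start together, 10 G.1X workers each.
capacity = usable_addresses("10.0.1.0/24")
demand = addresses_needed(concurrent_jobs=30, workers_per_job=10)
fits = demand <= capacity

# A larger subnet from the same VPC range would absorb the same burst.
larger_capacity = usable_addresses("10.0.0.0/22")
```

Here a /24 subnet offers 251 usable addresses against a demand of 300, so some jobs fail even though the VPC as a whole may have plenty of free addresses elsewhere.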
A company currently stores all of its data in Amazon S3 by using the S3 Standard storage class.
A data engineer examined data access patterns to identify trends. During the first 6 months, most data files are accessed several times each day. Between 6 months and 2 years, most data files are accessed once or twice each month. After 2 years, data files are accessed only once or twice each year.
The data engineer needs to use an S3 Lifecycle policy to develop new data storage rules. The new storage solution must continue to provide high availability.
Which solution will meet these requirements in the MOST cost-effective way?
- A . Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- B . Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Flexible Retrieval after 2 years.
- C . Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
- D . Transition objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. Transfer objects to S3 Glacier Deep Archive after 2 years.
C
Explanation:
To achieve the most cost-effective storage solution, the data engineer needs an S3 Lifecycle policy that transitions objects to lower-cost storage classes as their access frequency drops. The storage classes must also provide high availability, which means they must be resilient to the loss of a single Availability Zone [1]. Therefore, the solution includes the following steps:
Transition objects to S3 Standard-Infrequent Access (S3 Standard-IA) after 6 months. S3 Standard-IA is designed for data that is accessed less frequently, but requires rapid access when needed. It offers the same high durability, throughput, and low latency as S3 Standard, but with a lower storage cost and a retrieval fee [2]. Therefore, it is suitable for data files that are accessed once or twice each month. S3 Standard-IA also provides high availability, as it stores data redundantly across multiple Availability Zones [1].
Transfer objects to S3 Glacier Deep Archive after 2 years. S3 Glacier Deep Archive is the lowest-cost storage class that offers secure and durable storage for data that is rarely accessed and can tolerate a 12-hour retrieval time. It is ideal for long-term archiving and digital preservation [3]. Therefore, it is suitable for data files that are accessed only once or twice each year. S3 Glacier Deep Archive also provides high availability, as it stores data across at least three geographically dispersed Availability Zones [1].
Optionally, expire objects when they are no longer needed. The scenario does not call for deletion, but the data engineer can add an expiration action to the same S3 Lifecycle policy to delete objects after a set period, which reduces storage cost and supports data retention policies.
Option C is the only solution that includes both transitions. Therefore, option C is the correct answer.
Option A is incorrect because it transitions objects to S3 One Zone-Infrequent Access (S3 One Zone-IA) after 6 months. S3 One Zone-IA is similar to S3 Standard-IA, but it stores data in a single Availability Zone. This means it has lower availability than S3 Standard-IA, and it is not resilient to the loss of that Availability Zone [1]. Therefore, it does not provide high availability as required.
Option B is incorrect because it transfers objects to S3 Glacier Flexible Retrieval after 2 years. S3 Glacier Flexible Retrieval is a storage class that offers secure and durable storage for data that is accessed infrequently and can tolerate a retrieval time of minutes to hours. It costs more than S3 Glacier Deep Archive, and data that is accessed only once or twice each year does not need its faster retrieval [3]. Therefore, it is not the most cost-effective option.
Option D is incorrect because it repeats the error of option A. It transitions objects to S3 One Zone-IA after 6 months, which does not provide high availability, even though its transfer to S3 Glacier Deep Archive after 2 years matches option C.
[1]: Amazon S3 storage classes – Amazon Simple Storage Service
[2]: Amazon S3 Standard-Infrequent Access (S3 Standard-IA) – Amazon Simple Storage Service
[3]: Amazon S3 Glacier and S3 Glacier Deep Archive – Amazon Simple Storage Service
[4]: Expiring objects – Amazon Simple Storage Service
[5]: Managing your storage lifecycle – Amazon Simple Storage Service
[6]: Examples of S3 Lifecycle configuration – Amazon Simple Storage Service
[7]: Amazon S3 Lifecycle further optimizes storage cost savings with new features – What’s New with AWS
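The transitions in option C map onto a single lifecycle rule. The following is a minimal sketch: it approximates 6 months as 180 days and 2 years as 730 days, and the rule ID and bucket name are hypothetical. The dictionary has the shape that boto3's `put_bucket_lifecycle_configuration` expects, but the code itself makes no AWS call.

```python
def tiering_rule(prefix=""):
    """Lifecycle rule: Standard -> Standard-IA -> Glacier Deep Archive."""
    return {
        "ID": "tier-by-age",              # hypothetical rule name
        "Filter": {"Prefix": prefix},     # empty prefix applies to every object
        "Status": "Enabled",
        "Transitions": [
            {"Days": 180, "StorageClass": "STANDARD_IA"},    # about 6 months
            {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},   # about 2 years
        ],
    }


lifecycle_configuration = {"Rules": [tiering_rule()]}

# With credentials configured, the call would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",          # hypothetical bucket
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

Both storage classes in the rule keep data in multiple Availability Zones, which is what separates this configuration from the One Zone-IA options.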
A company receives test results from testing facilities that are located around the world. The company stores the test results in millions of 1 KB JSON files in an Amazon S3 bucket. A data engineer needs to process the files, convert them into Apache Parquet format, and load them into Amazon Redshift tables. The data engineer uses AWS Glue to process the files, AWS Step Functions to orchestrate the processes, and Amazon EventBridge to schedule jobs.
The company recently added more testing facilities. The time required to process files is increasing.
The data engineer must reduce the data processing time.
Which solution will MOST reduce the data processing time?
- A . Use AWS Lambda to group the raw input files into larger files. Write the larger files back to Amazon S3. Use AWS Glue to process the files. Load the files into the Amazon Redshift tables.
- B . Use the AWS Glue dynamic frame file-grouping option to ingest the raw input files. Process the files. Load the files into the Amazon Redshift tables.
- C . Use the Amazon Redshift COPY command to move the raw input files from Amazon S3 directly into the Amazon Redshift tables. Process the files in Amazon Redshift.
- D . Use Amazon EMR instead of AWS Glue to group the raw input files. Process the files in Amazon EMR. Load the files into the Amazon Redshift tables.
B
Explanation:
Problem Analysis:
Millions of 1 KB JSON files in S3 are being processed and converted to Apache Parquet format using AWS Glue.
Processing time is increasing due to the additional testing facilities.
The goal is to reduce processing time while using the existing AWS Glue framework.
Key Considerations:
AWS Glue offers the dynamic frame file-grouping feature, which consolidates small files into larger, more efficient datasets during processing.
Grouping smaller files reduces overhead and speeds up processing.
Solution Analysis:
Option A: Lambda for File Grouping
Using Lambda to group files would add complexity and operational overhead. Glue already offers built-in grouping functionality.
Option B: AWS Glue Dynamic Frame File-Grouping
This option directly addresses the issue by grouping small files during Glue job execution.
Minimizes data processing time with no extra overhead.
Option C: Redshift COPY Command
COPY directly loads raw files but is not designed for pre-processing (conversion to Parquet).
Option D: Amazon EMR
While EMR is powerful, replacing Glue with EMR increases operational complexity.
Final Recommendation:
Use AWS Glue dynamic frame file-grouping for optimized data ingestion and processing.
Reference: AWS Glue Dynamic Frames; Optimizing Glue Performance
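File grouping is enabled through the S3 connection options of a dynamic frame. The sketch below builds those options and estimates how many read groups result; the bucket path, file count, and the 128 MB group size are assumptions chosen for illustration. The Glue call itself is shown only as a comment because it needs a Glue job runtime.

```python
import math


def grouped_s3_options(path, group_size_bytes):
    """Connection options that ask AWS Glue to read many small files as larger groups."""
    return {
        "paths": [path],
        "recurse": True,
        "groupFiles": "inPartition",          # group files within each S3 partition
        "groupSize": str(group_size_bytes),   # target group size in bytes, passed as a string
    }


def estimated_groups(file_count, file_size_bytes, group_size_bytes):
    """Rough count of read groups, ignoring partition boundaries."""
    return math.ceil(file_count * file_size_bytes / group_size_bytes)


GROUP_SIZE = 128 * 1024 * 1024  # 128 MB, an assumed target

options = grouped_s3_options("s3://example-bucket/raw/", GROUP_SIZE)  # hypothetical path
groups = estimated_groups(2_000_000, 1024, GROUP_SIZE)

# Inside a Glue job, the options would be used like this:
#   frame = glueContext.create_dynamic_frame.from_options(
#       connection_type="s3",
#       connection_options=options,
#       format="json",
#   )
```

Under these assumptions, two million 1 KB files collapse into roughly 16 read groups, which is where the reduction in per-file overhead comes from.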
A company maintains multiple extract, transform, and load (ETL) workflows that ingest data from the company’s operational databases into an Amazon S3 based data lake. The ETL workflows use AWS Glue and Amazon EMR to process data.
The company wants to improve the existing architecture to provide automated orchestration and to require minimal manual effort.
Which solution will meet these requirements with the LEAST operational overhead?
- A . AWS Glue workflows
- B . AWS Step Functions tasks
- C . AWS Lambda functions
- D . Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows
A
Explanation:
AWS Glue workflows are a feature of AWS Glue that enable you to create and visualize complex ETL pipelines using AWS Glue components, such as crawlers, jobs, and triggers. AWS Glue workflows provide automated orchestration and require minimal manual effort, as they handle dependency resolution, error handling, state management, and resource allocation for your ETL workflows. You can use AWS Glue workflows to ingest data from your operational databases into your Amazon S3 based data lake, and then use AWS Glue and Amazon EMR to process the data in the data lake. This solution will meet the requirements with the least operational overhead, as it leverages the serverless and fully managed nature of AWS Glue, and the scalability and flexibility of Amazon EMR [1][2].
The other options are not optimal for the following reasons:
B. AWS Step Functions tasks. AWS Step Functions is a service that lets you coordinate multiple AWS services into serverless workflows. You can use AWS Step Functions tasks to invoke AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Step Functions state machines to define the logic and flow of your workflows. However, this option would require more manual effort than AWS Glue workflows, as you would need to write JSON code to define your state machines, handle errors and retries, and monitor the execution history and status of your workflows [3].
C. AWS Lambda functions. AWS Lambda is a service that lets you run code without provisioning or managing servers. You can use AWS Lambda functions to trigger AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use AWS Lambda event sources and destinations to orchestrate the flow of your workflows. However, this option would also require more manual effort than AWS Glue workflows, as you would need to write code to implement your business logic, handle errors and retries, and monitor the invocation and execution of your Lambda functions. Moreover, AWS Lambda functions have limitations on the execution time, memory, and concurrency, which may affect the performance and scalability of your ETL workflows.
D. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) workflows. Amazon MWAA is a managed service that makes it easy to run open source Apache Airflow on AWS. Apache Airflow is a popular tool for creating and managing complex ETL pipelines using directed acyclic graphs (DAGs). You can use Amazon MWAA workflows to orchestrate AWS Glue and Amazon EMR jobs as part of your ETL workflows, and use the Airflow web interface to visualize and monitor your workflows. However, this option would have more operational overhead than AWS Glue workflows, as you would need to set up and configure your Amazon MWAA environment, write Python code to define your DAGs, and manage the dependencies and versions of your Airflow plugins and operators.
[1]: AWS Glue Workflows
[2]: AWS Glue and Amazon EMR
[3]: AWS Step Functions
[4]: AWS Lambda
[5]: Amazon Managed Workflows for Apache Airflow
A data engineer uses Amazon Kinesis Data Streams to ingest and process records that contain user behavior data from an application every day.
The data engineer notices that the data stream is experiencing throttling because hot shards receive much more data than other shards in the data stream.
How should the data engineer resolve the throttling issue?
- A . Use a random partition key to distribute the ingested records.
- B . Increase the number of shards in the data stream. Distribute the records across the shards.
- C . Limit the number of records that are sent each second by the producer to match the capacity of the stream.
- D . Decrease the size of the records that the producer sends to match the capacity of the stream.
A company stores customer data in an Amazon S3 bucket. The company must permanently delete all customer data that is older than 7 years.
- A . Configure an S3 Lifecycle policy to permanently delete objects that are older than 7 years.
- B . Use Amazon Athena to query the S3 bucket for objects that are older than 7 years. Configure Athena to delete the results.
- C . Configure an S3 Lifecycle policy to move objects that are older than 7 years to S3 Glacier Deep Archive.
- D . Configure an S3 Lifecycle policy to enable S3 Object Lock on all objects that are older than 7 years.
A
Explanation:
S3 Lifecycle policies automate data retention and deletion. By specifying an expiration rule for 7 years, objects older than that period are permanently deleted without manual intervention.
“To automatically delete aged data, configure an S3 Lifecycle rule with an expiration policy for objects older than the retention period.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
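An expiration rule for the 7-year retention period can be sketched as follows. The rule ID is hypothetical, and 7 years is approximated as 2,555 days (leap days ignored). The `NoncurrentVersionExpiration` entry matters only when bucket versioning is enabled: in a versioned bucket, expiration adds a delete marker, and older versions remain until they are expired separately.

```python
RETENTION_DAYS = 7 * 365  # approximation of 7 years; ignores leap days

expiration_rule = {
    "ID": "expire-after-7-years",       # hypothetical rule name
    "Filter": {"Prefix": ""},           # empty prefix applies to every object
    "Status": "Enabled",
    "Expiration": {"Days": RETENTION_DAYS},
    # Only needed for versioned buckets: remove noncurrent versions as well,
    # otherwise the data is hidden by a delete marker but not permanently deleted.
    "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
}

retention_configuration = {"Rules": [expiration_rule]}

# With credentials configured, the call would look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",        # hypothetical bucket
#       LifecycleConfiguration=retention_configuration,
#   )
```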
A data engineer needs to query data from multiple sources to generate an annual report. The analytics team uses Amazon Redshift for analysis. The data engineer needs to integrate Amazon Redshift data with 10 years of historical data from Amazon RDS for PostgreSQL and RDS for MySQL. All the databases are in the same VPC. The data engineer needs a solution that provides seamless data integration with Amazon Redshift.
Which solution will meet these requirements in the MOST cost-effective way?
- A . Use federated queries in Amazon Redshift to fetch data from RDS for PostgreSQL and RDS for MySQL. Apply the necessary transformations within Amazon Redshift.
- B . Use the SELECT INTO OUTFILE S3 statement to export data from Amazon RDS to Amazon S3. Use the COPY command to load the data into Amazon Redshift.
- C . Create a visual extract, transform, and load (ETL) job in AWS Glue to extract the required data and load it to Amazon Redshift.
- D . Use AWS Database Migration Service (AWS DMS) to ingest data from RDS for PostgreSQL and RDS for MySQL. Implement the necessary transformations within Amazon Redshift.
A
Explanation:
Option A is the most cost-effective because it enables seamless integration by allowing analysts to access and join external relational data directly from within Amazon Redshift for reporting, without building and operating a separate ingestion pipeline for a one-time (annual) workload. The study material frames Amazon Redshift as the centralized analytics store for complex queries and reporting, making it the natural place to perform the final transformations and analysis.
The alternatives introduce extra operational steps and recurring costs.
Option B requires exporting large historical datasets to S3 and then loading them into Redshift, which increases data movement and operational complexity.
Option C adds ETL job development, scheduling, retries, and monitoring overhead that is unnecessary if the primary goal is integrated querying for an annual report.
Option D (DMS) is best when you need ongoing migration or continuous replication; it is typically more operationally heavy than needed for “query across sources” use cases and still requires additional setup and maintenance.
Because all databases are already in the same VPC and the analytics platform is Redshift, federated querying provides the most direct, lowest-operations path to integrate and analyze the data where it already needs to be consumed.
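Federated queries are set up by creating one external schema in Amazon Redshift per source database. The helper below renders the DDL as a string so it can be inspected without a cluster; every identifier, hostname, and ARN in the example is hypothetical, and the sketch assumes the database credentials are stored in AWS Secrets Manager.

```python
def external_schema_ddl(schema, engine, database, host, port,
                        iam_role_arn, secret_arn, source_schema=None):
    """Render a CREATE EXTERNAL SCHEMA statement for a Redshift federated query."""
    engine = engine.upper()
    if engine not in {"POSTGRES", "MYSQL"}:
        raise ValueError(f"unsupported engine: {engine}")
    # The SCHEMA clause applies to PostgreSQL sources only.
    schema_clause = ""
    if engine == "POSTGRES" and source_schema:
        schema_clause = f" SCHEMA '{source_schema}'"
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM {engine}\n"
        f"DATABASE '{database}'{schema_clause}\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# Hypothetical values, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/redshift-federated"
SECRET = "arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-creds"

pg_ddl = external_schema_ddl("hist_pg", "postgres", "sales", "pg.example.internal",
                             5432, ROLE, SECRET, source_schema="public")
mysql_ddl = external_schema_ddl("hist_mysql", "mysql", "orders", "mysql.example.internal",
                                3306, ROLE, SECRET)
```

Once both schemas exist, a report query can join `hist_pg` and `hist_mysql` tables with local Redshift tables in ordinary SQL, with no copy of the historical data kept in Redshift.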
A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends.
The company must ensure that the application performs consistently during peak usage times.
Which solution will meet these requirements in the MOST cost-effective way?
- A . Increase the provisioned capacity to the maximum capacity that is currently present during peak load times.
- B . Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables.
- C . Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times.
- D . Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
C
Explanation:
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB offers two capacity modes for throughput capacity: provisioned and on-demand. In provisioned capacity mode, you specify the number of read and write capacity units per second that you expect your application to require. DynamoDB reserves the resources to meet your throughput needs with consistent performance. In on-demand capacity mode, you pay per request and DynamoDB scales the resources up and down automatically based on the actual workload. On-demand capacity mode is suitable for unpredictable workloads that can vary significantly over time [1].
The solution that meets the requirements in the most cost-effective way is to use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times and lower capacity during off-peak times.
This solution has the following advantages:
It allows you to optimize the cost and performance of your DynamoDB table by adjusting the provisioned capacity according to your predictable workload patterns. You can use scheduled scaling to specify the date and time for the scaling actions, and the new minimum and maximum capacity limits. For example, you can schedule higher capacity for every Monday morning and lower capacity for weekends [2].
It enables you to take advantage of the lower cost per unit of provisioned capacity mode compared to on-demand capacity mode. Provisioned capacity mode charges a flat hourly rate for the capacity you reserve, regardless of how much you use. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode [1].
It ensures that your application performs consistently during peak usage times by having enough capacity to handle the increased load. You can also use auto scaling to automatically adjust the provisioned capacity based on the actual utilization of your table, and set a target utilization percentage for your table or global secondary index. This way, you can avoid under-provisioning or over-provisioning your table [2].
Option A is incorrect because it suggests increasing the provisioned capacity to the maximum capacity that is currently present during peak load times. This solution has the following disadvantages:
It wastes money by paying for unused capacity during off-peak times. If you provision the same high capacity for all times, regardless of the actual workload, you are over-provisioning your table and paying for resources that you don’t need [1].
It does not account for possible changes in the workload patterns over time. If your peak load times increase or decrease in the future, you may need to manually adjust the provisioned capacity to match the new demand. This adds operational overhead and complexity to your application [2].
Option B is incorrect because it suggests dividing the table into two tables and provisioning each table with half of the provisioned capacity of the original table.
This solution has the following disadvantages:
It complicates the data model and the application logic by splitting the data into two separate tables. You need to ensure that the queries are evenly distributed across both tables, and that the data is consistent and synchronized between them. This adds extra development and maintenance effort to your application [3].
It does not solve the problem of adjusting the provisioned capacity according to the workload patterns. You still need to manually or automatically scale the capacity of each table based on the actual utilization and demand. This may result in under-provisioning or over-provisioning your tables [2].
Option D is incorrect because it suggests changing the capacity mode from provisioned to on-demand. This solution has the following disadvantages:
It may incur higher costs than provisioned capacity mode for predictable workloads. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode, as you can reserve the capacity you need at a lower rate [1].
It may not provide consistent performance during peak usage times, as on-demand capacity mode may take some time to scale up the resources to meet the sudden increase in demand. On-demand capacity mode uses adaptive capacity to handle bursts of traffic, but it may not be able to handle very large spikes or sustained high throughput. In such cases, you may experience throttling or increased latency.
[1]: Choosing the right DynamoDB capacity mode – Amazon DynamoDB
[2]: Managing throughput capacity automatically with DynamoDB auto scaling – Amazon DynamoDB
[3]: Best practices for designing and using partition keys effectively – Amazon DynamoDB
[4]: On-demand mode guidelines – Amazon DynamoDB
[5]: How to optimize Amazon DynamoDB costs – AWS Database Blog
[6]: DynamoDB adaptive capacity: How it works and how it helps – AWS Database Blog
[7]: Amazon DynamoDB pricing – Amazon Web Services (AWS)
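The scheduled scaling in option C can be sketched as two Application Auto Scaling scheduled actions. The table name, capacity numbers, and cron times below are assumptions for illustration; schedules are evaluated in UTC unless a time zone is supplied. The helper only builds the parameters for `put_scheduled_action`, and it assumes the table has already been registered as a scalable target.

```python
def scheduled_action(name, table, cron, min_capacity, max_capacity,
                     dimension="dynamodb:table:WriteCapacityUnits"):
    """Parameters for application-autoscaling put_scheduled_action on a DynamoDB table."""
    return {
        "ServiceNamespace": "dynamodb",
        "ScheduledActionName": name,
        "ResourceId": f"table/{table}",
        "ScalableDimension": dimension,
        # Fields: minutes hours day-of-month month day-of-week year
        "Schedule": f"cron({cron})",
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


# Hypothetical table and capacity values.
monday_peak = scheduled_action("monday-scale-up", "AppData", "0 5 ? * MON *", 500, 2000)
weekend_low = scheduled_action("weekend-scale-down", "AppData", "0 0 ? * SAT *", 5, 50)

# With credentials configured, and after register_scalable_target for the table:
#   import boto3
#   autoscaling = boto3.client("application-autoscaling")
#   autoscaling.put_scheduled_action(**monday_peak)
#   autoscaling.put_scheduled_action(**weekend_low)
```

Raising the minimum capacity shortly before the Monday spike is the key step, because target-tracking auto scaling on its own reacts only after utilization has already risen. Read capacity would need matching actions with the `dynamodb:table:ReadCapacityUnits` dimension.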
