Practice Free Amazon DEA-C01 Exam Online Questions
A company’s data processing pipeline uses AWS Glue jobs and AWS Glue Data Catalog. All AWS Glue jobs must run in a custom VPC inside a private subnet. The company uses a NAT gateway to support outbound connections.
A data engineer needs to use AWS Glue to migrate data from an on-premises PostgreSQL database to Amazon S3. There is no current network connection between AWS and the on-premises environment. However, the data engineer has updated the on-premises database to allow traffic from the custom VPC.
Which solution will meet these requirements?
- A . Create a JDBC connection in AWS Glue with the database JDBC URL, username, and password.
- B . Create a Simple Authentication and Security Layer (SASL) connection in AWS Glue to the on-premises database.
- C . Create a JDBC connection in AWS Glue with a security group that allows TCP traffic to and from itself.
- D . Create a JDBC connection in AWS Glue that uses a JDBC driver stored in Amazon S3. Retrieve the database URL, username, and password from AWS Secrets Manager.
D
Explanation:
When AWS Glue jobs run inside a private subnet, they must use secure and supported methods to access external databases. AWS Glue supports JDBC connections to on-premises databases, but best practices require secure credential management and explicit driver configuration.
Using a JDBC driver stored in Amazon S3 allows Glue to connect to PostgreSQL without relying on default drivers. Storing credentials in AWS Secrets Manager eliminates hard-coded credentials and enables secure rotation, aligning with AWS security best practices.
Simply specifying credentials inline is less secure and not recommended. SASL connections are not supported for PostgreSQL JDBC connections. Security groups alone do not establish connectivity or authentication.
Therefore, Option D is the correct and production-grade solution.
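As a rough illustration of Option D, the sketch below assembles the input for a Glue JDBC connection that points at a driver JAR in Amazon S3 and a Secrets Manager secret instead of an inline password. The function name, the sample ARNs, and all IDs are hypothetical. The property keys (JDBC_CONNECTION_URL, SECRET_ID, JDBC_DRIVER_JAR_URI, JDBC_DRIVER_CLASS_NAME) follow the Glue connection API as commonly documented and should be checked against the current reference. For simplicity the JDBC URL is passed in directly here rather than read from the secret.

```python
def build_glue_jdbc_connection(name, jdbc_url, secret_id, driver_jar_s3_uri,
                               subnet_id, security_group_ids, availability_zone):
    """Return a ConnectionInput-style dict for a Glue JDBC connection.

    Nothing is sent to AWS here; the dict is only a sketch of the request shape.
    """
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            # Credentials come from Secrets Manager, not from inline values.
            "SECRET_ID": secret_id,
            # Custom PostgreSQL driver stored in S3 (hypothetical path).
            "JDBC_DRIVER_JAR_URI": driver_jar_s3_uri,
            "JDBC_DRIVER_CLASS_NAME": "org.postgresql.Driver",
        },
        # The job runs in the custom VPC's private subnet behind the NAT gateway.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


connection_input = build_glue_jdbc_connection(
    name="onprem-postgres",                                   # hypothetical
    jdbc_url="jdbc:postgresql://db.example.internal:5432/customers",
    secret_id="onprem/postgres/credentials",                  # hypothetical
    driver_jar_s3_uri="s3://example-bucket/drivers/postgresql.jar",
    subnet_id="subnet-0123456789abcdef0",
    security_group_ids=["sg-0123456789abcdef0"],
    availability_zone="us-east-1a",
)
# With boto3 installed, this could be passed as:
#   boto3.client("glue").create_connection(ConnectionInput=connection_input)
```

Keeping the username and password out of the connection properties is the point of the design: rotating the secret does not require editing the connection.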
A company wants to migrate a data warehouse from Teradata to Amazon Redshift.
Which solution will meet this requirement with the LEAST operational effort?
- A . Use AWS Database Migration Service (AWS DMS) Schema Conversion to migrate the schema. Use AWS DMS to migrate the data.
- B . Use the AWS Schema Conversion Tool (AWS SCT) to migrate the schema. Use AWS Database Migration Service (AWS DMS) to migrate the data.
- C . Use AWS Database Migration Service (AWS DMS) to migrate the data. Use automatic schema conversion.
- D . Manually export the schema definition from Teradata. Apply the schema to the Amazon Redshift database. Use AWS Database Migration Service (AWS DMS) to migrate the data.
A company runs multiple applications on AWS. The company configured each application to output logs. The company wants to query and visualize the application logs in near real time.
Which solution will meet these requirements?
- A . Configure the applications to output logs to Amazon CloudWatch Logs log groups. Create an Amazon S3 bucket. Create an AWS Lambda function that runs on a schedule to export the required log groups to the S3 bucket. Use Amazon Athena to query the log data in the S3 bucket.
- B . Create an Amazon OpenSearch Service domain. Configure the applications to output logs to Amazon CloudWatch Logs log groups. Create an OpenSearch Service subscription filter for each log group to stream the data to OpenSearch. Create the required queries and dashboards in OpenSearch Service to analyze and visualize the data.
- C . Configure the applications to output logs to Amazon CloudWatch Logs log groups. Use CloudWatch log anomaly detection to query and visualize the log data.
- D . Update the application code to send the log data to Amazon QuickSight by using Super-fast, Parallel, In-memory Calculation Engine (SPICE). Create the required analyses and dashboards in QuickSight.
B
Explanation:
The optimal solution for near-real-time querying and visualization of logs is to integrate Amazon CloudWatch Logs with Amazon OpenSearch Service using subscription filters, which stream the logs directly into OpenSearch for querying and dashboarding:
“Use OpenSearch Service with CloudWatch Logs and create a subscription filter to stream log data in near real time into OpenSearch. Then use OpenSearch dashboards for visualization.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification, version 2
This approach offers low latency and avoids batch exports, unlike the scheduled Athena + S3 pattern.
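The subscription-filter setup in Option B can be sketched as one CloudWatch Logs request per log group. The helper below only builds the request parameters. The log group names and the destination ARN are hypothetical, and the sketch assumes the common arrangement in which the filter delivers events to a Lambda function that forwards them to the OpenSearch Service domain.

```python
def build_subscription_filters(log_group_names, destination_arn, filter_pattern=""):
    """Build one put_subscription_filter parameter dict per log group.

    An empty filter pattern matches every log event.
    """
    requests = []
    for log_group in log_group_names:
        # Derive a readable filter name from the log group name.
        filter_name = log_group.strip("/").replace("/", "-") + "-to-opensearch"
        requests.append({
            "logGroupName": log_group,
            "filterName": filter_name,
            "filterPattern": filter_pattern,
            "destinationArn": destination_arn,
        })
    return requests


filters = build_subscription_filters(
    ["/app/orders", "/app/payments"],  # hypothetical log groups
    "arn:aws:lambda:us-east-1:111122223333:function:logs-to-opensearch",
)
# With boto3 installed, each dict could be sent as:
#   boto3.client("logs").put_subscription_filter(**request)
```

Because the filter streams events as they arrive, no scheduled export job is involved, which is what separates this pattern from Option A.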
A company maintains an Amazon Redshift provisioned cluster that the company uses for extract, transform, and load (ETL) operations to support critical analysis tasks. A sales team within the company maintains a Redshift cluster that the sales team uses for business intelligence (BI) tasks.
The sales team recently requested access to the data that is in the ETL Redshift cluster so the team can perform weekly summary analysis tasks. The sales team needs to join data from the ETL cluster with data that is in the sales team’s BI cluster.
The company needs a solution that will share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution must minimize usage of the computing resources of the ETL cluster.
Which solution will meet these requirements?
- A . Set up the sales team BI cluster as a consumer of the ETL cluster by using Redshift data sharing.
- B . Create materialized views based on the sales team’s requirements. Grant the sales team direct access to the ETL cluster.
- C . Create database views based on the sales team’s requirements. Grant the sales team direct access to the ETL cluster.
- D . Unload a copy of the data from the ETL cluster to an Amazon S3 bucket every week. Create an Amazon Redshift Spectrum table based on the content of the ETL cluster.
A
Explanation:
Redshift data sharing enables you to share live data across different Redshift clusters without copying or moving the data. Data sharing provides secure and governed access to data while preserving the performance and concurrency benefits of Redshift. By setting up the sales team BI cluster as a consumer of the ETL cluster, the company can share the ETL cluster data with the sales team without interrupting the critical analysis tasks. The solution also minimizes the use of the ETL cluster’s computing resources, because queries against the shared data run on the consumer (BI) cluster’s compute and no second copy of the data is stored. The other options are either not feasible or not efficient. Creating materialized views or database views would require the sales team to have direct access to the ETL cluster, so their queries would compete with the critical analysis tasks for compute. Unloading a copy of the data from the ETL cluster to an Amazon S3 bucket every week would add latency and cost, and the copy would be stale between unloads.
Reference: Sharing data across Amazon Redshift clusters
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
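A minimal sketch of the producer and consumer SQL for the data sharing setup in Option A follows. The datashare name, schema, database name, and namespace IDs are hypothetical placeholders, and the helper only generates statement text. Running the statements on the respective clusters is left to whatever SQL client is in use.

```python
def producer_statements(share_name, schema, consumer_namespace):
    """SQL to run on the ETL (producer) cluster."""
    return [
        f"CREATE DATASHARE {share_name};",
        f"ALTER DATASHARE {share_name} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share_name} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share_name} TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statements(share_name, local_db, producer_namespace):
    """SQL to run on the sales team's BI (consumer) cluster."""
    return [
        f"CREATE DATABASE {local_db} FROM DATASHARE {share_name} "
        f"OF NAMESPACE '{producer_namespace}';",
    ]


# Placeholder namespace IDs; real values come from each cluster.
etl_sql = producer_statements("etl_share", "public", "<consumer-namespace-id>")
bi_sql = consumer_statements("etl_share", "etl_shared", "<producer-namespace-id>")
```

Once the consumer database exists, the BI cluster can join its own tables with the shared tables, and those queries run on the BI cluster's compute.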
A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance.
The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet.
Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)
- A . Turn on the public access setting for the DB instance.
- B . Update the security group of the DB instance to allow only Lambda function invocations on the database port.
- C . Configure the Lambda function to run in the same subnet that the DB instance uses.
- D . Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
- E . Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.
C,D
Explanation:
To enable the Lambda function to connect to the RDS DB instance privately without using the public internet, the best combination of steps is to configure the Lambda function to run in the same subnet that the DB instance uses, and attach the same security group to the Lambda function and the DB instance. This way, the Lambda function and the DB instance communicate within the same private network, and the self-referencing rule in the shared security group allows traffic between them on the database port. This solution has the least operational overhead, as it does not require making the DB instance publicly accessible or changing the network ACL of the subnet.
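The two selected steps can be sketched as a pair of request payloads: one that attaches the Lambda function to the DB instance's subnet and security group, and one that adds the self-referencing ingress rule. All names, IDs, and the port number are hypothetical, and the helpers only build dictionaries in the shape the Lambda and EC2 APIs expect.

```python
def build_lambda_vpc_config(function_name, subnet_ids, security_group_id):
    """Parameters for attaching the function to the private subnet (step C)."""
    return {
        "FunctionName": function_name,
        "VpcConfig": {
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": [security_group_id],
        },
    }


def build_self_referencing_rule(security_group_id, db_port):
    """Ingress rule whose source is the security group itself (step D)."""
    return {
        "GroupId": security_group_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": db_port,
            "ToPort": db_port,
            # Source is the same group, so any member can reach any other member.
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }],
    }


shared_sg = "sg-0123456789abcdef0"          # hypothetical
vpc_config = build_lambda_vpc_config("orders-writer", ["subnet-0abc1234"], shared_sg)
ingress_rule = build_self_referencing_rule(shared_sg, 3306)  # 3306 assumes MySQL
# With boto3 installed:
#   boto3.client("lambda").update_function_configuration(**vpc_config)
#   boto3.client("ec2").authorize_security_group_ingress(**ingress_rule)
```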
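The other options are not optimal for the following reasons: Option A exposes the DB instance to the public internet, which contradicts the requirement for private connectivity. Option B is not possible as described, because security group rules reference IP ranges or other security groups, not Lambda function invocations. Option E does not work, because network ACL rules are stateless and based on CIDR ranges, so they cannot reference themselves the way a security group rule can.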
A company has an on-premises PostgreSQL database that contains customer data. The company wants to migrate the customer data to an Amazon Redshift data warehouse. The company has established a VPN connection between the on-premises database and AWS.
The on-premises database is continuously updated. The company must ensure that the data in Amazon Redshift is updated as quickly as possible.
Which solution will meet these requirements?
- A . Use the pg_dump utility to generate a backup of the PostgreSQL database. Use the AWS Schema Conversion Tool (AWS SCT) to upload the backup to Amazon Redshift. Set up a cron job to perform a backup. Upload the backup to Amazon Redshift every night.
- B . Create an AWS Database Migration Service (AWS DMS) full-load task. Set Amazon Redshift as the target. Configure the task to use the change data capture (CDC) feature.
- C . Use the pg_dump utility to generate a backup of the PostgreSQL database. Upload the backup to an Amazon S3 bucket. Use the COPY command to import the data into Amazon Redshift.
- D . Create an AWS Database Migration Service (AWS DMS) full-load task. Set Amazon Redshift as the target. Configure the task to perform a full load of the database to Amazon Redshift every night.
B
Explanation:
Option B is the only solution that supports near real-time updates from a continuously changing source to Amazon Redshift. The requirement says the on-premises PostgreSQL database is “continuously updated” and the target must be updated “as quickly as possible.” Nightly full backups or nightly full loads (Options A and D) inherently introduce at least a daily lag, which violates the freshness requirement. Similarly, exporting with pg_dump and reloading with COPY (Option C) is a batch approach and does not provide continuous change propagation.
The study material explicitly positions AWS Database Migration Service (DMS) for database migrations and highlights that it supports both full-load and change data capture (CDC), and that CDC enables continuous replication so ongoing changes can be applied after the initial load.
Therefore, a DMS task configured for full load + CDC provides the fastest ongoing synchronization pattern: it performs the initial migration and then continuously captures and applies changes so Redshift stays current with minimal delay compared to periodic batch reloads.
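A DMS task of the kind described in Option B can be sketched as the parameter set below. The task identifier, the ARNs, and the schema name are hypothetical. The migration type value `full-load-and-cdc` is the setting that combines the initial load with ongoing change capture. The helper only builds the request and does not call AWS.

```python
import json


def build_cdc_task(task_id, source_arn, target_arn, instance_arn, schema="public"):
    """Parameters for a DMS task that does a full load and then applies CDC."""
    table_mappings = {
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-schema",
            # "%" is the DMS wildcard: include every table in the schema.
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    }
    return {
        "ReplicationTaskIdentifier": task_id,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        "MigrationType": "full-load-and-cdc",
        "TableMappings": json.dumps(table_mappings),
    }


task = build_cdc_task(
    "postgres-to-redshift",                                   # hypothetical
    "arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    "arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    "arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",
)
# With boto3 installed: boto3.client("dms").create_replication_task(**task)
```

Setting the migration type to `full-load` alone would reproduce the nightly batch behavior of Option D.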
A data engineer is using Amazon Athena to analyze sales data that is in Amazon S3. The data engineer writes a query to retrieve sales amounts for 2023 for several products from a table named sales_data. However, the query does not return results for all of the products that are in the sales_data table. The data engineer needs to troubleshoot the query to resolve the issue.
The data engineer’s original query is as follows:
SELECT product_name, sum(sales_amount)
FROM sales_data
WHERE year = 2023
GROUP BY product_name
How should the data engineer modify the Athena query to meet these requirements?
- A . Replace sum(sales_amount) with count(*) for the aggregation.
- B . Change WHERE year = 2023 to WHERE extract(year FROM sales_data) = 2023.
- C . Add HAVING sum(sales_amount) > 0 after the GROUP BY clause.
- D . Remove the GROUP BY clause.
B
Explanation:
The original query does not return results for all of the products because the year column in the sales_data table is not an integer, but a timestamp. Comparing that column directly with the integer 2023 therefore does not select the 2023 rows as intended, and products are missing from the result. To fix this, the data engineer should use the extract function to extract the year from the timestamp and compare it with 2023. This way, the query will return the correct results for all of the products in the sales_data table. The other options are either incorrect or irrelevant, as they do not address the root cause of the issue. Replacing sum with count does not change the filtering condition, adding a HAVING clause only filters groups after aggregation, and removing the GROUP BY clause does not solve the problem of missing products.
Reference: Troubleshooting JSON queries – Amazon Athena (Section: JSON related errors)
When I query a table in Amazon Athena, the TIMESTAMP result is empty (Section: Resolution)
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide (Chapter 7, page 197)
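The effect of filtering on an extracted year, rather than comparing a timestamp to a number, can be reproduced locally. The sketch below uses Python's built-in SQLite as a stand-in for Athena, so the syntax differs: SQLite uses strftime, whereas the Athena form would be along the lines of WHERE extract(year FROM sale_ts) = 2023. The column name sale_ts and the sample rows are hypothetical.

```python
import sqlite3


def run_queries():
    """Return (rows from the naive filter, rows from the year-extraction filter)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_data (product_name TEXT, sale_ts TEXT, sales_amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_data VALUES (?, ?, ?)",
        [
            ("widget", "2023-03-01 10:00:00", 10.0),
            ("widget", "2023-07-15 09:30:00", 5.0),
            ("gadget", "2023-11-20 12:00:00", 7.5),
            ("gizmo", "2022-12-31 23:59:59", 3.0),
        ],
    )
    # Naive filter: compares the whole timestamp with the number 2023.
    naive = conn.execute(
        "SELECT product_name, sum(sales_amount) FROM sales_data "
        "WHERE sale_ts = 2023 GROUP BY product_name ORDER BY product_name"
    ).fetchall()
    # Corrected filter: extract the year first, then compare.
    fixed = conn.execute(
        "SELECT product_name, sum(sales_amount) FROM sales_data "
        "WHERE CAST(strftime('%Y', sale_ts) AS INTEGER) = 2023 "
        "GROUP BY product_name ORDER BY product_name"
    ).fetchall()
    conn.close()
    return naive, fixed


naive_rows, fixed_rows = run_queries()
print(naive_rows)   # []
print(fixed_rows)   # [('gadget', 7.5), ('widget', 15.0)]
```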
A company uses Amazon S3 and AWS Glue Data Catalog to manage a data lake that contains contact information for customers. The company uses PySpark and AWS Glue jobs with a Dynamic Frame to run a workflow that processes data within the data lake.
A data engineer notices that the workflow is generating errors as a result of how customer postal codes are stored in the data lake. Some postal codes include unnecessary numbers or invalid characters.
The data engineer needs a solution to address the errors and correct the postal codes in the data lake.
Which solution will meet these requirements?
- A . Create a schema definition for PySpark that matches the format the processing workflow requires for postal codes. Pass the schema to the Dynamic Frame during processing.
- B . Use AWS Glue workflow properties to allow job state sharing. Configure the AWS Glue jobs to read values from the postal code column by using the properties from a previously successful run of the jobs.
- C . Configure the column Push Down Predicate setting and the catalog Partition Predicate settings for the postal code column in the Dynamic Frame.
- D . Set the Dynamic Frame additional_options parameter useS3ListImplementation to True.
A
Explanation:
Option A is the only choice that directly addresses the root cause: inconsistent postal-code formatting causing processing errors. In AWS Glue, a Dynamic Frame can encounter issues when incoming data contains unexpected types or malformed values. Providing an explicit PySpark schema forces the postal code column to be interpreted consistently (for example, as a string with the expected structure), which prevents downstream steps from failing because of unexpected characters or mixed representations. After enforcing a consistent schema, the Glue job can standardize values (such as trimming extra digits, removing invalid characters, and normalizing case) and then write the corrected output back to Amazon S3 so the data lake is fixed at the source.
The document reinforces that AWS Glue is the managed ETL service used to transform data and move it between stores, which is exactly what is needed to correct bad values and persist clean outputs back into the lake. It also highlights that AWS Glue can perform validation as part of the ETL process, supporting the idea that data can be checked and corrected during processing rather than allowing bad values to break the workflow.
Options B, C, and D do not fix data quality or formatting issues; they relate to workflow state sharing, partition filtering, or internal implementation details, not cleansing invalid postal codes.
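The cleansing step that follows schema enforcement might look like the function below. It is a sketch under a loud assumption: postal codes are treated as five-digit US ZIP codes, which the question does not state. In a Glue job, a function like this could be applied per record, for example through a Map transform on the Dynamic Frame or a Spark UDF. Here it is shown as plain Python so it can be run on its own.

```python
import re


def normalize_postal_code(raw):
    """Return a five-digit postal code string, or None if it cannot be recovered.

    Assumption (not from the question): valid codes are five-digit US ZIP codes.
    """
    if raw is None:
        return None
    # Treat the value as a string and drop every non-digit character.
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) < 5:
        return None          # too short to be a valid code
    return digits[:5]        # drop unnecessary trailing digits


samples = ["12345-6789", " 902 10#", "12a", None]
cleaned = [normalize_postal_code(value) for value in samples]
print(cleaned)  # ['12345', '90210', None, None]
```

Returning None for unrecoverable values, rather than raising, lets the job route bad records to a separate output instead of failing the whole workflow.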
A company receives a daily file that contains customer data in .xls format. The company stores the file in Amazon S3. The daily file is approximately 2 GB in size.
A data engineer concatenates the column in the file that contains customer first names and the column that contains customer last names. The data engineer needs to determine the number of distinct customers in the file.
Which solution will meet this requirement with the LEAST operational effort?
- A . Create and run an Apache Spark job in an AWS Glue notebook. Configure the job to read the S3 file and calculate the number of distinct customers.
- B . Create an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file. Run SQL queries from Amazon Athena to calculate the number of distinct customers.
- C . Create and run an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers.
- D . Use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers.
D
Explanation:
AWS Glue DataBrew is a visual data preparation tool that allows you to clean, normalize, and transform data without writing code. You can use DataBrew to create recipes that define the steps to apply to your data, such as filtering, renaming, splitting, or aggregating columns. You can also use DataBrew to run jobs that execute the recipes on your data sources, such as Amazon S3, Amazon Redshift, or Amazon Aurora. DataBrew integrates with AWS Glue Data Catalog, which is a centralized metadata repository for your data assets [1].
The solution that meets the requirement with the least operational effort is to use AWS Glue DataBrew to create a recipe that uses the COUNT_DISTINCT aggregate function to calculate the number of distinct customers. This solution has the following advantages:
It does not require you to write any code, as DataBrew provides a graphical user interface that lets you explore, transform, and visualize your data. You can use DataBrew to concatenate the columns that contain customer first names and last names, and then use the COUNT_DISTINCT aggregate function to count the number of unique values in the resulting column [2].
It does not require you to provision, manage, or scale any servers, clusters, or notebooks, as DataBrew is a fully managed service that handles all the infrastructure for you. DataBrew can automatically scale up or down the compute resources based on the size and complexity of your data and recipes [1].
It does not require you to create or update any AWS Glue Data Catalog entries, as DataBrew can automatically create and register the data sources and targets in the Data Catalog. DataBrew can also use the existing Data Catalog entries to access the data in S3 or other sources [3].
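To make the computation concrete, the sketch below shows what the recipe does in DataBrew without code: concatenate first and last names, then count the distinct results. The field names and sample rows are hypothetical. Trimming whitespace before concatenation is an added assumption, and matching is case-sensitive.

```python
def count_distinct_customers(rows):
    """Count distinct 'first last' names across a list of row dictionaries."""
    full_names = {
        f"{row['first_name'].strip()} {row['last_name'].strip()}"
        for row in rows
    }
    return len(full_names)


rows = [
    {"first_name": "Ana", "last_name": "Lopez"},
    {"first_name": "Ana", "last_name": "Lopez"},    # exact duplicate
    {"first_name": "Ben", "last_name": "Kim"},
    {"first_name": "Ana ", "last_name": "Lopez"},   # duplicate after trimming
]
print(count_distinct_customers(rows))  # 2
```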
Option A is incorrect because it suggests creating and running an Apache Spark job in an AWS Glue notebook. This solution has the following disadvantages:
It requires you to write code, as AWS Glue notebooks are interactive development environments that allow you to write, test, and debug Apache Spark code using Python or Scala. You need to use the Spark SQL or the Spark DataFrame API to read the S3 file and calculate the number of distinct customers.
It requires you to provision and manage a development endpoint, which is a serverless Apache Spark environment that you can connect to your notebook. You need to specify the type and number of workers for your development endpoint, and monitor its status and metrics.
It requires you to create or update the AWS Glue Data Catalog entries for the S3 file, either manually or using a crawler. You need to use the Data Catalog as a metadata store for your Spark job, and specify the database and table names in your code.
Option B is incorrect because it suggests creating an AWS Glue crawler to create an AWS Glue Data Catalog of the S3 file, and running SQL queries from Amazon Athena to calculate the number of distinct customers. This solution has the following disadvantages:
It requires you to create and run a crawler, which is a program that connects to your data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the Data Catalog. You need to specify the data store, the IAM role, the schedule, and the output database for your crawler.
It requires you to write SQL queries, as Amazon Athena is a serverless interactive query service that allows you to analyze data in S3 using standard SQL. You need to use Athena to concatenate the columns that contain customer first names and last names, and then use the COUNT(DISTINCT) aggregate function to count the number of unique values in the resulting column.
Option C is incorrect because it suggests creating and running an Apache Spark job in Amazon EMR Serverless to calculate the number of distinct customers. This solution has the following disadvantages:
It requires you to write code, as Amazon EMR Serverless is a service that allows you to run Apache Spark jobs on AWS without provisioning or managing any infrastructure. You need to use the Spark SQL or the Spark DataFrame API to read the S3 file and calculate the number of distinct customers.
It requires you to create and manage an Amazon EMR Serverless application, which is the managed environment that runs your Spark jobs. You need to specify settings such as the application name, the release version, the job runtime IAM role, and any VPC and subnet configuration, and monitor the status and metrics of the application and its job runs.
It requires you to create or update the AWS Glue Data Catalog entries for the S3 file, either manually or using a crawler. You need to use the Data Catalog as a metadata store for your Spark job, and specify the database and table names in your code.
[1]: AWS Glue DataBrew – Features
[2]: Working with recipes – AWS Glue DataBrew
[3]: Working with data sources and data targets – AWS Glue DataBrew
[4]: AWS Glue notebooks – AWS Glue
[5]: Development endpoints – AWS Glue
[6]: Populating the AWS Glue Data Catalog – AWS Glue
[7]: Crawlers – AWS Glue
[8]: Amazon Athena – Features
[9]: Amazon EMR Serverless – Features
[10]: Creating an Amazon EMR Serverless cluster – Amazon EMR
[11]: Using the AWS Glue Data Catalog with Amazon EMR Serverless – Amazon EMR
A retail company needs to implement a solution to capture data updates from multiple Amazon Aurora MySQL databases. The company needs to make the updates available for analytics in near real time. The solution must be serverless and require minimal maintenance.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Set up AWS Database Migration Service (AWS DMS) tasks that perform schema conversions for each database. Load the changes into Amazon Redshift Serverless.
- B . Use Amazon Managed Streaming for Apache Kafka (Amazon MSK) Connect with Debezium connectors to load data into Amazon Redshift Serverless.
- C . Use AWS Database Migration Service (AWS DMS) to set up binary log replication to Amazon Kinesis Data Streams. Load the data into Amazon Redshift Serverless after schema conversion.
- D . Use Aurora zero-ETL integrations with Amazon Redshift Serverless for each database to load Aurora MySQL changes in Amazon Redshift Serverless.
D
Explanation:
Aurora zero-ETL integration with Amazon Redshift Serverless is specifically designed to provide near real-time analytics on Aurora transactional data without the need to build or manage data pipelines. This integration continuously replicates data changes from Aurora MySQL into Redshift Serverless automatically.
The solution is fully serverless, requires no infrastructure provisioning, and eliminates the need for schema conversion, replication tasks, or streaming frameworks. Data is kept in sync with low latency, making it immediately available for analytical queries in Redshift Serverless.
AWS DMS-based solutions require task configuration, monitoring, and maintenance. Amazon MSK with Debezium introduces significant operational complexity and is not serverless. Kinesis-based replication pipelines require additional services and ongoing operational management.
Aurora zero-ETL integration is the lowest-maintenance, highest-efficiency solution and is explicitly recommended by AWS for this use case.
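Setting up one integration per database, as Option D describes, can be sketched as a list of requests, one for each Aurora MySQL source. The ARNs and the naming scheme are hypothetical. The parameter names (IntegrationName, SourceArn, TargetArn) follow the RDS create-integration API as commonly documented and should be verified against the current reference. The helper only builds the requests.

```python
def build_zero_etl_integrations(aurora_cluster_arns, redshift_namespace_arn):
    """Build one create-integration request per Aurora MySQL source cluster."""
    return [
        {
            "IntegrationName": f"zero-etl-{index}",   # hypothetical naming scheme
            "SourceArn": cluster_arn,
            # Every source targets the same Redshift Serverless namespace.
            "TargetArn": redshift_namespace_arn,
        }
        for index, cluster_arn in enumerate(aurora_cluster_arns, start=1)
    ]


integrations = build_zero_etl_integrations(
    [
        "arn:aws:rds:us-east-1:111122223333:cluster:orders-db",
        "arn:aws:rds:us-east-1:111122223333:cluster:inventory-db",
    ],
    "arn:aws:redshift-serverless:us-east-1:111122223333:namespace/example-namespace",
)
# With boto3 installed, each dict could be sent as:
#   boto3.client("rds").create_integration(**request)
```

There are no replication tasks, connectors, or stream consumers in this sketch, which is where the lower operational overhead compared with Options A, B, and C comes from.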
