Practice Free Amazon DEA-C01 Exam Online Questions
A company has an Amazon S3-based data lake. The data lake contains datasets that belong to multiple departments. The data lake ingests millions of customer records each day.
A data engineer needs to design an access and storage solution that allows departments to access only the subset of the company’s dataset that each department requires. The solution must follow the principle of least privilege.
Which solution will meet these requirements with the LEAST operational effort?
- A . Define IAM policies and IAM roles for each department. Specify the S3 access paths from the data lake that each team can access.
- B . Set up Amazon Redshift and Amazon Redshift Spectrum as the primary entry points for the data lake. Define an IAM role that Amazon Redshift can assume. Configure the IAM role to grant access to the data that is in Amazon S3.
- C . Set up AWS Lake Formation. Assign LF-Tags to AWS Glue Data Catalog resources. Enable Lake Formation tag-based access control (LF-TBAC).
- D . Deploy an Amazon RDS for PostgreSQL database that has the aws_s3 extension installed. Configure AWS Step Functions events to invoke an AWS Lambda function to sync the data lake with the database.
C
Explanation:
AWS Lake Formation is specifically designed to simplify fine-grained access control for data lakes stored in Amazon S3. By using Lake Formation tag-based access control (LF-TBAC), administrators can define access policies once and apply them dynamically based on tags rather than managing individual IAM policies.
LF-Tags can be assigned to databases, tables, and columns in the AWS Glue Data Catalog. Departments are granted permissions based on tags, ensuring that each department can access only the data it is authorized to view. This approach scales efficiently as datasets and departments grow, which is critical for data lakes ingesting millions of records daily.
Managing IAM policies per department is operationally complex and error-prone. Redshift-based access centralization limits flexibility and introduces unnecessary infrastructure. Syncing data into an RDS database adds cost, latency, and maintenance overhead.
Lake Formation provides centralized governance, auditing, and least-privilege enforcement with minimal administrative effort, making it the optimal solution.
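As a rough illustration of the tag-based grant described above, the following Python sketch builds the arguments for a Lake Formation grant that gives one department SELECT access to every table that carries a matching LF-Tag. The role ARN, the tag key "department", and the tag value "marketing" are hypothetical, and the sketch only builds the request; it does not call AWS.

```python
# Sketch only: builds keyword arguments for a Lake Formation
# grant_permissions request. It does not contact AWS.

def lf_tag_grant(principal_arn, tag_key, tag_values, permissions=("SELECT",)):
    """Return kwargs granting `permissions` on all tables matching the LF-Tag."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
    }


# Hypothetical role and tag values, for illustration only.
grant = lf_tag_grant(
    "arn:aws:iam::111122223333:role/marketing-analyst",
    "department",
    ["marketing"],
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Because the grant refers to a tag rather than to named tables, a new table tagged department=marketing would be covered without another grant.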
A company uses an Amazon QuickSight dashboard to monitor usage of one of the company’s applications. The company uses AWS Glue jobs to process data for the dashboard. The company stores the data in a single Amazon S3 bucket. The company adds new data every day.
A data engineer discovers that dashboard queries are becoming slower over time. The data engineer determines that the root cause of the slowing queries is long-running AWS Glue jobs.
Which actions should the data engineer take to improve the performance of the AWS Glue jobs? (Choose two.)
- A . Partition the data that is in the S3 bucket. Organize the data by year, month, and day.
- B . Increase the AWS Glue instance size by scaling up the worker type.
- C . Convert the AWS Glue schema to the DynamicFrame schema class.
- D . Adjust AWS Glue job scheduling frequency so the jobs run half as many times each day.
- E . Modify the IAM role that grants access to AWS Glue to grant access to all S3 features.
A,B
Explanation:
Partitioning the data in the S3 bucket can improve the performance of AWS Glue jobs by reducing the amount of data that needs to be scanned and processed. By organizing the data by year, month, and day, the AWS Glue job can use partition pruning to filter out irrelevant data and only read the data that matches the query criteria. This can speed up the data processing and reduce the cost of running the AWS Glue job.
Increasing the AWS Glue instance size by scaling up the worker type can also improve the performance of AWS Glue jobs by providing more memory and CPU resources for the Spark execution engine. This can help the AWS Glue job handle larger data sets and complex transformations more efficiently.
The other options are either incorrect or irrelevant, as they do not affect the performance of the AWS Glue jobs. Converting the AWS Glue schema to the DynamicFrame schema class does not improve the performance, but rather provides additional functionality and flexibility for data manipulation. Adjusting the AWS Glue job scheduling frequency does not improve the performance, but rather reduces the frequency of data updates. Modifying the IAM role that grants access to AWS Glue does not improve the performance, but rather affects the security and permissions of the AWS Glue service.
Reference: Optimising Glue Scripts for Efficient Data Processing: Part 1 (Section: Partitioning Data in S3)
Best practices to optimize cost and performance for AWS Glue streaming ETL jobs (Section: Development tools)
Monitoring with AWS Glue job run insights (Section: Requirements)
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide (Chapter 5, page 133)
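To make the year, month, and day layout from option A concrete, here is a small Python sketch that derives a Hive-style S3 prefix for a given date and a matching filter expression of the kind a Glue job could pass as a push-down predicate. The bucket name and base path are hypothetical, and the predicate assumes the partition columns are registered as zero-padded strings.

```python
from datetime import date


def partition_prefix(base, day):
    """Return a Hive-style year/month/day prefix under `base`."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


def pushdown_predicate(day):
    """Return a filter that selects only the partition for `day`.

    Assumes the partition columns are strings named year, month, and day.
    """
    return (
        f"year == '{day.year:04d}' and "
        f"month == '{day.month:02d}' and "
        f"day == '{day.day:02d}'"
    )


# Hypothetical bucket and path, for illustration only.
prefix = partition_prefix("s3://example-usage-data/events", date(2024, 3, 7))
predicate = pushdown_predicate(date(2024, 3, 7))

print(prefix)     # s3://example-usage-data/events/year=2024/month=03/day=07/
print(predicate)  # year == '2024' and month == '03' and day == '07'
```

A job that filters on these columns reads one day's prefix instead of listing and scanning the whole bucket.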
A gaming company uses AWS Glue to perform read and write operations on Apache Iceberg tables for real-time streaming data. The data in the Iceberg tables is stored in Apache Parquet format. The company is experiencing slow query performance.
Which solutions will improve query performance? (Select TWO)
- A . Use AWS Glue Data Catalog to generate column-level statistics for the Iceberg tables on a schedule.
- B . Use AWS Glue Data Catalog to automatically compact the Iceberg tables.
- C . Use AWS Glue Data Catalog to automatically optimize indexes for the Iceberg tables.
- D . Use AWS Glue Data Catalog to enable copy-on-write for the Iceberg tables.
- E . Use AWS Glue Data Catalog to generate views for the Iceberg tables.
A,B
Explanation:
Apache Iceberg query performance depends heavily on metadata optimization and file layout efficiency. Generating column-level statistics allows query engines such as AWS Glue, Amazon Athena, and Amazon Redshift to perform predicate pushdown and data skipping. These statistics help avoid scanning unnecessary Parquet files and significantly reduce query execution time.
Real-time streaming workloads often generate many small files, which degrades query performance. Table compaction consolidates small Parquet files into fewer, larger files, improving scan efficiency and reducing metadata overhead. Iceberg supports compaction as a core optimization technique, and Glue can schedule these operations automatically.
Iceberg does not use traditional indexes, so index optimization is not applicable. Copy-on-write affects write semantics and consistency, not query speed. Views do not improve physical query performance.
Therefore, generating statistics and compacting files are the two most effective optimizations.
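The effect of compaction on file counts can be shown with a toy Python sketch. It is not how AWS Glue or Iceberg implements compaction; it only groups file sizes greedily toward a target size to show how many files a reader would have to open before and after. The 8 MB file size and the 512 MB target are assumptions chosen for the example.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group files, in order, into batches of at most `target_mb`."""
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 200 small files of 8 MB written by a streaming job.
small_files = [8] * 200
plan = plan_compaction(small_files, target_mb=512)

print(len(small_files), "files before,", len(plan), "files after")
# 200 files before, 4 files after
```

Each output group stands for one rewritten Parquet file, so a full scan in this example opens 4 files instead of 200.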
A company processes 500 GB of audience and advertising data daily, storing CSV files in Amazon S3 with schemas registered in AWS Glue Data Catalog. They need to convert these files to Apache Parquet format and store them in an S3 bucket.
The solution requires a long-running workflow with 15 GiB memory capacity to process the data concurrently, followed by a correlation process that begins only after the first two processes complete.
- A . Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the workflow by using AWS Glue. Configure AWS Glue to begin the third process after the first two processes have finished.
- B . Use Amazon EMR to run each process in the workflow. Create an Amazon Simple Queue Service (Amazon SQS) queue to handle messages that indicate the completion of the first two processes. Configure an AWS Lambda function to process the SQS queue by running the third process.
- C . Use AWS Glue workflows to run the first two processes in parallel. Ensure that the third process starts after the first two processes have finished.
- D . Use AWS Step Functions to orchestrate a workflow that uses multiple AWS Lambda functions. Ensure that the third process starts after the first two processes have finished.
C
Explanation:
AWS Glue Workflows can coordinate multiple ETL jobs and triggers. They support parallel execution and sequential dependencies, which is ideal for concurrent data processing followed by correlation steps, all with minimal operational overhead.
“Use AWS Glue Workflows to orchestrate multiple ETL jobs in sequence or in parallel, supporting conditional triggers and dependency management.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2
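The dependency in option C, where a third job starts only after two parallel jobs succeed, maps to a conditional trigger inside a Glue workflow. The Python sketch below builds the arguments for such a trigger. The workflow name, job names, and trigger name are hypothetical, and the sketch only builds the request; it does not call AWS.

```python
# Sketch only: builds keyword arguments for a Glue create_trigger request.

def fan_in_trigger(workflow, upstream_jobs, downstream_job,
                   name="start-correlation"):
    """Return kwargs for a trigger that fires after ALL upstream jobs succeed."""
    return {
        "Name": name,
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Logical": "AND",  # every condition below must hold
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": job,
                    "State": "SUCCEEDED",
                }
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": downstream_job}],
    }


# Hypothetical workflow and job names, for illustration only.
trigger = fan_in_trigger(
    "daily-parquet-conversion",
    ["convert-audience", "convert-advertising"],
    "correlate-datasets",
)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("glue").create_trigger(**trigger)
```

The two upstream jobs would be started together by a separate on-demand or scheduled trigger in the same workflow, which is what lets them run in parallel.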
A company uses Amazon Redshift as a data warehouse solution. One of the datasets that the company stores in Amazon Redshift contains data for a vendor.
Recently, the vendor asked the company to transfer the vendor’s data into the vendor’s Amazon S3 bucket once each week.
Which solution will meet this requirement?
- A . Create an AWS Lambda function to connect to the Redshift data warehouse. Configure the Lambda function to use the Redshift COPY command to copy the required data to the vendor’s S3 bucket on a schedule.
- B . Create an AWS Glue job to connect to the Redshift data warehouse. Configure the AWS Glue job to use the Redshift UNLOAD command to load the required data to the vendor’s S3 bucket on a schedule.
- C . Use the Amazon Redshift data sharing feature. Set the vendor’s S3 bucket as the destination. Configure the source to be a custom SQL query that selects the required data.
- D . Configure Amazon Redshift Spectrum to use the vendor’s S3 bucket as destination. Enable data querying in both directions.
B
Explanation:
The Redshift UNLOAD command is specifically designed to export query results to Amazon S3, and AWS Glue can orchestrate this as part of a scheduled job. This is the cleanest and most appropriate approach for recurring weekly data transfers:
“Use the Redshift UNLOAD command with AWS Glue to export data to Amazon S3. This pattern enables routine exports of selected data to external locations.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2
This avoids the complexity of Redshift Spectrum and the misuse of the COPY command, which loads data into Redshift rather than exporting it.
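As an illustration of what the scheduled job would run, the following Python sketch assembles an UNLOAD statement. The table, column, vendor name, bucket, and role ARN are hypothetical. Single quotes inside the query are doubled because UNLOAD takes the query as a quoted string.

```python
def build_unload(select_sql, s3_prefix, iam_role_arn):
    """Return a Redshift UNLOAD statement that exports `select_sql` as Parquet."""
    # UNLOAD wraps the query in single quotes, so inner quotes are doubled.
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical query, bucket, and role, for illustration only.
statement = build_unload(
    "SELECT * FROM sales WHERE vendor_name = 'acme'",
    "s3://example-vendor-bucket/weekly/",
    "arn:aws:iam::111122223333:role/redshift-unload",
)
print(statement)
```

Writing to a bucket in the vendor's account would also require that the role is allowed to write there, which is outside the scope of this sketch.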
A company has a data lake in Amazon S3. The company collects AWS CloudTrail logs for multiple applications. The company stores the logs in the data lake, catalogs the logs in AWS Glue, and partitions the logs based on the year. The company uses Amazon Athena to analyze the logs.
Recently, customers reported that a query on one of the Athena tables did not return any data. A data engineer must resolve the issue.
Which combination of troubleshooting steps should the data engineer take? (Select TWO.)
- A . Confirm that Athena is pointing to the correct Amazon S3 location.
- B . Increase the query timeout duration.
- C . Use the MSCK REPAIR TABLE command.
- D . Restart Athena.
- E . Delete and recreate the problematic Athena table.
A,C
Explanation:
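An empty result usually means that Athena is reading from the wrong S3 location or that the table's partitions are not registered in the AWS Glue Data Catalog. Confirming the table's S3 location addresses the first cause. Running MSCK REPAIR TABLE addresses the second by adding partitions that exist in Amazon S3 but are missing from the catalog. Increasing the timeout does not help because the query completes without an error, Athena is serverless and cannot be restarted, and recreating the table is unnecessary when the metadata can be repaired.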
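The two repair paths can be sketched as statements that a data engineer might run in Athena. The Python helper below only builds the SQL text; the table name and S3 location are hypothetical. MSCK REPAIR TABLE discovers only Hive-style partition folders (for example year=2024), so the explicit ALTER TABLE form is included for layouts that do not follow that convention.

```python
def repair_statement(table):
    """Return the statement that scans S3 and registers missing partitions."""
    return f"MSCK REPAIR TABLE {table};"


def add_partition_statement(table, year, location):
    """Return a statement that registers one year partition explicitly."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year = '{year}') LOCATION '{location}';"
    )


# Hypothetical table and location, for illustration only.
print(repair_statement("cloudtrail_logs"))
print(add_partition_statement(
    "cloudtrail_logs", 2024, "s3://example-data-lake/cloudtrail/2024/"
))
```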
A data engineer needs to create an empty copy of an existing table in Amazon Athena to perform data processing tasks. The existing table in Athena contains 1,000 rows.
Which query will meet this requirement?
- A . CREATE TABLE new_table LIKE old_table;
- B . CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA;
- C . CREATE TABLE new_table AS SELECT * FROM old_table;
- D . CREATE TABLE new_table AS SELECT * FROM old_table WHERE 1=1;
B
Explanation:
In Amazon Athena, you can use CREATE TABLE AS SELECT with WITH NO DATA to create an empty copy of an existing table’s schema:
“The query CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA; creates a new table with the same schema but without copying over the data.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2
This is the most efficient way to create an empty version of the existing table.
A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.
Which solution will meet these requirements with the LEAST development effort?
- A . Use Amazon EMR and Apache Ranger.
- B . Use a Hive metastore on an EMR cluster.
- C . Use the AWS Glue Data Catalog.
- D . Use a metastore on an Amazon RDS for MySQL DB instance.
C
Explanation:
The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog that provides a central metadata repository for various data sources and formats. You can use the AWS Glue Data Catalog as an external Hive metastore for Amazon EMR and Amazon Athena queries, and import metadata from existing Hive metastores into the Data Catalog. This solution requires the least development effort, as you can use AWS Glue crawlers to automatically discover and catalog the metadata from Hive, and use the AWS Glue console, AWS CLI, or Amazon EMR API to configure the Data Catalog as the Hive metastore. The other options are either more complex or require additional steps, such as setting up Apache Ranger for security, managing a Hive metastore on an EMR cluster or an RDS instance, or migrating the metadata manually.
Reference: Using the AWS Glue Data Catalog as the metastore for Hive (Section: Specifying AWS Glue Data Catalog as the metastore)
Metadata Management: Hive Metastore vs AWS Glue (Section: AWS Glue Data Catalog)
AWS Glue Data Catalog support for Spark SQL jobs (Section: Importing metadata from an existing Hive metastore)
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide (Chapter 5, page 131)
A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.
The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.
Which solution will meet these requirements?
- A . Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
- B . Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.
- C . Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
- D . Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
D
Explanation:
The company is facing performance issues loading data into Amazon Redshift because it is issuing separate COPY commands for each data file location. The most efficient way to increase the speed of data ingestion into Redshift without increasing the cost is to use a manifest file.
Option D: Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
A manifest file provides a list of all the data files, allowing the COPY command to load all files in parallel from different locations in Amazon S3. This significantly improves the loading speed without adding costs, as it optimizes the data loading process in a single COPY operation.
Other options (A, B, C) involve additional steps that would either increase the cost (provisioning clusters, using Glue, etc.) or do not address the core issue of needing a unified and efficient COPY process.
Reference: Amazon Redshift COPY Command
Redshift Manifest File Documentation
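A manifest is a small JSON document that lists the files to load. The Python sketch below builds one for files in several source folders and the single COPY statement that consumes it. The bucket names, table name, and role ARN are hypothetical, and file format options are omitted for brevity.

```python
import json


def build_manifest(urls):
    """Return manifest JSON; `mandatory` makes COPY fail if a file is missing."""
    entries = [{"url": url, "mandatory": True} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy(table, manifest_url, iam_role_arn):
    """Return one COPY statement that loads every file listed in the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )


# Hypothetical file locations, one per data source folder.
manifest = build_manifest([
    "s3://example-data-lake/source-a/part-0000.csv",
    "s3://example-data-lake/source-b/part-0000.csv",
    "s3://example-data-lake/source-c/part-0000.csv",
])
copy_sql = build_copy(
    "sales",
    "s3://example-data-lake/manifests/load.manifest",
    "arn:aws:iam::111122223333:role/redshift-load",
)
print(manifest)
print(copy_sql)
```

One COPY that reads the manifest replaces the separate per-folder COPY commands, so the cluster can spread all listed files across its slices in a single operation.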
A company has a gaming application that stores data in Amazon DynamoDB tables. A data engineer needs to ingest the game data into an Amazon OpenSearch Service cluster. Data updates must occur in near real time.
Which solution will meet these requirements?
- A . Use AWS Step Functions to periodically export data from the Amazon DynamoDB tables to an Amazon S3 bucket. Use an AWS Lambda function to load the data into Amazon OpenSearch Service.
- B . Configure an AWS Glue job to have a source of Amazon DynamoDB and a destination of Amazon OpenSearch Service to transfer data in near real time.
- C . Use Amazon DynamoDB Streams to capture table changes. Use an AWS Lambda function to process and update the data in Amazon OpenSearch Service.
- D . Use a custom OpenSearch plugin to sync data from the Amazon DynamoDB tables.
C
Explanation:
Problem Analysis:
The company uses DynamoDB for gaming data storage and needs to ingest data into Amazon OpenSearch Service in near real time.
Data updates must propagate quickly to OpenSearch for analytics or search purposes.
Key Considerations:
DynamoDB Streams provide near-real-time capture of table changes (inserts, updates, and deletes).
Integration with AWS Lambda allows seamless processing of these changes.
OpenSearch offers APIs for indexing and updating documents, which Lambda can invoke.
Solution Analysis:
Option A: Step Functions with Periodic Export
Not suitable for near-real-time updates; introduces significant latency. Operationally complex to manage periodic exports and S3 data ingestion.
Option B: AWS Glue Job
AWS Glue is designed for ETL workloads. A Glue job reads DynamoDB tables in batches, so it does not deliver change-by-change updates in near real time.
Option C: DynamoDB Streams + Lambda
DynamoDB Streams capture changes in near real time.
Lambda can process these streams and use the OpenSearch API to update the index.
This approach provides low latency and seamless integration with minimal operational overhead.
Option D: Custom OpenSearch Plugin
Writing a custom plugin adds complexity and is unnecessary with existing AWS integrations.
Implementation Steps:
Enable DynamoDB Streams for the relevant DynamoDB tables.
Create a Lambda function to process stream records:
Parse insert, update, and delete events.
Use OpenSearch APIs to index or update documents based on the event type.
Set up a trigger to invoke the Lambda function whenever there are changes in the DynamoDB Stream.
Monitor and log errors for debugging and operational health.
Reference: Amazon DynamoDB Streams Documentation
AWS Lambda and DynamoDB Integration
Amazon OpenSearch Service APIs
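The implementation steps above can be sketched as a Lambda handler in Python. The sketch turns DynamoDB stream records into index or delete actions and stops short of calling OpenSearch, because the client and endpoint depend on the deployment. The index name "game-data" and the action dictionary shape are assumptions, and the stream is assumed to include new images (NEW_IMAGE or NEW_AND_OLD_IMAGES).

```python
def plain(value):
    """Convert one DynamoDB-typed attribute, e.g. {"S": "x"}, to a Python value."""
    (kind, raw), = value.items()
    if kind == "N":
        try:
            return int(raw)
        except ValueError:
            return float(raw)
    if kind == "NULL":
        return None
    if kind == "M":
        return {k: plain(v) for k, v in raw.items()}
    if kind == "L":
        return [plain(v) for v in raw]
    return raw  # S, BOOL, and other types are passed through unchanged


def to_actions(event, index="game-data"):
    """Map stream records to index/delete actions keyed by the item's key."""
    actions = []
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        doc_id = "|".join(str(plain(keys[k])) for k in sorted(keys))
        if record["eventName"] == "REMOVE":
            actions.append({"op": "delete", "index": index, "id": doc_id})
        else:  # INSERT or MODIFY: write the latest version of the item
            image = record["dynamodb"]["NewImage"]
            doc = {k: plain(v) for k, v in image.items()}
            actions.append(
                {"op": "index", "index": index, "id": doc_id, "doc": doc}
            )
    return actions


def handler(event, context):
    """Lambda entry point; sending the actions to OpenSearch is left out."""
    actions = to_actions(event)
    # An OpenSearch client would apply `actions` here, e.g. via a bulk request.
    return {"processed": len(actions)}
```

Using the item's key as the document ID makes the handler idempotent: a retried batch overwrites the same documents instead of creating duplicates.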
