Practice Free Amazon DEA-C01 Exam Online Questions
A company is developing machine learning (ML) models. A data engineer needs to apply data quality rules to training data. The company stores the training data in an Amazon S3 bucket.
Which solution will meet this requirement with the LEAST operational effort?
- A . Create an AWS Lambda function to check data quality and to raise exceptions in the code.
- B . Create an AWS Glue DataBrew project for the data in the S3 bucket. Create a ruleset for the data quality rules. Create a profile job to run the data quality rules. Use Amazon EventBridge to run the profile job when data is added to the S3 bucket.
- C . Create an Amazon EMR provisioned cluster. Add a Python data quality package.
- D . Create AWS Lambda functions to evaluate data quality rules and orchestrate with AWS Step Functions.
B
Explanation:
AWS Glue DataBrew provides a no-code way to define and run data quality rulesets for data stored in S3. You can trigger profiling jobs via Amazon EventBridge on new uploads for automated checks.
“Use AWS Glue DataBrew to define and run data quality rules on S3 datasets with minimal coding effort. Automate validation by triggering jobs through EventBridge.”
– Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
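For illustration, here is a minimal sketch of option B's trigger path in Python with boto3. It assumes a DataBrew profile job (named training-data-profile-job here, a placeholder) with the quality ruleset already attached, and that an EventBridge rule for S3 "Object Created" events invokes this Lambda handler:

```python
import boto3

# Placeholder name for the DataBrew profile job that has the data quality
# ruleset attached.
PROFILE_JOB_NAME = "training-data-profile-job"

databrew = boto3.client("databrew")

def lambda_handler(event, context):
    """Invoked by an EventBridge rule that matches S3 'Object Created' events."""
    # Start the profile job; DataBrew evaluates the ruleset against the dataset
    # and writes the validation results back to S3.
    response = databrew.start_job_run(Name=PROFILE_JOB_NAME)
    return {"runId": response["RunId"]}
```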
A retail company uses an Amazon Redshift data warehouse and an Amazon S3 bucket. The company ingests retail order data into the S3 bucket every day.
The company stores all order data at a single path within the S3 bucket. The data has more than 100 columns. The company ingests the order data from a third-party application that generates more than 30 files in CSV format every day. Each CSV file is between 50 and 70 MB in size.
The company uses Amazon Redshift Spectrum to run queries that select sets of columns. Users aggregate metrics based on daily orders. Recently, users have reported that the performance of the queries has degraded. A data engineer must resolve the performance issues for the queries.
Which combination of steps will meet this requirement with the LEAST development effort? (Select TWO.)
- A . Configure the third-party application to create the files in a columnar format.
- B . Develop an AWS Glue ETL job to convert the multiple daily CSV files to one file for each day.
- C . Partition the order data in the S3 bucket based on order date.
- D . Configure the third-party application to create the files in JSON format.
- E . Load the JSON data into the Amazon Redshift table in a SUPER type column.
A,C
Explanation:
The performance issue in Amazon Redshift Spectrum queries arises due to the nature of CSV files, which are row-based storage formats. Spectrum is more optimized for columnar formats, which significantly improve performance by reducing the amount of data scanned. Also, partitioning data based on relevant columns like order date can further reduce the amount of data scanned, as queries can focus only on the necessary partitions.
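As a hedged sketch of what these two steps could look like, the snippet below uses the Redshift Data API to define a Redshift Spectrum external table over Parquet files partitioned by order date. The external schema, cluster, database, bucket, and column names are placeholders, and the external schema is assumed to already exist:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Hypothetical external table over columnar (Parquet) files, partitioned by
# order date so queries on daily orders scan only the relevant prefixes.
create_external_table = """
CREATE EXTERNAL TABLE spectrum_schema.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_total  DECIMAL(10,2)
    -- remaining order columns omitted for brevity
)
PARTITIONED BY (order_date DATE)
STORED AS PARQUET
LOCATION 's3://example-retail-bucket/orders/';
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder
    Database="dev",                       # placeholder
    DbUser="awsuser",                     # placeholder
    Sql=create_external_table,
)
print(response["Id"])  # statement ID, usable with describe_statement
```

New daily partitions would still need to be registered, for example with ALTER TABLE ... ADD PARTITION or a Glue crawler.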
A company needs a solution to manage costs for an existing Amazon DynamoDB table. The company also needs to control the size of the table. The solution must not disrupt any ongoing read or write operations. The company wants to use a solution that automatically deletes data from the table after 1 month.
Which solution will meet these requirements with the LEAST ongoing maintenance?
- A . Use the DynamoDB TTL feature to automatically expire data based on timestamps.
- B . Configure a scheduled Amazon EventBridge rule to invoke an AWS Lambda function to check for data that is older than 1 month. Configure the Lambda function to delete old data.
- C . Configure a stream on the DynamoDB table to invoke an AWS Lambda function. Configure the Lambda function to delete data in the table that is older than 1 month.
- D . Use an AWS Lambda function to periodically scan the DynamoDB table for data that is older than 1 month. Configure the Lambda function to delete old data.
A
Explanation:
The requirement is to manage the size of an Amazon DynamoDB table by automatically deleting data older than 1 month without disrupting ongoing read or write operations. The simplest and most maintenance-free solution is to use DynamoDB Time-to-Live (TTL).
Option A: Use the DynamoDB TTL feature to automatically expire data based on timestamps. DynamoDB TTL allows you to specify an attribute (e.g., a timestamp) that defines when items in the table should expire. After the expiration time, DynamoDB automatically deletes the items, freeing up storage space and keeping the table size under control without manual intervention or disruptions to ongoing operations.
Other options involve higher maintenance and manual scheduling or scanning operations, which increase complexity unnecessarily compared to the native TTL feature.
Reference: DynamoDB Time-to-Live (TTL)
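A minimal sketch of how TTL could be enabled and populated with boto3; the table name (Orders) and attribute name (expire_at, stored as epoch seconds) are assumptions:

```python
import time
import boto3

dynamodb = boto3.client("dynamodb")

# Enable TTL on the table, keyed on the 'expire_at' attribute (epoch seconds).
dynamodb.update_time_to_live(
    TableName="Orders",  # placeholder table name
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expire_at"},
)

# When writing items, set 'expire_at' to roughly one month in the future so
# DynamoDB deletes the item automatically after it expires.
ONE_MONTH_SECONDS = 30 * 24 * 60 * 60
dynamodb.put_item(
    TableName="Orders",
    Item={
        "order_id": {"S": "12345"},
        "expire_at": {"N": str(int(time.time()) + ONE_MONTH_SECONDS)},
    },
)
```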
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company’s data analysts can access data only for customers who are within the same country as the analysts.
Which solution will meet these requirements with the LEAST operational effort?
- A . Create a separate table for each country’s customer data. Provide access to each analyst based on the country that the analyst serves.
- B . Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company’s access policies.
- C . Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
- D . Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.
B
Explanation:
AWS Lake Formation is a service that allows you to easily set up, secure, and manage data lakes. One of the features of Lake Formation is row-level security, which enables you to control access to specific rows or columns of data based on the identity or role of the user. This feature is useful for scenarios where you need to restrict access to sensitive or regulated data, such as customer data from different countries. By registering the S3 bucket as a data lake location in Lake Formation, you can use the Lake Formation console or APIs to define and apply row-level security policies to the data in the bucket. You can also use Lake Formation blueprints to automate the ingestion and transformation of data from various sources into the data lake. This solution requires the least operational effort compared to the other options, as it does not involve creating or moving data, or managing multiple tables, views, or roles.
Reference: AWS Lake Formation
Row-Level Security
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 4: Data Lakes and Data Warehouses, Section 4.2: AWS Lake Formation
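As a hedged sketch of the row-level security setup, the call below creates a Lake Formation data cells filter that exposes only one country's rows of a hypothetical customers table; the account ID, database, table, and column names are assumptions:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Row filter that exposes only German customers; analysts in Germany would be
# granted SELECT on the table through this filter instead of the full table.
lakeformation.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",      # placeholder AWS account ID
        "DatabaseName": "customer_data_hub",   # placeholder database
        "TableName": "customers",              # placeholder table
        "Name": "customers_de_only",
        "RowFilter": {"FilterExpression": "country = 'DE'"},
        "ColumnWildcard": {"ExcludedColumnNames": []},  # expose all columns
    }
)
```

One filter per country, granted to the matching analyst group, keeps the data in a single table while enforcing the per-country policy.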
A company wants to use Apache Spark jobs that run on an Amazon EMR cluster to process streaming data. The Spark jobs will transform and store the data in an Amazon S3 bucket. The company will use Amazon Athena to perform analysis.
The company needs to optimize the data format for analytical queries.
Which solutions will meet these requirements with the SHORTEST query times? (Select TWO.)
- A . Use Avro format. Use AWS Glue Data Catalog to track schema changes.
- B . Use ORC format. Use AWS Glue Data Catalog to track schema changes.
- C . Use Apache Parquet format. Use an external Amazon DynamoDB table to track schema changes.
- D . Use Apache Parquet format. Use AWS Glue Data Catalog to track schema changes.
- E . Use ORC format. Store schema definitions in separate files in Amazon S3.
B,D
Explanation:
For high-performance analytics in Athena and EMR Spark, the columnar storage formats Parquet and ORC deliver the shortest query times. These formats minimize I/O, enable predicate pushdown, and support efficient compression and parallel reads.
When integrated with the AWS Glue Data Catalog, schema evolution is automatically tracked and managed across queries.
This combination (Parquet/ORC + Glue Catalog) is explicitly recommended by AWS for analytical workloads on S3 data lakes.
“For analytical workloads using Amazon EMR and Athena, store data in columnar formats such as Parquet or ORC. These formats provide optimized compression and query performance.”
– Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
“Use the AWS Glue Data Catalog to manage table schemas and schema evolution for columnar data formats such as Parquet and ORC.”
– Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
Therefore, using ORC or Parquet along with the Glue Data Catalog achieves the shortest Athena query times and seamless schema management.
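A minimal PySpark sketch of the write path, assuming the EMR cluster is configured to use the AWS Glue Data Catalog as its Hive metastore and that the database, bucket paths, and partition column are placeholders (for a streaming job the same write would run inside a foreachBatch handler):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("transform-to-parquet")
    # On EMR, Hive support plus the Glue Data Catalog metastore setting lets
    # the table schema be tracked centrally and queried from Athena.
    .enableHiveSupport()
    .getOrCreate()
)

# Placeholder for the transformed DataFrame produced by the Spark job.
df = spark.read.json("s3://example-raw-bucket/events/")

(
    df.write.mode("append")
    .format("parquet")
    .partitionBy("event_date")  # placeholder partition column
    .option("path", "s3://example-curated-bucket/events_parquet/")
    .saveAsTable("analytics.events_parquet")  # registered in the Glue Data Catalog
)
```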
A company stores employee data in Amazon Redshift. A table named Employee uses columns named Region ID, Department ID, and Role ID as a compound sort key.
Which queries will MOST increase the speed of a query by using a compound sort key of the table? (Select TWO.)
- A . Select * from Employee where Region ID=’North America’;
- B . Select * from Employee where Region ID=’North America’ and Department ID=20;
- C . Select * from Employee where Department ID=20 and Region ID=’North America’;
- D . Select " from Employee where Role ID=50;
- E . Select * from Employee where Region ID=’North America’ and Role ID=50;
B,C
Explanation:
In Amazon Redshift, a compound sort key is designed to optimize the performance of queries that use filtering and join conditions on the columns in the sort key. A compound sort key orders the data based on the first column, followed by the second, and so on. In the scenario given, the compound sort key consists of Region ID, Department ID, and Role ID. Therefore, queries that filter on the leading columns of the sort key are more likely to benefit from this order.
Option B: "Select * from Employee where Region ID=’North America’ and Department ID=20;"This query will perform well because it uses both the Region ID and Department ID, which are the first two columns of the compound sort key. The order of the columns in the WHERE clause matches the order in the sort key, thus allowing the query to scan fewer rows and improve performance.
Option C: "Select * from Employee where Department ID=20 and Region ID=’North America’; “This query also benefits from the compound sort key because it includes both Region ID and Department ID, which are the first two columns in the sort key. Although the order in the WHERE clause does not match exactly, Amazon Redshift will still leverage the sort key to reduce the amount of data scanned, improving query speed.
Options A, D, and E are less optimal because they do not utilize the sort key as effectively:
Option A only filters by the Region ID, which may still use the sort key but does not take full advantage of the compound nature.
Option D uses only Role ID, the last column in the compound sort key, which will not benefit much from sorting since it is the third key in the sort order.
Option E filters on Region ID and Role ID but skips the Department ID column, making it less efficient for the compound sort key.
Reference: Amazon Redshift Documentation – Sorting Data
AWS Certified Data Analytics Study Guide
AWS Certification – Data Engineer Associate Exam Guide
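For reference, a hedged sketch of the DDL implied by the question (column names simplified, cluster details are placeholders), showing the compound sort key order that queries B and C line up with:

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Simplified Employee table whose compound sort key leads with region, then
# department, then role, matching the filters in options B and C.
create_employee = """
CREATE TABLE employee (
    region_id     VARCHAR(32),
    department_id INTEGER,
    role_id       INTEGER,
    employee_name VARCHAR(128)
)
COMPOUND SORTKEY (region_id, department_id, role_id);
"""

redshift_data.execute_statement(
    ClusterIdentifier="example-cluster",  # placeholder
    Database="dev",                       # placeholder
    DbUser="awsuser",                     # placeholder
    Sql=create_employee,
)
```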
A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one to five task nodes for the company’s long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.
When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.
The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.
Which solution will meet these requirements MOST cost-effectively?
- A . Increase the maximum number of task nodes for EMR managed scaling to 10.
- B . Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.
- C . Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.
- D . Reduce the scaling cooldown period for the provisioned EMR cluster.
C
Explanation:
The company’s Apache Spark ETL job on Amazon EMR uses high CPU but low memory, meaning that compute-optimized EC2 instances would be the most cost-effective choice. These instances are designed for high-performance compute applications, where CPU usage is high, but memory needs are minimal, which is exactly the case here.
Compute Optimized Instances:
Compute-optimized instances, such as the C5 series, provide a higher ratio of CPU to memory, which is more suitable for jobs with high CPU usage and relatively low memory consumption.
Switching from general-purpose EC2 instances to compute-optimized instances can reduce costs while improving performance, as these instances are optimized for workloads like Spark jobs that perform a lot of computation.
Reference: Amazon EC2 Compute Optimized Instances
Managed Scaling: The EMR cluster’s scaling is currently managed between 1 and 5 nodes, so changing the instance type will leverage the current scaling strategy but optimize it for the workload.
Alternatives Considered:
A (Increase task nodes to 10): Increasing the number of task nodes would increase costs without necessarily improving performance. Since memory usage is low, the bottleneck is more likely the CPU, which compute-optimized instances can handle better.
B (Memory optimized instances): Memory-optimized instances are not suitable since the current job is CPU-bound, and memory usage remains low (under 30%).
D (Reduce scaling cooldown): This could marginally improve scaling speed but does not address the need for cost optimization and improved CPU performance.
Reference: Amazon EMR Cluster Optimization
Compute Optimized EC2 Instances
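A hedged sketch of one way to make the switch on an existing cluster with boto3: add a compute optimized task instance group (cluster ID and instance type are placeholders), after which the general purpose task group can be shrunk or removed while managed scaling keeps operating between one and five nodes:

```python
import boto3

emr = boto3.client("emr")

# Add a task instance group backed by compute optimized instances.
emr.add_instance_groups(
    JobFlowId="j-EXAMPLE1234567",  # placeholder cluster ID
    InstanceGroups=[
        {
            "Name": "compute-optimized-task-nodes",
            "InstanceRole": "TASK",
            "InstanceType": "c5.4xlarge",  # example compute optimized type
            "InstanceCount": 1,
            "Market": "ON_DEMAND",
        }
    ],
)
```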
A data engineer needs to build an enterprise data catalog based on the company’s Amazon S3 buckets and Amazon RDS databases. The data catalog must include storage format metadata for the data in the catalog.
Which solution will meet these requirements with the LEAST effort?
- A . Use an AWS Glue crawler to scan the S3 buckets and RDS databases and build a data catalog. Use data stewards to inspect the data and update the data catalog with the data format.
- B . Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog.
- C . Use Amazon Macie to build a data catalog and to identify sensitive data elements. Collect the data format information from Macie.
- D . Use scripts to scan data elements and to assign data classifications based on the format of the data.
B
Explanation:
To build an enterprise data catalog with metadata for storage formats, the easiest and most efficient solution is using an AWS Glue crawler. The Glue crawler can scan Amazon S3 buckets and Amazon RDS databases to automatically create a data catalog that includes metadata such as the schema and storage format (e.g., CSV, Parquet, etc.). By using AWS Glue crawler classifiers, you can configure the crawler to recognize the format of the data and store this information directly in the catalog.
Option B: Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog. This option meets the requirements with the least effort because Glue crawlers automate the discovery and cataloging of data from multiple sources, including S3 and RDS, while recognizing various file formats via classifiers.
Other options (A, C, D) involve additional manual steps, like having data stewards inspect the data, or using services like Amazon Macie that focus more on sensitive data detection rather than format cataloging.
Reference: AWS Glue Crawler Documentation
AWS Glue Classifiers
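A hedged sketch of such a crawler with boto3; the IAM role, Glue connection, S3 path, and database names are placeholders, and the empty Classifiers list relies on the built-in classifiers to detect formats such as CSV, JSON, and Parquet:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="enterprise-catalog-crawler",
    Role="GlueCrawlerServiceRole",   # placeholder IAM role
    DatabaseName="enterprise_catalog",
    Targets={
        "S3Targets": [{"Path": "s3://example-data-bucket/"}],  # placeholder bucket
        "JdbcTargets": [
            {
                "ConnectionName": "rds-connection",  # placeholder Glue connection to RDS
                "Path": "exampledb/%",               # placeholder database/schema
            }
        ],
    },
    # Custom classifiers are only needed for formats the built-in classifiers
    # do not recognize; the detected format is stored in the table metadata.
    Classifiers=[],
)

glue.start_crawler(Name="enterprise-catalog-crawler")
```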
A company has a data warehouse in Amazon Redshift. To comply with security regulations, the company needs to log and store all user activities and connection activities for the data warehouse.
Which solution will meet these requirements?
- A . Create an Amazon S3 bucket. Enable logging for the Amazon Redshift cluster. Specify the S3 bucket in the logging configuration to store the logs.
- B . Create an Amazon Elastic File System (Amazon EFS) file system. Enable logging for the Amazon Redshift cluster. Write logs to the EFS file system.
- C . Create an Amazon Aurora MySQL database. Enable logging for the Amazon Redshift cluster. Write the logs to a table in the Aurora MySQL database.
- D . Create an Amazon Elastic Block Store (Amazon EBS) volume. Enable logging for the Amazon Redshift cluster. Write the logs to the EBS volume.
A
Explanation:
Problem Analysis:
The company must log all user activities and connection activities in Amazon Redshift for security compliance.
Key Considerations:
Redshift supports audit logging, which can be configured to write logs to an S3 bucket.
S3 provides durable, scalable, and cost-effective storage for logs.
Solution Analysis:
Option A: S3 for Logging
Standard approach for storing Redshift logs.
Easy to set up and manage with minimal cost.
Option B: Amazon EFS
EFS is unnecessary for this use case and less cost-efficient than S3.
Option C: Aurora MySQL
Using a database to store logs increases complexity and cost.
Option D: EBS Volume
EBS is not a scalable option for log storage compared to S3.
Final Recommendation:
Enable Redshift audit logging and specify an S3 bucket as the destination.
Amazon Redshift Audit Logging
Storing Logs in Amazon S3
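A hedged sketch of enabling audit logging with boto3; the cluster identifier, bucket, and prefix are placeholders:

```python
import boto3

redshift = boto3.client("redshift")

# Deliver connection and user logs to S3. The user activity log additionally
# requires 'enable_user_activity_logging' to be set to true in the cluster
# parameter group.
redshift.enable_logging(
    ClusterIdentifier="example-cluster",       # placeholder
    BucketName="example-redshift-audit-logs",  # placeholder, must permit Redshift log delivery
    S3KeyPrefix="audit/",
)
```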
A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently.
The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database.
Which AWS service should the company use to meet these requirements?
- A . AWS Lambda
- B . AWS Database Migration Service (AWS DMS)
- C . AWS Direct Connect
- D . AWS DataSync
B
Explanation:
AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores to AWS quickly, securely, and with minimal downtime and zero data loss. AWS DMS supports migration between 20-plus database and analytics engines, such as Microsoft SQL Server to Amazon RDS for SQL Server. AWS DMS takes over many of the difficult or tedious tasks involved in a migration project, such as capacity analysis, hardware and software procurement, installation and administration, testing and debugging, and ongoing replication and monitoring. AWS DMS is a cost-effective solution, as you only pay for the compute resources and additional log storage used during the migration process. AWS DMS is the best solution for the company to migrate the financial transaction data from the on-premises Microsoft SQL Server database to AWS, as it meets the requirements of minimal downtime, zero data loss, and low cost.
Option A is not the best solution, as AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, but it does not provide any built-in features for database migration. You would have to write your own code to extract, transform, and load the data from the source to the target, which would increase the operational overhead and complexity.
Option C is not the best solution, as AWS Direct Connect is a service that establishes a dedicated network connection from your premises to AWS, but it does not provide any built-in features for database migration. You would still need to use another service or tool to perform the actual data transfer, which would increase the cost and complexity.
Option D is not the best solution, as AWS DataSync is a service that makes it easy to transfer data between on-premises storage systems and AWS storage services, such as Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, but it does not support Amazon RDS for SQL Server as a target. You would have to use another service or tool to migrate the data from Amazon S3 to Amazon RDS for SQL Server, which would increase the latency and complexity.
Reference: Database Migration – AWS Database Migration Service – AWS
What is AWS Database Migration Service?
AWS Database Migration Service Documentation
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
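As a hedged sketch of the DMS pieces involved, the snippet below creates and starts a full-load-plus-CDC replication task with boto3; the replication instance and the source and target endpoints are assumed to exist already, and every ARN, identifier, and schema name is a placeholder:

```python
import boto3

dms = boto3.client("dms")

# Full load plus change data capture (CDC) keeps the RDS for SQL Server target
# in sync while the on-premises database stays online, minimizing downtime.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="financial-data-migration",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",    # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",    # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:111122223333:rep:INSTANCE",  # placeholder
    MigrationType="full-load-and-cdc",
    TableMappings=(
        '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "1", '
        '"object-locator": {"schema-name": "dbo", "table-name": "%"}, '
        '"rule-action": "include"}]}'
    ),
)

# In practice, wait until the task status is 'ready' before starting it.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```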
