Practice Free Amazon DEA-C01 Exam Online Questions
A media company uploads large video files to Amazon S3 for processing. After processing, the company needs to keep the original files for 90 days in case the files require reprocessing. After 90 days, the company can delete the files to reduce storage costs. The company stores the processed videos in a different S3 bucket.
Which S3 Lifecycle configuration will meet these requirements for the original files MOST cost-effectively?
- A . Store the files in S3 Standard for 90 days. Transition the files to S3 Glacier Flexible Retrieval for long-term storage. Then expire the files.
- B . Store the files in S3 Standard for 90 days. Enable versioning. Enable Object Lock on the files for 90 days. Then expire the files.
- C . Store the files in S3 Standard for 90 days. Implement S3 Lifecycle management to expire the files.
- D . Store the files in S3 Intelligent-Tiering for 90 days. Enable versioning. Add S3 Lifecycle management to expire the files.
C
Explanation:
Option C is correct because the company only needs to keep the original files for 90 days and then delete them. AWS states that S3 Lifecycle can be used to delete expired objects automatically, which directly matches this requirement. There is no need to transition the objects to another storage class because the files are not needed beyond day 90. Adding a transition step would add complexity and, in this case, would not improve cost-effectiveness.
Option A is not optimal because S3 Glacier Flexible Retrieval has a 90-day minimum storage duration and is intended for longer-term archival. Transitioning objects right before deleting them is unnecessary overhead and can add minimum-duration charges.
Option B is incorrect because Object Lock is for retention and write-once-read-many protection, not for normal temporary retention with lowest cost.
Option D is also less cost-effective because S3 Intelligent-Tiering is useful when access patterns are uncertain, but here the retention period and deletion timing are already known. The simplest and cheapest solution is to keep the originals in S3 Standard for the required 90 days and then expire them with Lifecycle.
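The expiration-only rule described in Option C can be sketched as a Lifecycle configuration document. This is a minimal illustration, not the exam's official answer artifact: the rule ID, the empty prefix, and the bucket name in the comment are assumptions.

```python
import json


def expire_after_days(days: int, prefix: str = "") -> dict:
    """Build an S3 Lifecycle configuration with one expiration-only rule."""
    return {
        "Rules": [
            {
                "ID": f"expire-originals-after-{days}-days",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "Status": "Enabled",
                # No "Transitions" entry: objects stay in S3 Standard until they expire.
                "Expiration": {"Days": days},
            }
        ]
    }


lifecycle = expire_after_days(90)
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, the document could be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-originals-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle,
# )
```

Because the rule has no transition step, there is no minimum-storage-duration charge from an archive class to consider.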
A media company uses software as a service (SaaS) applications to gather data by using third-party tools. The company needs to store the data in an Amazon S3 bucket. The company will use Amazon Redshift to perform analytics based on the data.
Which AWS service or feature will meet these requirements with the LEAST operational overhead?
- A . Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- B . Amazon AppFlow
- C . AWS Glue Data Catalog
- D . Amazon Kinesis
B
Explanation:
Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications and AWS services like Amazon S3 and Amazon Redshift. Amazon AppFlow supports many SaaS applications as data sources and targets, and allows you to configure data flows with a few clicks. Amazon AppFlow also provides features such as data transformation, filtering, validation, and encryption to prepare and protect your data. Amazon AppFlow meets the requirements of the media company with the least operational overhead, as it eliminates the need to write code, manage infrastructure, or monitor data pipelines.
Reference: Amazon AppFlow
Amazon AppFlow | SaaS Integrations List
Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions
A company uses Amazon S3 to store data and Amazon QuickSight to create visualizations.
The company has an S3 bucket in an AWS account named Hub-Account. The S3 bucket is encrypted with an AWS Key Management Service (AWS KMS) key. The company’s Amazon QuickSight instance is in a separate AWS account named BI-Account.
The company updates the S3 bucket policy to grant access to the QuickSight service role. The company wants to enable cross-account access to allow QuickSight to interact with the S3 bucket.
Which combination of steps will meet this requirement? (Select TWO)
- A . Use the existing AWS KMS key to encrypt connections from QuickSight to the S3 bucket.
- B . Add the S3 bucket as a resource that the QuickSight service role can access.
- C . Use AWS Resource Access Manager (AWS RAM) to share the S3 bucket with the BI-Account.
- D . Add an IAM policy to the QuickSight service role to give QuickSight access to the KMS key that encrypts the S3 bucket.
- E . Add the KMS key as a resource that the QuickSight service role can access.
BD
Explanation:
For Amazon QuickSight to access data in an Amazon S3 bucket that is encrypted with an AWS KMS key in a different AWS account, two distinct permissions are required: access to the S3 bucket and access to the KMS key.
First, the QuickSight service role in the BI-Account must be granted permission to access the S3 bucket. This is accomplished by adding the S3 bucket as a resource that the QuickSight service role can access and configuring the bucket policy in the Hub-Account to trust that role. Without explicit S3 permissions, QuickSight cannot read the objects.
Second, because the S3 bucket uses SSE-KMS encryption, QuickSight must also be authorized to use the KMS key. This requires adding an IAM policy to the QuickSight service role that allows kms:Decrypt and related permissions on the KMS key. Without KMS permissions, S3 access will fail even if the bucket policy allows access.
AWS RAM cannot be used to share S3 buckets, encryption of network connections is irrelevant to KMS permissions, and adding the KMS key as a “resource” alone is not sufficient without IAM permissions.
Therefore, Options B and D are correct.
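The two permissions discussed above might look like the following identity policy for the QuickSight service role. This is a sketch under stated assumptions: the bucket name, account ID, and key ID are placeholders, and only kms:Decrypt is listed even though a real setup may need related KMS actions. For cross-account use, the key policy in Hub-Account must also allow the role.

```python
import json


def quicksight_access_policy(bucket: str, kms_key_arn: str) -> dict:
    """Identity policy granting read access to the bucket and decrypt on its key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadHubBucket",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket ARN, GetObject to the object ARNs.
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Sid": "DecryptWithHubKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": kms_key_arn,
            },
        ],
    }


policy = quicksight_access_policy(
    "example-hub-bucket",  # hypothetical bucket name
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # placeholder key ARN
)
print(json.dumps(policy, indent=2))
```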
A company uses Amazon Redshift to store order transactions from the current day. The company has an orders table that contains the previous order data. The company also has a staging table that contains new or updated order records. The company needs to remove stale records from the orders table and insert the most recent data in the orders table from the staging table. Several downstream applications need the orders table to display up-to-date information.
Which solution will meet these requirements?
- A . Use Amazon Redshift Spectrum to delete stale records from the orders table and insert records from the staging table into the orders table.
- B . Unload the orders table and the staging table to Amazon S3. Delete stale orders table data and insert new staging table data in Amazon S3 by using Amazon Athena. Copy the orders S3 table to the orders Amazon Redshift table.
- C . Use Amazon Athena federated queries to read stale records from the orders table. Delete the stale records and insert the records from the staging table into the orders table.
- D . Write an Amazon Redshift stored procedure that deletes the stale records from the orders table and inserts new records from the staging table.
D
Explanation:
Option D is correct because Amazon Redshift stored procedures are designed to encapsulate a sequence of SQL statements and business logic inside the database. AWS documentation states that stored procedures are commonly used for data transformation, data validation, and business-specific logic, and that they can combine multiple SQL steps into one procedure. AWS also documents the standard Redshift pattern for deleting stale rows and inserting fresh rows from a staging table, which is exactly the requirement here. Keeping the operation inside Redshift is the most direct way to maintain an up-to-date orders table for downstream consumers.
Option A is incorrect because Redshift Spectrum is for querying external data in S3, not for performing this in-place Redshift table-maintenance pattern.
Option B adds unnecessary unload and reload steps, creating delay and operational complexity.
Option C is also unsuitable because Athena federated queries are not the right mechanism for transactional maintenance of Redshift tables. The best answer is a Redshift stored procedure that deletes stale rows from the orders table and inserts the current rows from the staging table.
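The delete-then-insert pattern can be demonstrated locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for Amazon Redshift, so it shows only the logic a stored procedure would wrap, not Redshift syntax. The table and column names (orders, orders_staging, order_id, status) are assumptions.

```python
import sqlite3


def refresh_orders(conn: sqlite3.Connection) -> None:
    """Replace stale orders with staging rows inside a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        # Remove every order that has a newer version waiting in staging.
        conn.execute(
            "DELETE FROM orders "
            "WHERE order_id IN (SELECT order_id FROM orders_staging)"
        )
        # Insert all new and updated rows from staging.
        conn.execute("INSERT INTO orders SELECT * FROM orders_staging")


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_staging (order_id INTEGER PRIMARY KEY, status TEXT);
    INSERT INTO orders VALUES (1, 'pending'), (2, 'shipped');
    INSERT INTO orders_staging VALUES (1, 'shipped'), (3, 'pending');
    """
)
refresh_orders(conn)
rows = conn.execute("SELECT order_id, status FROM orders ORDER BY order_id").fetchall()
print(rows)  # [(1, 'shipped'), (2, 'shipped'), (3, 'pending')]
```

Running both statements in one transaction means downstream readers never see the table with the stale rows deleted but the fresh rows not yet inserted.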
A company is developing machine learning (ML) models. A data engineer needs to apply data quality rules to training data. The company stores the training data in an Amazon S3 bucket.
- A . Create an AWS Lambda function to check data quality and to raise exceptions in the code.
- B . Create an AWS Glue DataBrew project for the data in the S3 bucket. Create a ruleset for the data quality rules. Create a profile job to run the data quality rules. Use Amazon EventBridge to run the profile job when data is added to the S3 bucket.
- C . Create an Amazon EMR provisioned cluster. Add a Python data quality package.
- D . Create AWS Lambda functions to evaluate data quality rules and orchestrate with AWS Step Functions.
B
Explanation:
AWS Glue DataBrew provides a no-code way to define and run data quality rulesets for data stored in S3. You can trigger profiling jobs via Amazon EventBridge on new uploads for automated checks.
“Use AWS Glue DataBrew to define and run data quality rules on S3 datasets with minimal coding effort. Automate validation by triggering jobs through EventBridge.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
A data engineer is configuring Amazon SageMaker Studio to use AWS Glue interactive sessions to prepare data for machine learning (ML) models.
The data engineer receives an access denied error when the data engineer tries to prepare the data by using SageMaker Studio.
Which change should the engineer make to gain access to SageMaker Studio?
- A . Add the AWSGlueServiceRole managed policy to the data engineer’s IAM user.
- B . Add a policy to the data engineer’s IAM user that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy.
- C . Add the AmazonSageMakerFullAccess managed policy to the data engineer’s IAM user.
- D . Add a policy to the data engineer’s IAM user that allows the sts:AddAssociation action for the AWS Glue and SageMaker service principals in the trust policy.
B
Explanation:
Option B meets the requirement of gaining access to SageMaker Studio to use AWS Glue interactive sessions. AWS Glue interactive sessions provide an on-demand, serverless Apache Spark runtime that can be used from notebooks such as SageMaker Studio. To use them, the data engineer’s IAM user needs permission to assume the AWS Glue service role and the SageMaker execution role. Adding a policy that includes the sts:AssumeRole action for the AWS Glue and SageMaker service principals in the trust policy grants these permissions and resolves the access denied error. The other options are not sufficient or necessary to resolve the error.
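A trust policy that names both service principals might look like the sketch below. It is illustrative only: the document does not show the actual role, so which role this trust policy is attached to is an assumption.

```python
import json

# Trust policy allowing both AWS Glue and SageMaker to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["glue.amazonaws.com", "sagemaker.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```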
Reference: Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions
Troubleshoot Errors – Amazon SageMaker
AccessDeniedException on sagemaker:CreateDomain in AWS SageMaker Studio, despite having SageMakerFullAccess
A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records.
A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day’s data.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift.
- B . Use the streaming ingestion feature of Amazon Redshift.
- C . Load the data into Amazon S3. Use the COPY command to load the data into Amazon Redshift.
- D . Use the Amazon Aurora zero-ETL integration with Amazon Redshift.
B
Explanation:
The streaming ingestion feature of Amazon Redshift ingests data from streaming sources, such as Amazon Kinesis Data Streams, directly into Amazon Redshift in near real time. The data from the wearable devices, hospital equipment, and patient records lands in a materialized view that Redshift refreshes incrementally as new records arrive. The data engineer can therefore store the data in the Amazon Redshift Serverless warehouse and run near real-time analytics on both the streaming data and the previous day’s data. This solution has the least operational overhead because it requires no additional services or components to ingest and process the streaming data.
The other options are either not feasible or not optimal. Loading data into Amazon Kinesis Data Firehose and then into Amazon Redshift (option A) would introduce additional latency and cost and would require additional configuration and management. Loading data into Amazon S3 and then using the COPY command to load the data into Amazon Redshift (option C) would also introduce additional latency and cost and would require additional storage space and ETL logic. Using the Amazon Aurora zero-ETL integration with Amazon Redshift (option D) would not work because it requires the data to be stored in Amazon Aurora first, which is not the case for the healthcare company’s streaming data.
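Streaming ingestion is typically set up with two SQL statements: an external schema that points at Kinesis and a materialized view over the stream. The helper below only assembles those statements as strings. The schema, stream, view, and role names are hypothetical, and the selected columns are one plausible choice rather than a required layout.

```python
def streaming_ingestion_sql(schema: str, stream: str, view: str, iam_role_arn: str) -> list:
    """Return the two statements that set up Redshift streaming ingestion."""
    # External schema that maps Kinesis Data Streams into Redshift.
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema} FROM KINESIS IAM_ROLE '{iam_role_arn}';"
    )
    # Materialized view over the stream; AUTO REFRESH keeps it close to real time.
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS\n"
        f"SELECT approximate_arrival_timestamp,\n"
        f"       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream}";'
    )
    return [create_schema, create_view]


statements = streaming_ingestion_sql(
    schema="kds",                      # hypothetical schema name
    stream="health-data-stream",       # hypothetical stream name
    view="health_events",              # hypothetical view name
    iam_role_arn="arn:aws:iam::111122223333:role/ExampleRedshiftStreamingRole",
)
for statement in statements:
    print(statement)
```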
Reference: Using streaming ingestion with Amazon Redshift
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.5: Amazon Redshift Streaming Ingestion
A company uses a variety of AWS and third-party data stores. The company wants to consolidate all the data into a central data warehouse to perform analytics. Users need fast response times for analytics queries.
The company uses Amazon QuickSight in direct query mode to visualize the data. Users normally run queries during a few hours each day with unpredictable spikes.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Use Amazon Redshift Serverless to load all the data into Amazon Redshift managed storage (RMS).
- B . Use Amazon Athena to load all the data into Amazon S3 in Apache Parquet format.
- C . Use Amazon Redshift provisioned clusters to load all the data into Amazon Redshift managed storage (RMS).
- D . Use Amazon Aurora PostgreSQL to load all the data into Aurora.
A
Explanation:
Problem Analysis:
The company requires a centralized data warehouse for consolidating data from various sources.
They use Amazon QuickSight in direct query mode, necessitating fast response times for analytical queries.
Users query the data intermittently, with unpredictable spikes during the day.
Operational overhead should be minimal.
Key Considerations:
The solution must support fast, SQL-based analytics.
It must handle unpredictable spikes efficiently.
Must integrate seamlessly with QuickSight for direct querying.
Minimize operational complexity and scaling concerns.
Solution Analysis:
Option A: Amazon Redshift Serverless
Redshift Serverless eliminates the need for provisioning and managing clusters.
Automatically scales compute capacity up or down based on query demand.
Reduces operational overhead by handling performance optimization.
Fully integrates with Amazon QuickSight, ensuring low-latency analytics.
Reduces costs as it charges only for usage, making it ideal for workloads with intermittent spikes.
Option B: Amazon Athena with S3 (Apache Parquet)
Athena supports querying data directly from S3 in Parquet format.
While it’s cost-effective, performance depends on the size and complexity of the data.
It is not optimized for high-speed analytics needed by QuickSight in direct query mode.
Option C: Amazon Redshift Provisioned Clusters
Requires manual cluster provisioning, scaling, and maintenance.
Higher operational overhead compared to Redshift Serverless.
Option D: Amazon Aurora PostgreSQL
Aurora is optimized for transactional databases, not data warehousing or analytics.
Does not meet the requirement for fast analytics queries.
Final Recommendation:
Amazon Redshift Serverless is the best choice for this use case because it provides fast analytics, integrates natively with QuickSight, and minimizes operational complexity while efficiently handling unpredictable spikes.
Amazon Redshift Serverless Overview
Amazon QuickSight and Redshift Integration
Athena vs. Redshift
A company stores historical customer data in an Amazon Redshift table. A column named Email contains null entries and values that are not email addresses. The quality of the Email column is critical for multiple downstream processes. A data engineer must create an AWS Glue Data Quality rule that fails when the percentage of valid email addresses in the Email column is less than 90%.
Which component of an AWS Glue Data Quality rule will meet these requirements?
- A . Uniqueness "Email" matches with a threshold set to > 0.9
- B . Column Values "Email" matches with a threshold set to > 0.1
- C . Column Values "Email" matches with a threshold set to > 0.9
- D . Unique Value Ratio "Email" matches with a threshold set to > 0.1
C
Explanation:
Option C is correct because the requirement is to verify that at least 90% of values in the Email column are valid email addresses. In AWS Glue Data Quality, rules are written in DQDL, and Column Values is the rule component used to validate whether column values satisfy a condition or pattern. AWS Glue Data Quality documentation explains that DQDL is the language used to define rules, and it specifically notes that Column Values rules can evaluate column content and that NULL values do not pass during comparisons. That is important here because the problem explicitly says the column contains null entries and invalid email values. A threshold of > 0.9 means the rule passes only when more than 90% of rows satisfy the email-validation condition.
Option A is incorrect because Uniqueness checks the percentage of values that are unique, not whether values are valid email addresses. AWS documents Uniqueness as measuring how many values occur exactly once, which is unrelated to email-format validity.
Option D has the same issue because Unique Value Ratio addresses uniqueness characteristics, not format validation.
Option B uses the right rule family but the wrong threshold because > 0.1 would require only 10% valid values. Therefore, Column Values with a threshold greater than 0.9 is the correct choice.
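The threshold logic can be checked with a small local simulation. The DQDL string below follows the ColumnValues ... matches ... with threshold form, but the email regex is an illustrative assumption rather than a pattern from the question. The Python functions only mimic how the ratio is evaluated, with nulls counted as failures.

```python
import re

# Illustrative email pattern (an assumption; real validation rules may differ).
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# How the rule might be written in DQDL.
DQDL_RULESET = (
    f'Rules = [ ColumnValues "Email" matches "{EMAIL_PATTERN}" with threshold > 0.9 ]'
)


def matching_ratio(values, pattern=EMAIL_PATTERN):
    """Fraction of values that fully match the pattern; None never matches."""
    if not values:
        return 0.0
    matches = sum(1 for v in values if v is not None and re.fullmatch(pattern, v))
    return matches / len(values)


def rule_passes(values, threshold=0.9):
    """Mimic 'with threshold > 0.9': pass only when the ratio exceeds the threshold."""
    return matching_ratio(values) > threshold


mostly_valid = [f"user{i}@example.com" for i in range(19)] + [None]          # 19 of 20 valid
too_many_bad = [f"user{i}@example.com" for i in range(8)] + [None, "not-an-email"]  # 8 of 10 valid

print(DQDL_RULESET)
print(rule_passes(mostly_valid), rule_passes(too_many_bad))  # True False
```

Swapping the threshold to 0.1, as in Option B, would let the second data set pass, which is why that option does not protect the downstream processes.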
A company stores sales data in an Amazon RDS for MySQL database. The company needs to start a reporting process between 6:00 A.M. and 6:10 A.M. every Monday. The reporting process must generate a CSV file and store the file in an Amazon S3 bucket.
Which combination of steps will meet these requirements with the LEAST operational overhead? (Select TWO.)
- A . Create an Amazon EventBridge rule to run every Monday at 6:00 A.M.
- B . Create an Amazon EventBridge Scheduler to run every Monday at 6:00 A.M.
- C . Create and invoke an AWS Batch job that runs a script in an Amazon Elastic Container Service (Amazon ECS) container. Configure the script to generate the report and to save it to the S3 bucket.
- D . Create and invoke an AWS Glue ETL job to generate the report and to save it to the S3 bucket.
- E . Create and invoke an Amazon EMR Serverless job to generate the report and to save it to the S3 bucket.
B,D
Explanation:
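The Amazon EventBridge Scheduler offers a simple, serverless cron-based execution mechanism, and its flexible time window option lets a schedule start at any point within a defined window, which fits the 6:00 A.M. to 6:10 A.M. requirement. It can trigger an AWS Glue ETL job that extracts data from Amazon RDS, formats it as CSV, and writes it to Amazon S3, all without manual orchestration or servers.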
“For scheduled data extraction and transformation, use AWS Glue jobs triggered by EventBridge Scheduler for fully managed, low-maintenance workflows.”
Reference: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
Glue natively integrates with RDS and S3, avoiding the need to manage Batch or EMR infrastructure.
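A schedule of this kind could be described with the parameters below, which mirror the shape of an EventBridge Scheduler create_schedule request. This is a sketch with assumptions: the schedule name, job name, role ARN, and the UTC time zone are placeholders, and the target uses the Scheduler universal target ARN form for the Glue StartJobRun API.

```python
import json


def weekly_glue_schedule(job_name: str, role_arn: str) -> dict:
    """Parameters for a Monday 6:00 A.M. schedule with a 10-minute flexible window."""
    return {
        "Name": "weekly-sales-report",  # hypothetical schedule name
        # Fields: minutes hours day-of-month month day-of-week year
        "ScheduleExpression": "cron(0 6 ? * MON *)",
        "ScheduleExpressionTimezone": "UTC",  # assumption: the question gives no time zone
        # The job may start at any point from 6:00 to 6:10.
        "FlexibleTimeWindow": {"Mode": "FLEXIBLE", "MaximumWindowInMinutes": 10},
        "Target": {
            "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",
            "RoleArn": role_arn,
            "Input": json.dumps({"JobName": job_name}),
        },
    }


schedule = weekly_glue_schedule(
    job_name="sales-report-to-csv",  # hypothetical Glue job name
    role_arn="arn:aws:iam::111122223333:role/ExampleSchedulerRole",  # placeholder
)
print(json.dumps(schedule, indent=2))

# With boto3 installed and credentials configured, this could be submitted as:
# boto3.client("scheduler").create_schedule(**schedule)
```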
