Practice Free Amazon DEA-C01 Exam Online Questions
A company stores sensitive data in an Amazon Redshift table. The company needs to give specific users the ability to access the sensitive data. The company must not create duplication in the data. Customer support users must be able to see the last four characters of the sensitive data. Audit users must be able to see the full value of the sensitive data. No other users can have the ability to access the sensitive information.
Which solution will meet these requirements?
- A . Create a dynamic data masking policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the masking policy to the column that contains sensitive data.
- B . Enable metadata security on the Redshift cluster. Create IAM users and IAM roles for the customer support users and the audit users. Grant the IAM users and IAM roles permissions to view the metadata in the Redshift cluster.
- C . Create a row-level security policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the security policy to the table.
- D . Create an AWS Glue job to redact the sensitive data and to load the data into a new Redshift table.
A
Explanation:
Option A is correct because Amazon Redshift dynamic data masking is designed to hide, obfuscate, or partially reveal sensitive values at query time without duplicating the underlying data. AWS states that masking policies can be attached to one or more columns and can be applied differently to certain users or roles. AWS also provides examples where different roles see different versions of the same column value, which matches the requirement that customer support users see only the last four characters while audit users can see the full value.
Option B is incorrect because metadata security controls metadata visibility, not column-value masking.
Option C is incorrect because row-level security filters which rows are visible; it does not selectively mask part of a column value.
Option D would create a second table and therefore duplicates data, which the question forbids. Dynamic data masking is the AWS-native feature that satisfies role-based partial exposure of a single sensitive column with no data duplication.
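As an illustration of the role-based behavior described above, the following is a minimal Python sketch that only simulates what each role would see at query time. It is not Redshift code (in Redshift the rule would be defined with a masking policy and attached to the column), and the role names and the `mask_sensitive` helper are hypothetical.

```python
def mask_sensitive(value: str, role: str) -> str:
    """Return the view of `value` that a given role is allowed to see.

    Role names are hypothetical and stand in for Redshift roles.
    """
    if role == "audit":
        # Audit users see the full value.
        return value
    if role == "customer_support":
        # Customer support users see only the last four characters.
        return "X" * max(len(value) - 4, 0) + value[-4:]
    # Every other role sees a fully masked value.
    return "X" * len(value)


# One stored row, three different views of the same column value.
row = {"customer_id": 42, "card_number": "4111111111111111"}
for role in ("audit", "customer_support", "analyst"):
    print(role, mask_sensitive(row["card_number"], role))
```

The point of the sketch is that the stored value exists once and only the presentation differs per role, which is why no duplicate table is needed.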
A company is building an inventory management system and an inventory reordering system to automatically reorder products. Both systems use Amazon Kinesis Data Streams. The inventory management system uses the Amazon Kinesis Producer Library (KPL) to publish data to a stream. The inventory reordering system uses the Amazon Kinesis Client Library (KCL) to consume data from the stream. The company configures the stream to scale up and down as needed.
Before the company deploys the systems to production, the company discovers that the inventory reordering system received duplicated data.
Which factors could have caused the reordering system to receive duplicated data? (Select TWO.)
- A . The producer experienced network-related timeouts.
- B . The stream’s value for the Iterator Age Milliseconds metric was too high.
- C . There was a change in the number of shards, record processors, or both.
- D . The Aggregation Enabled configuration property was set to true.
- E . The max_records configuration property was set to a number that was too high.
A,C
Explanation:
Problem Analysis:
The company uses Kinesis Data Streams for both inventory management and reordering.
The Kinesis Producer Library (KPL) publishes data, and the Kinesis Client Library (KCL) consumes data.
Duplicate records were observed in the inventory reordering system.
Key Considerations:
Kinesis Data Streams provides at-least-once delivery, so consumers may receive duplicates under certain conditions.
Factors such as network timeouts, shard splits, or changes in record processors can cause duplication.
Solution Analysis:
Option A: Network-Related Timeouts
If the producer (KPL) experiences network timeouts, it retries data submission, potentially causing duplicates.
Option B: High Iterator Age Milliseconds
High iterator age suggests delays in processing but does not directly cause duplication.
Option C: Changes in Shards or Processors
Changes in the number of shards or record processors can lead to re-processing of records, causing duplication.
Option D: Aggregation Enabled Set to True
Aggregation Enabled controls the aggregation of multiple records into one, but it does not cause duplication.
Option E: High max_records Value
A high max_records value increases batch size but does not lead to duplication.
Final Recommendation:
Network-related timeouts and changes in shards or processors are the most likely causes of duplicate data in this scenario.
Reference: Amazon Kinesis Data Streams Best Practices
Kinesis Producer Library (KPL) Overview
Kinesis Client Library (KCL) Overview
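Because producer retries and changes in shards or record processors can both deliver the same record more than once, a common mitigation is to make the consumer idempotent. The following is a minimal sketch, assuming each record carries a unique identifier assigned by the producer. The `order_event_id` field and the `IdempotentProcessor` class are hypothetical, and the in-memory set stands in for a durable store.

```python
class IdempotentProcessor:
    """Skip records whose unique ID has already been handled."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: a real consumer would keep this in a durable store,
        # because an in-memory set is lost when a worker restarts.
        self._seen = set()

    def process(self, records):
        """Handle each new record once and return how many were handled."""
        handled = 0
        for record in records:
            key = record["order_event_id"]  # hypothetical unique ID field
            if key in self._seen:
                continue  # duplicate delivery, ignore it
            self._handler(record)
            self._seen.add(key)
            handled += 1
        return handled


reorders = []
processor = IdempotentProcessor(reorders.append)

# The second batch repeats event "e-2", as a retry or a reprocessing would.
processor.process([{"order_event_id": "e-1"}, {"order_event_id": "e-2"}])
processor.process([{"order_event_id": "e-2"}, {"order_event_id": "e-3"}])
print(len(reorders))
```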
A company has multiple applications that use datasets that are stored in an Amazon S3 bucket. The company has an ecommerce application that generates a dataset that contains personally identifiable information (PII). The company has an internal analytics application that does not require access to the PII.
To comply with regulations, the company must not share PII unnecessarily. A data engineer needs to implement a solution that will redact PII dynamically, based on the needs of each application that accesses the dataset.
Which solution will meet the requirements with the LEAST operational overhead?
- A . Create an S3 bucket policy to limit the access each application has. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
- B . Create an S3 Object Lambda endpoint. Use the S3 Object Lambda endpoint to read data from the S3 bucket. Implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data.
- C . Use AWS Glue to transform the data for each application. Create multiple copies of the dataset. Give each dataset copy the appropriate level of redaction for the needs of the application that accesses the copy.
- D . Create an API Gateway endpoint that has custom authorizers. Use the API Gateway endpoint to read data from the S3 bucket. Initiate a REST API call to dynamically redact PII based on the needs of each application that accesses the data.
B
Explanation:
Option B is the best solution to meet the requirements with the least operational overhead because S3 Object Lambda is a feature that allows you to add your own code to process data retrieved from S3 before returning it to an application. S3 Object Lambda works with S3 GET requests and can modify both the object metadata and the object data. By using S3 Object Lambda, you can implement redaction logic within an S3 Object Lambda function to dynamically redact PII based on the needs of each application that accesses the data. This way, you can avoid creating and maintaining multiple copies of the dataset with different levels of redaction.
Option A is not a good solution because it involves creating and managing multiple copies of the dataset with different levels of redaction for each application. This option adds complexity and storage cost to the data protection process and requires additional resources and configuration. Moreover, S3 bucket policies cannot enforce fine-grained data access control at the row and column level, so they are not sufficient to redact PII.
Option C is not a good solution because it involves using AWS Glue to transform the data for each application. AWS Glue is a fully managed service that can extract, transform, and load (ETL) data from various sources to various destinations, including S3. AWS Glue can also convert data to different formats, such as Parquet, which is a columnar storage format that is optimized for analytics. However, in this scenario, using AWS Glue to redact PII is not the best option because it requires creating and maintaining multiple copies of the dataset with different levels of redaction for each application. This option also adds extra time and cost to the data protection process and requires additional resources and configuration.
Option D is not a good solution because it involves creating and configuring an API Gateway endpoint that has custom authorizers. API Gateway is a service that allows you to create, publish, maintain, monitor, and secure APIs at any scale. API Gateway can also integrate with other AWS services, such as Lambda, to provide custom logic for processing requests. However, in this scenario, using API Gateway to redact PII is not the best option because it requires writing and maintaining custom code and configuration for the API endpoint, the custom authorizers, and the REST API call. This option also adds complexity and latency to the data protection process and requires additional resources and configuration.
Reference: AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3
Using Bucket Policies and User Policies – Amazon Simple Storage Service
AWS Glue Documentation
What is Amazon API Gateway? – Amazon API Gateway
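To make the S3 Object Lambda approach in Option B more concrete, here is a hedged sketch of what the redaction function could look like. It assumes the dataset is stored as JSON Lines and that each Object Lambda Access Point passes a JSON list of field names to redact in its configuration payload. Both are assumptions for illustration, as are the helper names.

```python
import json
import urllib.request


def redact_record(record, pii_fields):
    """Replace the value of every PII field with a fixed marker."""
    return {k: ("REDACTED" if k in pii_fields else v) for k, v in record.items()}


def redact_json_lines(body, pii_fields):
    """Redact each non-empty line of a JSON Lines document."""
    out = []
    for line in body.splitlines():
        if line.strip():
            out.append(json.dumps(redact_record(json.loads(line), pii_fields)))
    return "\n".join(out)


def handler(event, context):
    # Imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    ctx = event["getObjectContext"]
    # Assumption: the access point's payload holds a JSON list of field names,
    # e.g. '["email", "phone"]' for the analytics application.
    payload = event.get("configuration", {}).get("payload") or "[]"
    pii_fields = set(json.loads(payload))

    # Fetch the original object through the presigned URL supplied in the event.
    with urllib.request.urlopen(ctx["inputS3Url"]) as resp:
        original = resp.read().decode("utf-8")

    # Return the transformed object to the caller of GetObject.
    boto3.client("s3").write_get_object_response(
        Body=redact_json_lines(original, pii_fields).encode("utf-8"),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"status_code": 200}
```

With this layout, one stored copy of the dataset can serve several applications, each reading through an access point configured with its own redaction list.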
A company uses an organization in AWS Organizations to manage multiple AWS accounts. The company uses an enhanced fanout data stream in Amazon Kinesis Data Streams to receive streaming data from multiple producers. The data stream runs in Account A. The company wants to use an AWS Lambda function in Account B to process the data from the stream. The company creates a Lambda execution role in Account B that has permissions to access data from the stream in Account A.
What additional step must the company take to meet this requirement?
- A . Create a service control policy (SCP) to grant the data stream read access to the cross-account Lambda execution role. Attach the SCP to Account A.
- B . Add a resource-based policy to the data stream to allow read access for the cross-account Lambda execution role.
- C . Create a service control policy (SCP) to grant the data stream read access to the cross-account Lambda execution role. Attach the SCP to Account B.
- D . Add a resource-based policy to the cross-account Lambda function to grant the data stream read access to the function.
B
Explanation:
To allow cross-account access to a Kinesis Data Stream, you must add a resource-based policy to the Kinesis stream in Account A, explicitly granting the Lambda execution role in Account B the required permissions.
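SCPs (Options A and C) define the maximum permissions available in an account; they do not grant access.
Option D is incorrect because it places the policy on the Lambda function, but the Kinesis data stream is the resource being accessed and must allow the access.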
“You must add a resource-based policy to the Kinesis Data Stream in Account A to allow a Lambda function in Account B to consume from the stream.”
Reference: AWS Documentation – Cross-account Lambda access to Kinesis
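The resource-based policy from Option B can be sketched as a policy document built in Python. This is an illustrative sketch, not a vetted policy: the account IDs, stream name, and role name are placeholders, and the exact action list should be checked against the Kinesis documentation for the consumer type in use.

```python
import json


def build_stream_policy(stream_arn, reader_role_arn):
    """Build a policy that lets one cross-account role read from the stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountLambdaRead",
                "Effect": "Allow",
                "Principal": {"AWS": reader_role_arn},
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            }
        ],
    }


# Placeholder ARNs: Account A owns the stream, Account B owns the Lambda role.
policy = build_stream_policy(
    "arn:aws:kinesis:us-east-1:111111111111:stream/example-stream",
    "arn:aws:iam::222222222222:role/example-lambda-execution-role",
)
print(json.dumps(policy, indent=2))

# In Account A, the document could then be attached with boto3, for example:
#   boto3.client("kinesis").put_resource_policy(
#       ResourceARN=stream_arn, Policy=json.dumps(policy))
```

Because the scenario uses enhanced fan-out, the registered consumer may need a similar policy of its own; that detail is outside what the explanation above covers.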
An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
- A . AWS Lambda
- B . Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- C . AWS Step Functions
- D . AWS Glue
B
Explanation:
The ecommerce company wants to migrate its data pipelines into the AWS Cloud without managing servers, and the solution must orchestrate Python and Bash scripts without refactoring code. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is the most suitable solution for this scenario.
Option B: Amazon Managed Workflows for Apache Airflow (Amazon MWAA). MWAA is a managed orchestration service that runs Apache Airflow, in which workflows are defined as Directed Acyclic Graphs (DAGs). Airflow is commonly used for orchestrating complex data workflows and can run existing Python and Bash scripts as tasks, which makes it a good fit for migrating existing pipelines without refactoring. The company does not need to manage the underlying infrastructure.
Other options:
AWS Lambda (Option A) is more suited for event-driven workflows but would require breaking down the pipeline into individual Lambda functions, which may require refactoring.
AWS Step Functions (Option C) is good for orchestration but lacks native support for Python and Bash without using Lambda functions, and it may require code changes.
AWS Glue (Option D) is an ETL service primarily for data transformation and not suitable for orchestrating general scripts without modification.
Reference: Amazon Managed Workflows for Apache Airflow (MWAA) Documentation
A data engineer needs to build an enterprise data catalog based on the company’s Amazon S3 buckets and Amazon RDS databases. The data catalog must include storage format metadata for the data in the catalog.
Which solution will meet these requirements with the LEAST effort?
- A . Use an AWS Glue crawler to scan the S3 buckets and RDS databases and build a data catalog. Use data stewards to inspect the data and update the data catalog with the data format.
- B . Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog.
- C . Use Amazon Macie to build a data catalog and to identify sensitive data elements. Collect the data format information from Macie.
- D . Use scripts to scan data elements and to assign data classifications based on the format of the data.
B
Explanation:
To build an enterprise data catalog with metadata for storage formats, the easiest and most efficient solution is using an AWS Glue crawler. The Glue crawler can scan Amazon S3 buckets and Amazon RDS databases to automatically create a data catalog that includes metadata such as the schema and storage format (e.g., CSV, Parquet, etc.). By using AWS Glue crawler classifiers, you can configure the crawler to recognize the format of the data and store this information directly in the catalog.
Option B: Use an AWS Glue crawler to build a data catalog. Use AWS Glue crawler classifiers to recognize the format of data and store the format in the catalog. This option meets the requirements with the least effort because Glue crawlers automate the discovery and cataloging of data from multiple sources, including S3 and RDS, while recognizing various file formats via classifiers.
Other options (A, C, D) involve additional manual steps, like having data stewards inspect the data, or using services like Amazon Macie that focus more on sensitive data detection rather than format cataloging.
Reference: AWS Glue Crawler Documentation
AWS Glue Classifiers
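As a rough sketch of the crawler setup described above, the request for a crawler that covers both an S3 path and an RDS database (through a Glue JDBC connection) could be assembled as follows. All names, paths, and the role ARN are placeholders, and the helper function is hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_path,
                          jdbc_connection, jdbc_path, classifiers=()):
    """Assemble keyword arguments for a Glue crawler over S3 and JDBC targets."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": s3_path}],
            "JdbcTargets": [
                {"ConnectionName": jdbc_connection, "Path": jdbc_path}
            ],
        },
        # Names of custom classifiers, if any. Built-in classifiers are used
        # when no custom classifier recognizes the data.
        "Classifiers": list(classifiers),
    }


request = build_crawler_request(
    name="enterprise-catalog-crawler",
    role_arn="arn:aws:iam::111111111111:role/example-glue-crawler-role",
    database="enterprise_catalog",
    s3_path="s3://example-bucket/datasets/",
    jdbc_connection="example-rds-connection",
    jdbc_path="exampledb/%",
)
print(request["Targets"])

# With AWS credentials available, the crawler could be created and started with:
#   glue = boto3.client("glue")
#   glue.create_crawler(**request)
#   glue.start_crawler(Name=request["Name"])
```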
A car sales company maintains data about cars that are listed for sale in an area. The company receives data about new car listings from vendors who upload the data daily as compressed files into Amazon S3. The compressed files are up to 5 KB in size. The company wants to see the most up-to-date listings as soon as the data is uploaded to Amazon S3.
A data engineer must automate and orchestrate the data processing workflow of the listings to feed a dashboard. The data engineer must also provide the ability to perform one-time queries and analytical reporting. The query solution must be scalable.
Which solution will meet these requirements MOST cost-effectively?
- A . Use an Amazon EMR cluster to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Apache Hive for one-time queries and analytical reporting. Use Amazon OpenSearch Service to bulk ingest the data into compute optimized instances. Use OpenSearch Dashboards in OpenSearch Service for the dashboard.
- B . Use a provisioned Amazon EMR cluster to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Amazon Athena for one-time queries and analytical reporting. Use Amazon QuickSight for the dashboard.
- C . Use AWS Glue to process incoming data. Use AWS Step Functions to orchestrate workflows. Use Amazon Redshift Spectrum for one-time queries and analytical reporting. Use OpenSearch Dashboards in Amazon OpenSearch Service for the dashboard.
- D . Use AWS Glue to process incoming data. Use AWS Lambda and S3 Event Notifications to orchestrate workflows. Use Amazon Athena for one-time queries and analytical reporting. Use Amazon QuickSight for the dashboard.
D
Explanation:
For processing the incoming car listings in a cost-effective, scalable, and automated way, the ideal approach involves using AWS Glue for data processing, AWS Lambda with S3 Event Notifications for orchestration, Amazon Athena for one-time queries and analytical reporting, and Amazon QuickSight for visualization on the dashboard. Let’s break this down:
AWS Glue: This is a fully managed ETL (Extract, Transform, Load) service that automatically processes the incoming data files. Glue is serverless and supports diverse data sources, including Amazon S3 and Redshift.
AWS Lambda and S3 Event Notifications: Using Lambda and S3 Event Notifications allows near real-time triggering of processing workflows as soon as new data is uploaded into S3. This approach is event-driven, ensuring that the listings are processed as soon as they are uploaded, reducing the latency for data processing.
Amazon Athena: A serverless, pay-per-query service that allows interactive queries directly against data in S3 using standard SQL. It is ideal for the requirement of one-time queries and analytical reporting without the need for provisioning or managing servers.
Amazon QuickSight: A business intelligence tool that integrates with a wide range of AWS data sources, including Athena, and is used for creating interactive dashboards. It scales well and provides real-time insights for the car listings.
This solution (Option D) is the most cost-effective, because both Glue and Athena are serverless and priced based on usage, reducing costs when compared to provisioning EMR clusters in the other options. Moreover, using Lambda for orchestration is more cost-effective than AWS Step Functions due to its lightweight nature.
Reference: AWS Glue Documentation
Amazon Athena Documentation
Amazon QuickSight Documentation
S3 Event Notifications and Lambda
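The event-driven trigger in Option D can be sketched as a small Lambda handler that starts a Glue job for each uploaded object. This is a minimal sketch under assumptions: the job name and the `--source_path` argument are hypothetical, and the optional `glue_client` parameter exists only so the function can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "process-car-listings"  # hypothetical job name


def handler(event, context, glue_client=None):
    """Start one Glue job run per object in an S3 event notification."""
    if glue_client is None:
        import boto3  # imported lazily so a fake client can be injected

        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            # Hypothetical job argument naming the file to process.
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return run_ids
```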
A company runs an extract, transform, and load (ETL) job in AWS Glue. The job processes personally identifiable information (PII) data and writes logs to an Amazon CloudWatch Logs log group. A data engineer needs to mask PII data in the CloudWatch Logs log group.
Which solution will meet these requirements?
- A . Attach an AWS Glue security configuration to the ETL job.
- B . Configure a data protection policy. Attach the policy to the CloudWatch log group.
- C . Run an Amazon Macie sensitive data discovery job.
- D . Call AWS Glue sensitive data detection APIs in the ETL job.
B
Explanation:
Option B is the right approach because the requirement is to mask PII in the log destination (the CloudWatch Logs log group). The exam guide explicitly calls out security responsibilities that include “data encryption and masking” and also emphasizes enabling and preparing logs for audit and governance needs. A log-group-level masking mechanism is therefore the most direct control point to prevent sensitive values from being exposed to anyone who can view logs.
Option A (Glue security configuration) is primarily used to apply protections such as encryption settings for Glue jobs and related outputs; it does not inherently solve the problem of masking PII that has already been emitted into application logs.
Option C (Macie) is intended to discover and classify sensitive data, most commonly in Amazon S3, and to produce findings; it is not a log-masking control for CloudWatch Logs. The material reinforces Macie as a discovery/classification service for PII rather than a masking mechanism.
Option D adds custom code paths in the ETL job and still may miss PII that appears in framework/system logs; it also takes more operational effort than applying a centralized log policy.
Therefore, attaching a data protection policy directly to the CloudWatch Logs log group best meets the masking requirement.
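For illustration, a data protection policy document for a log group could be assembled as shown below. This is a sketch based on the general shape of CloudWatch Logs data protection policies (an audit statement plus a de-identify statement); the policy name and the choice of the `EmailAddress` identifier are assumptions, and the document should be validated against the CloudWatch Logs documentation before use.

```python
import json

# Prefix of AWS managed data identifier ARNs.
IDENTIFIER_PREFIX = "arn:aws:dataprotection::aws:data-identifier/"


def build_data_protection_policy(identifiers):
    """Build a policy that audits and masks the given managed identifiers."""
    arns = [IDENTIFIER_PREFIX + name for name in identifiers]
    return {
        "Name": "mask-pii-in-glue-logs",  # hypothetical policy name
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "audit",
                "DataIdentifier": arns,
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {
                "Sid": "redact",
                "DataIdentifier": arns,
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }


policy = build_data_protection_policy(["EmailAddress"])
print(json.dumps(policy, indent=2))

# With AWS credentials available, the policy could be attached with:
#   boto3.client("logs").put_data_protection_policy(
#       logGroupIdentifier="<log group name or ARN>",
#       policyDocument=json.dumps(policy))
```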
A global ecommerce company processes customer transactions, inventory updates, and user activity logs across multiple AWS services. The company needs a scalable, fully managed, and event-driven orchestration solution to coordinate complex extract, transform, and load (ETL) workflows. The solution must use AWS Glue and Amazon EMR to process data. The data will be stored in Amazon Redshift and Amazon S3. The solution must support dependency management, automated retries, and data pipeline monitoring.
Which solution will meet these requirements?
- A . Use AWS Step Functions to define an express workflow that invokes the data transformation and loading tasks across Amazon EMR and AWS Glue.
- B . Create AWS Lambda functions for each step of the workflow. Configure Amazon EventBridge to invoke AWS Glue jobs. Configure the Lambda functions to process and move data through the pipeline.
- C . Use Apache Airflow on Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to create Directed Acyclic Graphs (DAGs) to manage ETL workflows.
- D . Create an AWS Lambda function that runs each step of the workflow. Create an Amazon EventBridge scheduled rule to invoke the function every day.
C
Explanation:
Option C is the best answer because Amazon MWAA is a fully managed Apache Airflow service built specifically for workflow orchestration. Airflow uses Directed Acyclic Graphs (DAGs) to model task dependencies, which directly satisfies the requirement for dependency management. Amazon MWAA also supports orchestration across AWS analytics services such as Amazon S3, Amazon Redshift, and Amazon EMR, making it a natural fit for complex ETL pipelines. AWS documentation describes workflow definitions, tasks, and Airflow-based orchestration as core MWAA concepts.
This choice is stronger than A because the question specifically asks for orchestration of complex ETL workflows with dependency management, retries, and monitoring across multiple data-processing systems. While AWS Step Functions is an orchestration service, the option specifies an express workflow, and AWS states that Express Workflows are designed for high-volume, short-duration runs and can run for only up to five minutes, which is often not the best fit for longer-running ETL jobs on EMR and Glue.
The uploaded study material also points to orchestration services for dependent ETL jobs and identifies both Step Functions and workflow orchestration concepts as core exam knowledge, but for this exact multi-step ETL orchestration pattern, MWAA with DAGs is the most precise match.
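The two orchestration features the question stresses, dependency management and automated retries, can be illustrated with a toy model. The sketch below uses only the Python standard library and is not Airflow or MWAA code; the task names and the `run_pipeline` helper are hypothetical stand-ins for tasks that a real DAG would define.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks, max_retries=2):
    """Run tasks in dependency order, retrying each failed task a few times.

    `dependencies` maps a task name to the set of tasks it depends on.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted, fail the pipeline
    return completed


attempts = {"glue_transform": 0}


def flaky_glue_transform():
    # Fails on the first attempt to show the retry path.
    attempts["glue_transform"] += 1
    if attempts["glue_transform"] == 1:
        raise RuntimeError("transient failure")


dependencies = {
    "glue_transform": {"extract"},
    "emr_aggregate": {"glue_transform"},
    "load_redshift": {"emr_aggregate"},
}
tasks = {
    "extract": lambda: None,
    "glue_transform": flaky_glue_transform,
    "emr_aggregate": lambda: None,
    "load_redshift": lambda: None,
}
order = run_pipeline(dependencies, tasks)
print(order)
```

In Airflow the same ideas appear as task dependencies inside a DAG and per-task retry settings, with the scheduler and monitoring provided by the managed service.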
