Practice Free Amazon DEA-C01 Exam Online Questions
A company has an Amazon Redshift data warehouse that users access by using a variety of IAM roles.
More than 100 users access the data warehouse every day.
The company wants to control user access to the objects based on each user’s job role, permissions, and how sensitive the data is.
Which solution will meet these requirements?
- A . Use the role-based access control (RBAC) feature of Amazon Redshift.
- B . Use the row-level security (RLS) feature of Amazon Redshift.
- C . Use the column-level security (CLS) feature of Amazon Redshift.
- D . Use dynamic data masking policies in Amazon Redshift.
A
Explanation:
Amazon Redshift supports Role-Based Access Control (RBAC) to manage access to database objects. RBAC allows administrators to create roles for job functions and assign privileges at the schema, table, or column level based on data sensitivity and user roles.
“RBAC in Amazon Redshift helps manage permissions more efficiently at scale by assigning users to roles that reflect their job function. It simplifies user management and secures access based on job role and data sensitivity.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
RBAC is preferred over RLS or CLS alone because it offers a more comprehensive and scalable solution across multiple users and permissions.
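As a rough illustration of the RBAC workflow described above, the following sketch generates the Redshift SQL for creating a role, granting it privileges (optionally limited to specific columns), and assigning the role to users. The role, table, column, and user names are hypothetical, and the helper does no identifier quoting, so treat it as a teaching aid rather than production code.

```python
def rbac_statements(role, grants, members):
    """Build Redshift SQL statements for one role.

    grants:  list of (privilege, table_name, columns) tuples; columns may be
             None for a table-level grant or a list for a column-level grant.
    members: user names that should receive the role.
    """
    stmts = [f"CREATE ROLE {role};"]
    for privilege, table, columns in grants:
        # A column list after the privilege restricts the grant to those columns.
        col_list = f" ({', '.join(columns)})" if columns else ""
        stmts.append(f"GRANT {privilege}{col_list} ON TABLE {table} TO ROLE {role};")
    for user in members:
        stmts.append(f"GRANT ROLE {role} TO {user};")
    return stmts


# Hypothetical example: analysts read all of sales.orders but only two
# non-sensitive columns of hr.employees.
for stmt in rbac_statements(
    "analyst",
    [
        ("SELECT", "sales.orders", None),
        ("SELECT", "hr.employees", ["employee_id", "department"]),
    ],
    ["alice"],
):
    print(stmt)
```

Grouping privileges under a role means that onboarding a new user is a single GRANT ROLE statement instead of one grant per object.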
A company needs to optimize storage for an Amazon S3 bucket. Objects older than 1 year must be accessible within 5 hours. All versions of the objects must be retained and immutable for 7 years. All versions of the objects must use the write-once-read-many (WORM) model.
Which solution will meet these requirements?
- A . Configure S3 Versioning on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Flexible Retrieval. Configure the policy to delete objects that are older than 7 years.
- B . Configure S3 Object Lock on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Deep Archive. Configure the policy to delete objects that are older than 7 years.
- C . Configure S3 Object Lock on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Flexible Retrieval. Configure the policy to delete objects that are older than 7 years.
- D . Configure S3 Versioning on the bucket and use the S3 Intelligent-Tiering storage class. Configure a lifecycle policy for the bucket to transition objects that are older than 1 year to S3 Glacier Deep Archive. Configure the policy to delete objects that are older than 7 years.
C
Explanation:
The requirement to retain immutable object versions using a WORM model mandates the use of S3 Object Lock, not just S3 Versioning. Object Lock enforces immutability by preventing deletion or modification of objects for a specified retention period, which is required for compliance scenarios such as regulatory data retention.
To meet the requirement that data be retrievable within 5 hours, the appropriate archival storage class is S3 Glacier Flexible Retrieval: expedited retrievals typically complete in 1 to 5 minutes and standard retrievals in 3 to 5 hours. S3 Glacier Deep Archive does not meet this requirement because its standard retrievals can take up to 12 hours and its bulk retrievals up to 48 hours.
Using S3 Intelligent-Tiering for active data optimizes storage costs automatically before transitioning older objects. A lifecycle policy can then transition objects older than 1 year to Glacier Flexible Retrieval and delete them after the 7-year retention period expires.
Options that rely only on S3 Versioning do not provide immutability guarantees. Options using Glacier Deep Archive fail the access-time requirement.
Therefore, Option C is the only solution that satisfies WORM compliance, retrieval-time constraints, and cost optimization.
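The elimination reasoning above can be written out as a small check. This is only a sketch: the option table restates the four answer choices, and the retrieval figures are the commonly documented upper bounds for standard retrievals, included here as assumptions rather than guarantees.

```python
# Upper bound, in hours, for a standard retrieval from each archive class.
# These figures are assumptions of this sketch (commonly documented values).
STANDARD_RETRIEVAL_HOURS = {
    "GLACIER_FLEXIBLE_RETRIEVAL": 5,
    "GLACIER_DEEP_ARCHIVE": 12,
}

# The four answer choices, reduced to the two properties that differ.
OPTIONS = {
    "A": {"protection": "versioning", "archive": "GLACIER_FLEXIBLE_RETRIEVAL"},
    "B": {"protection": "object_lock", "archive": "GLACIER_DEEP_ARCHIVE"},
    "C": {"protection": "object_lock", "archive": "GLACIER_FLEXIBLE_RETRIEVAL"},
    "D": {"protection": "versioning", "archive": "GLACIER_DEEP_ARCHIVE"},
}


def qualifying_options(max_retrieval_hours=5):
    """Return the options that are both WORM-protected and fast enough."""
    result = []
    for label, option in sorted(OPTIONS.items()):
        worm = option["protection"] == "object_lock"
        fast_enough = STANDARD_RETRIEVAL_HOURS[option["archive"]] <= max_retrieval_hours
        if worm and fast_enough:
            result.append(label)
    return result


print(qualifying_options())  # ['C']
```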
A company stores server logs in an Amazon S3 bucket. The company needs to keep the logs for 1 year. The logs are not required after 1 year.
A data engineer needs a solution to automatically delete logs that are older than 1 year.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Define an S3 Lifecycle configuration to delete the logs after 1 year.
- B . Create an AWS Lambda function to delete the logs after 1 year.
- C . Schedule a cron job on an Amazon EC2 instance to delete the logs after 1 year.
- D . Configure an AWS Step Functions state machine to delete the logs after 1 year.
A
Explanation:
An S3 Lifecycle configuration is the built-in way to expire objects in Amazon S3. A rule with an expiration action of 365 days removes each log automatically once it is older than 1 year, with no code to write and no compute resources to run or monitor.
Options B, C, and D can all delete the logs, but each one adds something to build and operate: a Lambda function and its schedule, an EC2 instance and its cron job, or a Step Functions state machine. Because the requirement is the LEAST operational overhead, Option A is the correct answer.
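For the log-retention question above, an S3 Lifecycle expiration rule might look like the following sketch. It only builds the configuration dictionary; the `logs/` prefix and the rule ID are hypothetical. The dictionary follows the shape that the boto3 S3 client accepts as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.

```python
import json


def log_expiration_lifecycle(prefix="", days=365):
    """Build a lifecycle configuration that expires objects after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-logs-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Expiration": {"Days": days},
            }
        ]
    }


config = log_expiration_lifecycle(prefix="logs/")
print(json.dumps(config, indent=2))
```

Once a rule like this is attached to the bucket, S3 evaluates it on its own schedule, which is why no Lambda function, cron job, or state machine is needed.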
A data engineer needs to make tabular data available in an Amazon S3-based data lake. Users must be able to query the data by using SQL queries in Amazon Redshift, Amazon Athena, and Amazon EMR. The data is updated daily. The data engineer must ensure that updates and deletions are reflected in the data lake.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Store the data in S3 Standard. Configure Apache Hudi with merge-on-read in Amazon EMR. Use Apache Spark SQL in Amazon EMR to perform the daily updates and deletions. Use Amazon EMR to schedule compaction jobs. Use AWS Glue to create a data catalog of Hudi tables that are stored in Amazon S3.
- B . Create S3 tables for the tabular data. Use AWS Glue and an S3 tables catalog for Apache Iceberg JAR to perform the daily updates and deletions. Configure a compaction size target. Set up snapshot management and unreferenced file removal for the S3 tables bucket.
- C . Load the data into an Amazon Redshift cluster. Use SQL to perform the daily updates and deletions. Upload the data to an Amazon S3 bucket in Apache Parquet format to create the data lake.
- D . Load the data into an Amazon EMR cluster. Use Apache Spark to perform the daily updates and deletions. Upload the data into an Amazon S3 bucket in Apache Parquet format to create the data lake.
B
Explanation:
Apache Iceberg is a table format designed for large-scale data lakes that supports ACID transactions, schema evolution, time travel, and row-level updates and deletes. Using S3 Tables with Apache Iceberg provides a fully managed experience that integrates natively with Amazon Athena, Amazon Redshift, and Amazon EMR.
By using AWS Glue with the S3 Tables catalog for Apache Iceberg, the data engineer can perform daily updates and deletions without managing Spark clusters or scheduling compaction and metadata cleanup by hand. S3 Tables runs compaction, snapshot management, and unreferenced file removal as managed maintenance operations, which significantly reduces operational overhead.
Apache Hudi requires Amazon EMR clusters, Spark jobs, and manual compaction orchestration, increasing complexity. The Parquet-only approaches in options C and D do not support updates or deletes efficiently and would require full rewrites of datasets, which is not scalable.
Therefore, using S3 Tables with Apache Iceberg provides the most efficient, scalable, and low-maintenance solution that satisfies all query and update requirements.
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded.
A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB.
How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
- A . Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events.
- B . Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function.
- C . Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function.
- D . Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
B
Explanation:
The Amazon Redshift Data API enables you to interact with your Amazon Redshift data warehouse in an easy and secure way. You can use the Data API to run SQL commands, such as loading data into tables, without requiring a persistent connection to the cluster. The Data API also integrates with Amazon EventBridge, which allows you to monitor the execution status of your SQL commands and trigger actions based on events. By using the Data API to publish an event to EventBridge, the data engineer can invoke the Lambda function that writes the load statuses to the DynamoDB table. This solution is scalable, reliable, and cost-effective. The other options are either not possible or not optimal. You cannot use a second Lambda function to invoke the first Lambda function based on CloudWatch or CloudTrail events, as these services do not capture the load status of Redshift tables. You can use the Data API to publish a message to an SQS queue, but this would require additional configuration and polling logic to invoke the Lambda function from the queue. This would also introduce additional latency and cost.
Reference: Using the Amazon Redshift Data API
Using Amazon EventBridge with Amazon Redshift
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
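A Lambda function invoked by the EventBridge rule could turn the incoming event into a DynamoDB item roughly as sketched below. The `detail` field names (`statementName`, `statementId`, `state`) are assumptions about the event payload and should be checked against the Redshift Data API documentation, and using the statement name to carry the table name is a hypothetical convention. The DynamoDB write is injected as a callable so the sketch runs without AWS access.

```python
from datetime import datetime, timezone


def build_status_item(event, now=None):
    """Map an EventBridge event (assumed shape) to a load-status record."""
    detail = event.get("detail", {})
    now = now or datetime.now(timezone.utc)
    return {
        # Assumption: the loader names each statement after the table it loads.
        "table_name": detail.get("statementName", "unknown"),
        "statement_id": detail.get("statementId", "unknown"),
        "load_status": detail.get("state", "UNKNOWN"),
        "recorded_at": now.isoformat(),
    }


def lambda_handler(event, context, put_item=None):
    """Entry point; `put_item` is a hypothetical hook for the DynamoDB write."""
    item = build_status_item(event)
    if put_item is not None:
        put_item(item)  # e.g. a wrapper around a DynamoDB table's put_item call
    return item


# Local dry run with a hand-written sample event.
sample_event = {
    "source": "aws.redshift-data",
    "detail": {"statementName": "daily_sales", "statementId": "abc-123", "state": "FINISHED"},
}
written = []
lambda_handler(sample_event, None, put_item=written.append)
print(written[0]["table_name"], written[0]["load_status"])
```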
A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.
Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?
- A . Set up an AWS DMS replication instance in Account_B in eu-west-1.
- B . Set up an AWS DMS replication instance in Account_B in eu-east-1.
- C . Set up an AWS DMS replication instance in a new AWS account in eu-west-1.
- D . Set up an AWS DMS replication instance in Account_A in eu-east-1.
A
Explanation:
To migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region (Account_A) to an Amazon Redshift cluster in the eu-west-1 Region (Account_B), AWS DMS needs a replication instance located in the target region (in this case, eu-west-1) to facilitate the data transfer between regions.
Option A: Set up an AWS DMS replication instance in Account_B in eu-west-1. Placing the DMS replication instance in the target account and region (Account_B in eu-west-1) is the most efficient solution. The replication instance can connect to the source RDS PostgreSQL in eu-east-1 and migrate the data to the Redshift cluster in eu-west-1. This setup ensures data is replicated across AWS accounts and regions.
Options B, C, and D place the replication instance in either the wrong account or region, which increases complexity without adding any benefit.
Reference: AWS Database Migration Service (DMS) Documentation
Cross-Region and Cross-Account Replication
Two developers are working on separate application releases. The developers have created feature branches named Branch A and Branch B by using a GitHub repository’s master branch as the source.
The developer for Branch A deployed code to the production system. The code for Branch B will merge into a master branch in the following week’s scheduled application release.
Which command should the developer for Branch B run before the developer raises a pull request to the master branch?
- A . git diff branchB master
  git commit -m <message>
- B . git pull master
- C . git rebase master
- D . git fetch -b master
C
Explanation:
To ensure that Branch B is up to date with the latest changes in the master branch before submitting a pull request, the correct approach is to perform a git rebase. This command rewrites the commit history so that Branch B will be based on the latest changes in the master branch.
git rebase master:
This command moves the commits of Branch B to be based on top of the latest state of the master branch. It allows the developer to resolve any conflicts and create a clean history.
Reference: Git Rebase Documentation
Alternatives Considered:
A (git diff): This will only show differences between Branch B and master but won’t resolve conflicts or bring Branch B up to date.
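B (git pull master): Git reads the first argument of git pull as a remote name, so this command tries to pull from a remote called master rather than updating Branch B from the master branch. Even in its usual form, git pull merges by default and does not give the same linear history as a rebase.
D (git fetch -b): git fetch has no -b option, so this command fails. A valid git fetch would only download commits and would not change Branch B.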
Reference: Git Rebase Best Practices
A company needs to store semi-structured transactional data in a serverless database.
The application writes data infrequently but reads it frequently, with millisecond retrieval required.
- A . Store the data in an Amazon S3 Standard bucket. Enable S3 Transfer Acceleration.
- B . Store the data in an Amazon S3 Apache Iceberg table. Enable S3 Transfer Acceleration.
- C . Store the data in an Amazon RDS for MySQL cluster. Configure RDS Optimized Reads.
- D . Store the data in an Amazon DynamoDB table. Configure a DynamoDB Accelerator (DAX) cache.
D
Explanation:
Amazon DynamoDB is a serverless, low-latency, NoSQL database ideal for semi-structured data.
Adding DynamoDB Accelerator (DAX) provides microsecond response times for read-heavy workloads.
“For applications requiring millisecond or sub-millisecond reads with serverless operation, use DynamoDB with DAX caching.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
A data engineer is using an AWS Glue ETL job to remove outdated customer records from a table that contains customer account information.
The data engineer is using the following SQL command:
MERGE INTO accounts t USING monthly_accounts_update s
ON t.customer = s.customer
WHEN MATCHED THEN DELETE
What will happen when the data engineer runs the SQL command?
- A . All customer records that exist in both the customer accounts table and the monthly_accounts_update table will be deleted from the accounts table.
- B . Only customer records that are present in both tables will be retained in the customer accounts table.
- C . The monthly_accounts_update table will be deleted.
- D . No records will be deleted because the command syntax is not valid in AWS Glue.
A
Explanation:
In AWS Glue’s SQL implementation (Spark SQL-compatible), the MERGE INTO statement supports conditional actions.
The clause WHEN MATCHED THEN DELETE deletes matching records from the target table (accounts) where the join condition is true.
“A MERGE INTO statement can perform updates, inserts, or deletes based on the match condition between source and target tables.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
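The effect of WHEN MATCHED THEN DELETE can be tried locally with a small simulation. SQLite has no MERGE statement, so this sketch swaps in an equivalent DELETE with a subquery; the sample rows are made up. It shows that only target rows with a matching key are removed and that the source table is left untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (customer TEXT, balance REAL);
    CREATE TABLE monthly_accounts_update (customer TEXT);
    INSERT INTO accounts VALUES ('alice', 10.0), ('bob', 20.0), ('carol', 30.0);
    INSERT INTO monthly_accounts_update VALUES ('bob'), ('dave');
    """
)

# Stand-in for: MERGE INTO accounts t USING monthly_accounts_update s
#               ON t.customer = s.customer WHEN MATCHED THEN DELETE
conn.execute(
    "DELETE FROM accounts "
    "WHERE customer IN (SELECT customer FROM monthly_accounts_update)"
)

remaining = [row[0] for row in conn.execute("SELECT customer FROM accounts ORDER BY customer")]
source_rows = conn.execute("SELECT COUNT(*) FROM monthly_accounts_update").fetchone()[0]

print(remaining)    # ['alice', 'carol']
print(source_rows)  # 2
```

The unmatched source row ('dave') has no effect because the statement defines no WHEN NOT MATCHED action.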
A company stores a large dataset in an Amazon S3 bucket. A data engineer frequently runs complex queries on the dataset by using Amazon Athena. The data engineer needs to optimize query performance and optimize costs for queries that are run multiple times with the same parameters.
Which solution will meet these requirements?
- A . Convert the dataset to JSON format before running Athena queries.
- B . Use Amazon EMR to pre-process the data before running Athena queries.
- C . Configure query result reuse settings in the Athena workgroup.
- D . Use Amazon Redshift Spectrum to query the data in Amazon S3.
C
Explanation:
Option C is correct because Amazon Athena query result reuse is specifically designed for repeated queries that use the same SQL text and parameters. AWS documentation states that when you rerun a query in Athena, you can choose to reuse the last stored query result, which can increase performance and reduce costs by lowering the number of bytes scanned. AWS also notes that you can configure a maximum age for reused results and that this feature is controlled through Athena settings, including the workgroup configuration. This directly matches the requirement to optimize both performance and cost for queries that are run repeatedly with the same parameters.
Option A is incorrect because JSON is generally less efficient for Athena analytics than optimized columnar formats.
Option B adds operational overhead and does not directly solve the specific repeated-query optimization requirement.
Option D changes the query engine rather than using Athena’s built-in optimization for repeated identical queries. The official AWS feature purpose lines up exactly with this use case, so query result reuse in the Athena workgroup is the best answer.
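When queries are submitted through the API rather than the console, result reuse can be requested per query. The sketch below only builds the request arguments; the dictionary follows the shape of the boto3 Athena client's `start_query_execution` parameters. The SQL text and workgroup name are hypothetical, and the 7-day upper limit on the maximum age is an assumption to verify against the current Athena documentation.

```python
MAX_REUSE_AGE_MINUTES = 10080  # assumed service limit (7 days)


def reuse_query_request(sql, workgroup, max_age_minutes=60):
    """Build start_query_execution arguments with result reuse enabled."""
    if not 0 <= max_age_minutes <= MAX_REUSE_AGE_MINUTES:
        raise ValueError("max_age_minutes is outside the assumed allowed range")
    return {
        "QueryString": sql,
        "WorkGroup": workgroup,
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                # Results older than this are ignored and the query runs again.
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


request = reuse_query_request(
    "SELECT region, SUM(amount) FROM sales GROUP BY region",  # hypothetical query
    workgroup="analytics",                                     # hypothetical workgroup
    max_age_minutes=120,
)
print(request["ResultReuseConfiguration"])
```

A reused result scans no data, which is where the cost saving for repeated identical queries comes from.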
