Practice Free Amazon DEA-C01 Exam Online Questions
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA,
Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change.
A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
- B . Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark.
- C . Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket.
- D . Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.
B
Explanation:
AWS Glue is a fully managed service that provides a serverless data integration platform. It can automatically discover and categorize data from various sources, including SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. It can also infer the schema of the data and store it in the AWS Glue Data Catalog, which is a central metadata repository. AWS Glue can then use the schema information to generate and run Apache Spark code to extract, transform, and load the data into an Amazon S3 bucket. AWS Glue can also monitor and optimize the performance and cost of the data pipeline, and handle any schema changes that may occur in the source data. AWS Glue can meet the SLA of loading the data into the S3 bucket within 15 minutes of data creation, as it can trigger the data pipeline based on events, schedules, or on-demand. AWS Glue has the least operational overhead among the options, as it does not require provisioning, configuring, or managing any servers or clusters. It also handles scaling, patching, and security automatically.
Reference: AWS Glue
AWS Glue Data Catalog
AWS Glue Developer Guide
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
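For illustration, a minimal AWS Glue PySpark job sketch of this pattern follows; the database, table, and bucket names are hypothetical placeholders, and the job assumes a crawler has already registered the source in the AWS Glue Data Catalog.

```python
# Minimal AWS Glue PySpark job sketch: read a cataloged source and write Parquet to S3.
# The database, table, and bucket names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The schema comes from the Glue Data Catalog (populated by crawlers), so the job
# does not need code changes when a source schema is discovered or evolves.
source = glue_context.create_dynamic_frame.from_catalog(
    database="example_db",        # placeholder
    table_name="example_orders",  # placeholder
)

# Land the data in the target S3 bucket as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-target-bucket/orders/"},  # placeholder
    format="parquet",
)

job.commit()
```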
A data engineer needs to create an empty copy of an existing table in Amazon Athena to perform data processing tasks. The existing table in Athena contains 1,000 rows.
Which query will meet this requirement?
- A . CREATE TABLE new_table LIKE old_table;
- B . CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA;
- C . CREATE TABLE new_table AS SELECT * FROM old_table;
- D . CREATE TABLE new_table AS SELECT * FROM old_table WHERE 1=1;
B
Explanation:
In Amazon Athena, you can use CREATE TABLE AS SELECT with WITH NO DATA to create an empty copy of an existing table’s schema:
“The query CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA; creates a new table with the same schema but without copying over the data.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
This is the most efficient way to create an empty copy of the existing table.
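For illustration, a minimal boto3 sketch that submits the WITH NO DATA CTAS statement through the Athena API follows; the database name and query result location are hypothetical placeholders.

```python
# Sketch: submit the CTAS from option B through the Athena API with boto3.
# The database and query result location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="CREATE TABLE new_table AS SELECT * FROM old_table WITH NO DATA",
    QueryExecutionContext={"Database": "example_db"},                        # placeholder
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder
)
print(response["QueryExecutionId"])
```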
A company stores sensitive transaction data in an Amazon S3 bucket. A data engineer must implement controls to prevent accidental deletions.
Which solution will meet this requirement?
- A . Enable versioning on the S3 bucket and configure MFA delete.
- B . Configure an S3 bucket policy rule that denies the creation of S3 delete markers.
- C . Create an S3 Lifecycle rule that moves deleted files to S3 Glacier Deep Archive.
- D . Set up AWS Config remediation actions to prevent users from deleting S3 objects.
A
Explanation:
Versioning with MFA Delete protects against accidental or malicious deletions by requiring multi-factor authentication to permanently remove objects or versions.
“To protect data from accidental deletion, enable S3 Versioning and MFA Delete, which requires MFA for object deletions and prevents unintentional loss.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
This is the AWS best practice for securing critical S3 data.
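For illustration, a minimal boto3 sketch of option A follows; the bucket name, MFA device ARN, and token code are hypothetical placeholders, and MFA Delete can only be enabled with the bucket owner's (root) credentials and a configured MFA device.

```python
# Sketch of option A: enable S3 Versioning together with MFA Delete.
# The bucket name, MFA device ARN, and token code are placeholders; the call must be
# made with the bucket owner's (root) credentials.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="example-transactions-bucket",  # placeholder
    MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",  # "device-arn token-code"
    VersioningConfiguration={
        "Status": "Enabled",
        "MFADelete": "Enabled",
    },
)
```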
A company runs multiple applications on AWS. The company configured each application to output logs. The company wants to query and visualize the application logs in near real time.
Which solution will meet these requirements?
- A . Configure the applications to output logs to Amazon CloudWatch Logs log groups. Create an Amazon S3 bucket. Create an AWS Lambda function that runs on a schedule to export the required log groups to the S3 bucket. Use Amazon Athena to query the log data in the S3 bucket.
- B . Create an Amazon OpenSearch Service domain. Configure the applications to output logs to Amazon CloudWatch Logs log groups. Create an OpenSearch Service subscription filter for each log group to stream the data to OpenSearch. Create the required queries and dashboards in OpenSearch Service to analyze and visualize the data.
- C . Configure the applications to output logs to Amazon CloudWatch Logs log groups. Use CloudWatch log anomaly detection to query and visualize the log data.
- D . Update the application code to send the log data to Amazon QuickSight by using Super-fast, Parallel, In-memory Calculation Engine (SPICE). Create the required analyses and dashboards in QuickSight.
B
Explanation:
The optimal solution for near-real-time querying and visualization of logs is to integrate Amazon CloudWatch Logs with Amazon OpenSearch Service using subscription filters, which stream the logs directly into OpenSearch for querying and dashboarding:
“Use OpenSearch Service with CloudWatch Logs and create a subscription filter to stream log data in near real time into OpenSearch. Then use OpenSearch dashboards for visualization.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
This approach offers low latency and avoids batch exports, unlike the scheduled Athena + S3 pattern.
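For illustration, a minimal boto3 sketch of attaching a subscription filter to a log group follows; the log group name and destination ARN are hypothetical placeholders. When the console creates an OpenSearch subscription filter it provisions a forwarding Lambda function, which is the kind of destination assumed here.

```python
# Sketch of option B: attach a subscription filter that streams a log group toward
# OpenSearch. The log group name and destination ARN are placeholders; the destination
# is the forwarding Lambda function that the OpenSearch integration provisions.
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/example/app/logs",   # placeholder log group
    filterName="stream-to-opensearch",
    filterPattern="",                   # empty pattern forwards every log event
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:LogsToOpenSearch",  # placeholder
)
```

The forwarding Lambda function must also grant CloudWatch Logs permission to invoke it, which the console-based setup configures automatically.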
A mobile gaming company wants to capture data from its gaming app. The company wants to make the data available to three internal consumers of the data. The data records are approximately 20 KB in size.
The company wants to achieve optimal throughput from each device that runs the gaming app. Additionally, the company wants to develop an application to process data streams. The stream-processing application must have dedicated throughput for each internal consumer.
Which solution will meet these requirements?
- A . Configure the mobile app to call the PutRecords API operation to send data to Amazon Kinesis Data Streams. Use the enhanced fan-out feature with a stream for each internal consumer.
- B . Configure the mobile app to call the PutRecordBatch API operation to send data to Amazon Data Firehose. Submit an AWS Support case to turn on dedicated throughput for the company’s AWS account. Allow each internal consumer to access the stream.
- C . Configure the mobile app to use the Amazon Kinesis Producer Library (KPL) to send data to Amazon Data Firehose. Use the enhanced fan-out feature with a stream for each internal consumer.
- D . Configure the mobile app to call the PutRecords API operation to send data to Amazon Kinesis Data Streams. Host the stream-processing application for each internal consumer on Amazon EC2 instances. Configure auto scaling for the EC2 instances.
A
Explanation:
Problem Analysis:
Input Requirements: Gaming app generates approximately 20 KB data records, which must be ingested and made available to three internal consumers with dedicated throughput.
Key Requirements:
High throughput for ingestion from each device.
Dedicated processing bandwidth for each consumer.
Key Considerations:
Amazon Kinesis Data Streams supports high-throughput ingestion with PutRecords API for batch writes.
The Enhanced Fan-Out feature provides dedicated throughput to each consumer, avoiding bandwidth contention.
This solution avoids bottlenecks and ensures optimal throughput for the gaming application and consumers.
Solution Analysis:
Option A: Kinesis Data Streams + Enhanced Fan-Out
PutRecords API is designed for batch writes, improving ingestion performance.
Enhanced Fan-Out allows each consumer to process the stream independently with dedicated throughput.
Option B: Data Firehose + Dedicated Throughput Request
Firehose is not designed for real-time stream processing or fan-out. It delivers data to destinations like S3, Redshift, or OpenSearch, not multiple independent consumers.
Option C: Data Firehose + Enhanced Fan-Out
Firehose does not support enhanced fan-out. This option is invalid.
Option D: Kinesis Data Streams + EC2 Instances
Hosting stream-processing applications on EC2 increases operational overhead compared to native enhanced fan-out.
Final Recommendation:
Use Kinesis Data Streams with Enhanced Fan-Out for high-throughput ingestion and dedicated consumer bandwidth.
Reference: Kinesis Data Streams Enhanced Fan-Out
PutRecords API for Batch Writes
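For illustration, a minimal boto3 sketch of option A follows; the stream name, stream ARN, and consumer name are hypothetical placeholders.

```python
# Sketch of option A: batch writes with PutRecords on the producer side, plus an
# enhanced fan-out consumer registration for dedicated read throughput.
# The stream name, stream ARN, and consumer name are placeholders.
import json

import boto3

kinesis = boto3.client("kinesis")

# Producer: PutRecords batches up to 500 records per call for higher throughput.
kinesis.put_records(
    StreamName="example-game-events",  # placeholder
    Records=[
        {
            "Data": json.dumps({"player_id": "p-1", "event": "level_up"}).encode("utf-8"),
            "PartitionKey": "p-1",
        },
    ],
)

# Registering one enhanced fan-out consumer per internal team gives each of them
# dedicated throughput (2 MB/s per shard) instead of sharing the default read limit.
kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/example-game-events",  # placeholder
    ConsumerName="analytics-consumer",  # placeholder; repeat for each internal consumer
)
```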
A company wants to combine data from multiple software as a service (SaaS) applications for analysis.
A data engineering team needs to use Amazon QuickSight to perform the analysis and build dashboards. A data engineer needs to extract the data from the SaaS applications and make the data available for QuickSight queries.
Which solution will meet these requirements in the MOST operationally efficient way?
- A . Create AWS Lambda functions that call the required APIs to extract the data from the applications. Store the data in an Amazon S3 bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight
- B . Use AWS Lambda functions as Amazon Athena data source connectors to run federated queries against the SaaS applications. Create an Athena data source and a dataset in QuickSight.
- C . Use Amazon AppFlow to create a flow for each SaaS application. Set an Amazon S3 bucket as the destination. Schedule the flows to extract the data to the bucket. Use AWS Glue to catalog the data in the S3 bucket. Create a data source and a dataset in QuickSight.
- D . Export the data from the SaaS applications as Microsoft Excel files. Create a data source and a dataset in QuickSight by uploading the Excel files.
A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams.
A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline.
Which solution will meet this requirement?
- A . Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
- B . Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.
- C . Design the data source so events are not ingested into Kinesis Data Streams multiple times.
- D . Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.
A
Explanation:
For exactly-once delivery and processing in Amazon Kinesis Data Streams, the best approach is to design the application so that it handles idempotency. By embedding a unique ID in each record, the application can identify and remove duplicate records during processing.
Exactly-Once Processing:
Kinesis Data Streams does not natively support exactly-once processing. Therefore, idempotency should be designed into the application, ensuring that each record has a unique identifier so that the same event is processed only once, even if it is ingested multiple times.
This pattern is widely used for achieving exactly-once semantics in distributed systems.
Reference: Building Idempotent Applications with Kinesis
Alternatives Considered:
B (Checkpoint configuration): While updating the checkpoint configuration can help with some aspects of duplicate processing, it is not a full solution for exactly-once delivery.
C (Design data source): Ensuring events are not ingested multiple times is ideal, but network outages can make this difficult, and it doesn’t guarantee exactly-once delivery.
D (Using EMR): While using EMR with Flink or Spark could work, it introduces unnecessary complexity compared to handling idempotency at the application level.
Reference: Amazon Kinesis Best Practices for Exactly-Once Processing
Achieving Idempotency with Amazon Kinesis
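For illustration, a minimal sketch of the idempotency pattern described above follows; the stream name and the DynamoDB deduplication table (processed_records) are hypothetical placeholders.

```python
# Sketch of the idempotency pattern in option A: the producer embeds a unique ID in
# every record, and the consumer uses a conditional DynamoDB write so a record that
# arrives more than once is processed only once. Names are placeholders.
import json
import uuid

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

# Producer side: attach a unique ID before calling PutRecord.
payload = {"record_id": str(uuid.uuid4()), "amount": 125.50}
kinesis.put_record(
    StreamName="example-transactions",  # placeholder
    Data=json.dumps(payload).encode("utf-8"),
    PartitionKey=payload["record_id"],
)

# Consumer side: process a record only if its ID has not been seen before.
def process_once(record: dict) -> bool:
    try:
        dynamodb.put_item(
            TableName="processed_records",  # placeholder deduplication table
            Item={"record_id": {"S": record["record_id"]}},
            ConditionExpression="attribute_not_exists(record_id)",
        )
    except ClientError as error:
        if error.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate delivery; skip processing
        raise
    # ... apply the business logic for a first-time record here ...
    return True
```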
A company stores CSV files in an Amazon S3 bucket. A data engineer needs to process the data in the CSV files and store the processed data in a new S3 bucket.
The process needs to rename a column, remove specific columns, ignore the second row of each file, create a new column based on the values of the first row of the data, and filter the results by a numeric value of a column.
Which solution will meet these requirements with the LEAST development effort?
- A . Use AWS Glue Python jobs to read and transform the CSV files.
- B . Use an AWS Glue custom crawler to read and transform the CSV files.
- C . Use an AWS Glue workflow to build a set of jobs to crawl and transform the CSV files.
- D . Use AWS Glue DataBrew recipes to read and transform the CSV files.
D
Explanation:
The requirement involves transforming CSV files by renaming a column, removing specific columns, skipping a row, creating a derived column, and filtering by value, all with minimal development effort. AWS Glue DataBrew is the best solution here because it allows you to visually create transformation recipes without writing extensive code.
Option D: Use AWS Glue DataBrew recipes to read and transform the CSV files. DataBrew provides a visual interface where you can build transformation steps (e.g., renaming columns, filtering rows, creating new columns, etc.) as a "recipe" that can be applied to datasets, making it easy to handle complex transformations on CSV files with minimal coding.
Other options (A, B, C) involve more manual development and configuration effort (e.g., writing Python jobs or creating custom workflows in Glue) compared to the low-code/no-code approach of DataBrew.
Reference: AWS Glue DataBrew Documentation
A company stores a large volume of customer records in Amazon S3. To comply with regulations, the company must be able to access new customer records immediately for the first 30 days after the records are created. The company accesses records that are older than 30 days infrequently.
The company needs to cost-optimize its Amazon S3 storage.
Which solution will meet these requirements MOST cost-effectively?
- A . Apply a lifecycle policy to transition records to S3 Standard-Infrequent Access (S3 Standard-IA) storage after 30 days.
- B . Use S3 Intelligent-Tiering storage.
- C . Transition records to S3 Glacier Deep Archive storage after 30 days.
- D . Use S3 Standard-Infrequent Access (S3 Standard-IA) storage for all customer records.
A
Explanation:
The most cost-effective solution in this case is to apply a lifecycle policy to transition records to Amazon S3 Standard-IA storage after 30 days.
Here’s why:
Amazon S3 Lifecycle Policies: Amazon S3 offers lifecycle policies that allow you to automatically transition objects between different storage classes to optimize costs. For data that is frequently accessed in the first 30 days and infrequently accessed after that, transitioning from the S3 Standard storage class to S3 Standard-Infrequent Access (S3 Standard-IA) after 30 days makes the most sense. S3 Standard-IA is designed for data that is accessed less frequently but still needs to be retained, offering lower storage costs than S3 Standard with a retrieval cost for access.
Cost Optimization: S3 Standard-IA offers a lower price per GB than S3 Standard. Since the data will be accessed infrequently after 30 days, using S3 Standard-IA will lower storage costs while still allowing for immediate retrieval when necessary.
Compliance with Regulations: Since the records need to be immediately accessible for the first 30 days, the use of S3 Standard for that period ensures compliance with regulatory requirements. After 30 days, transitioning to S3 Standard-IA continues to meet access requirements for infrequent access while reducing storage costs.
Alternatives Considered:
Option B (S3 Intelligent-Tiering): While S3 Intelligent-Tiering automatically moves data between access tiers based on access patterns, it incurs a small monthly monitoring and automation charge per object. It could be a viable option, but transitioning data to S3 Standard-IA directly would be more cost-effective since the pattern of access is well-known (frequent for 30 days, infrequent thereafter).
Option C (S3 Glacier Deep Archive): Glacier Deep Archive is the lowest-cost storage class, but it is not suitable in this case because the data needs to be accessed immediately within 30 days and on an infrequent basis thereafter. Glacier Deep Archive requires hours for data retrieval, which is not acceptable for infrequent access needs.
Option D (S3 Standard-IA for all records): Using S3 Standard-IA for all records would result in higher costs for the first 30 days, as the data is frequently accessed. S3 Standard-IA incurs retrieval charges, making it less suitable for frequently accessed data.
Reference: Amazon S3 Lifecycle Policies
S3 Storage Classes
Cost Management and Data Optimization Using Lifecycle Policies
AWS Data Engineering Documentation
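For illustration, a minimal boto3 sketch of the lifecycle rule in option A follows; the bucket name is a hypothetical placeholder.

```python
# Sketch of option A: a lifecycle rule that transitions objects to S3 Standard-IA
# 30 days after creation. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-customer-records",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "standard-ia-after-30-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply the rule to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                ],
            }
        ]
    },
)
```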
A company stores sensitive data in an Amazon Redshift table. The company needs to give specific users the ability to access the sensitive data. The company must not create duplication in the data.
Customer support users must be able to see the last four characters of the sensitive data. Audit users must be able to see the full value of the sensitive data. No other users can have the ability to access the sensitive information.
Which solution will meet these requirements?
- A . Create a dynamic data masking policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the masking policy to the column that contains sensitive data.
- B . Enable metadata security on the Redshift cluster. Create IAM users and IAM roles for the customer support users and the audit users. Grant the IAM users and IAM roles permissions to view the metadata in the Redshift cluster.
- C . Create a row-level security policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the security policy to the table.
- D . Create an AWS Glue job to redact the sensitive data and to load the data into a new Redshift table.
A
Explanation:
Amazon Redshift supports dynamic data masking, which enables you to limit sensitive data visibility to specific users and roles without duplicating the data. This approach supports showing only parts of a column’s values (e.g., last four digits) and full visibility for authorized roles (e.g., auditors).
“With dynamic data masking, you can control how much sensitive data a user sees in query results without changing the data in the table.”
Source: Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
IAM roles are used to associate users with the appropriate masking rules, keeping security tight and avoiding the creation of duplicate data views or tables.
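For illustration, a minimal sketch of option A follows, submitted through the Redshift Data API; the table, column, role, cluster, and secret names are hypothetical placeholders, and additional policies (a pass-through policy for audit users and a default full-redaction policy for everyone else) would complete the setup.

```python
# Sketch of option A: Redshift dynamic data masking statements submitted through the
# Redshift Data API. The table, column, role, cluster, and secret names are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

statements = [
    # Customer support users see only the last four characters of the value.
    """
    CREATE MASKING POLICY mask_partial
    WITH (card_number VARCHAR(16))
    USING ('XXXXXXXXXXXX' || SUBSTRING(card_number, 13, 4))
    """,
    "ATTACH MASKING POLICY mask_partial ON transactions(card_number) TO ROLE support_role",
    # Audit users would be attached to a pass-through policy that returns the full
    # value, and a default full-redaction policy attached to PUBLIC would cover all
    # other users, with PRIORITY deciding which policy applies.
]

for sql in statements:
    redshift_data.execute_statement(
        ClusterIdentifier="example-cluster",  # placeholder
        Database="dev",                       # placeholder
        SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-admin",  # placeholder
        Sql=sql,
    )
```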
