Practice Free Amazon DEA-C01 Exam Online Questions
A company uploads .csv files to an Amazon S3 bucket. The company’s data platform team has set up an AWS Glue crawler to perform data discovery and to create the tables and schemas.
An AWS Glue job writes processed data from the tables to an Amazon Redshift database. The AWS Glue job handles column mapping and creates the Amazon Redshift tables in the Redshift database appropriately.
If the company reruns the AWS Glue job for any reason, duplicate records are introduced into the Amazon Redshift tables. The company needs a solution that will update the Redshift tables without duplicates.
Which solution will meet these requirements?
- A . Modify the AWS Glue job to copy the rows into a staging Redshift table. Add SQL commands to update the existing rows with new values from the staging Redshift table.
- B . Modify the AWS Glue job to load the previously inserted data into a MySQL database. Perform an upsert operation in the MySQL database. Copy the results to the Amazon Redshift tables.
- C . Use Apache Spark’s DataFrame dropDuplicates() API to eliminate duplicates. Write the data to the Redshift tables.
- D . Use the AWS Glue ResolveChoice built-in transform to select the value of the column from the most recent record.
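Option A refers to the staging-table merge (upsert) pattern that Amazon Redshift documents for avoiding duplicate rows when a load is rerun. A minimal sketch of the SQL an AWS Glue job could issue after copying rows into a staging table, submitted through the Redshift Data API; the cluster, database, table, and column names are hypothetical:

import boto3

client = boto3.client("redshift-data")

# One documented merge variant: delete the target rows that also exist in the
# staging table, then insert everything from the staging table. The Data API
# runs the batch as a single transaction.
merge_statements = [
    "DELETE FROM sales_target USING sales_stage "
    "WHERE sales_target.order_id = sales_stage.order_id;",
    "INSERT INTO sales_target SELECT * FROM sales_stage;",
    "DROP TABLE sales_stage;",
]

client.batch_execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=merge_statements,
)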
A company is using Amazon Redshift to build a data warehouse solution. The company is loading hundreds of files into a fact table that is in a Redshift cluster.
The company wants the data warehouse solution to achieve the greatest possible throughput. The solution must use cluster resources optimally when the company loads data into the fact table.
Which solution will meet these requirements?
- A . Use multiple COPY commands to load the data into the Redshift cluster.
- B . Use S3DistCp to load multiple files into Hadoop Distributed File System (HDFS). Use an HDFS connector to ingest the data into the Redshift cluster.
- C . Use a number of INSERT statements equal to the number of Redshift cluster nodes. Load the data in parallel into each node.
- D . Use a single COPY command to load the data into the Redshift cluster.
D
Explanation:
To achieve the highest throughput and efficiently use cluster resources while loading data into an Amazon Redshift cluster, the optimal approach is to use a single COPY command that ingests data in parallel.
Option D: Use a single COPY command to load the data into the Redshift cluster. The COPY command is designed to load data from multiple files in parallel into a Redshift table, using all the cluster nodes to optimize the load process. Redshift is optimized for parallel processing, and a single COPY command can load multiple files at once, maximizing throughput.
Options A, B, and C either involve unnecessary complexity or inefficient approaches, such as using multiple COPY commands or INSERT statements, which are not optimized for bulk loading.
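A single COPY command pointed at an S3 prefix (or a manifest file) lets Redshift split the load across all slices in the cluster. A minimal sketch issued through the Redshift Data API; the bucket, IAM role, cluster, and table names are hypothetical:

import boto3

client = boto3.client("redshift-data")

# One COPY command referencing a prefix: Redshift loads all matching files in
# parallel across the cluster's slices, which maximizes throughput.
copy_sql = """
    COPY fact_sales
    FROM 's3://example-bucket/fact-sales/'
    IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)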
Reference: Amazon Redshift COPY Command Documentation
A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.
Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?
- A . Set up an AWS DMS replication instance in Account_B in eu-west-1.
- B . Set up an AWS DMS replication instance in Account_B in eu-east-1.
- C . Set up an AWS DMS replication instance in a new AWS account in eu-west-1
- D . Set up an AWS DMS replication instance in Account_A in eu-east-1.
A
Explanation:
To migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region (Account_A) to an Amazon Redshift cluster in the eu-west-1 Region (Account_B), AWS DMS needs a replication instance located in the target region (in this case, eu-west-1) to facilitate the data transfer between regions.
Option A: Set up an AWS DMS replication instance in Account_B in eu-west-1. Placing the DMS replication instance in the target account and region (Account_B in eu-west-1) is the most efficient solution. The replication instance can connect to the source RDS PostgreSQL in eu-east-1 and migrate the data to the Redshift cluster in eu-west-1. This setup ensures data is replicated across AWS accounts and regions.
Options B, C, and D place the replication instance in either the wrong account or region, which increases complexity without adding any benefit.
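A rough boto3 sketch of the pieces involved, created in Account_B in eu-west-1; all identifiers, hostnames, and credentials below are hypothetical:

import boto3

# DMS resources live in Account_B, eu-west-1, the same Region as the Redshift target.
dms = boto3.client("dms", region_name="eu-west-1")

instance = dms.create_replication_instance(
    ReplicationInstanceIdentifier="rds-to-redshift-migration",
    ReplicationInstanceClass="dms.t3.medium",
    AllocatedStorage=50,
)

# Source endpoint: the RDS for PostgreSQL instance in Account_A, reachable over a
# cross-account, cross-Region network path such as VPC peering.
source = dms.create_endpoint(
    EndpointIdentifier="accounta-postgres-source",
    EndpointType="source",
    EngineName="postgres",
    ServerName="accounta-db.example.rds.amazonaws.com",
    Port=5432,
    DatabaseName="appdb",
    Username="dms_user",
    Password="example-password",
)

# Target endpoint: the Redshift cluster in Account_B (eu-west-1).
target = dms.create_endpoint(
    EndpointIdentifier="accountb-redshift-target",
    EndpointType="target",
    EngineName="redshift",
    ServerName="analytics.example.redshift.amazonaws.com",
    Port=5439,
    DatabaseName="dev",
    Username="awsuser",
    Password="example-password",
)

# A replication task (dms.create_replication_task) then ties the two endpoints
# to the replication instance and starts the migration.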
Reference: AWS Database Migration Service (DMS) Documentation
Cross-Region and Cross-Account Replication
A company needs to store semi-structured transactional data in a serverless database.
The application writes data infrequently but reads it frequently, with millisecond retrieval required.
Which solution will meet these requirements?
- A . Store the data in an Amazon S3 Standard bucket. Enable S3 Transfer Acceleration.
- B . Store the data in an Amazon S3 Apache Iceberg table. Enable S3 Transfer Acceleration.
- C . Store the data in an Amazon RDS for MySQL cluster. Configure RDS Optimized Reads.
- D . Store the data in an Amazon DynamoDB table. Configure a DynamoDB Accelerator (DAX) cache.
D
Explanation:
Amazon DynamoDB is a serverless, low-latency, NoSQL database ideal for semi-structured data.
Adding DynamoDB Accelerator (DAX) provides microsecond response times for read-heavy workloads.
“For applications requiring millisecond or sub-millisecond reads with serverless operation, use DynamoDB with DAX caching.”
– Ace the AWS Certified Data Engineer – Associate Certification – version 2 – apple.pdf
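A minimal boto3 sketch of the table access pattern; the table name, key, and attributes are hypothetical. In the application's read path, the API-compatible DAX client from the amazondax package can be pointed at the DAX cluster endpoint in place of the plain DynamoDB client so that repeated reads are served from the in-memory cache:

import boto3

# Plain DynamoDB client shown here; swapping in the DAX client (amazondax package)
# keeps the same calls but answers cached reads in microseconds.
dynamodb = boto3.client("dynamodb")

# Infrequent write of a semi-structured item.
dynamodb.put_item(
    TableName="TransactionEvents",
    Item={
        "transaction_id": {"S": "txn-0001"},
        "payload": {"M": {"amount": {"N": "129.99"}, "currency": {"S": "USD"}}},
    },
)

# Frequent read by key: single-digit milliseconds from DynamoDB, faster via DAX.
item = dynamodb.get_item(
    TableName="TransactionEvents",
    Key={"transaction_id": {"S": "txn-0001"}},
)["Item"]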
A company stores server logs in an Amazon S3 bucket. The company needs to keep the logs for 1 year. The logs are not required after 1 year.
A data engineer needs a solution to automatically delete logs that are older than 1 year.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Define an S3 Lifecycle configuration to delete the logs after 1 year.
- B . Create an AWS Lambda function to delete the logs after 1 year.
- C . Schedule a cron job on an Amazon EC2 instance to delete the logs after 1 year.
- D . Configure an AWS Step Functions state machine to delete the logs after 1 year.
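The Lifecycle rule that option A describes can be set with a single API call; a minimal boto3 sketch, with a hypothetical bucket name and prefix:

import boto3

s3 = boto3.client("s3")

# Expire (delete) server logs 365 days after creation. Amazon S3 performs the
# deletions automatically, with nothing to schedule or maintain.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-server-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-logs-after-1-year",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            }
        ]
    },
)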
A
Explanation:
An S3 Lifecycle configuration can expire objects automatically after a defined period, such as 365 days after creation. Amazon S3 then deletes the expired logs with no code to write and no infrastructure to maintain. A Lambda function, an EC2 cron job, or a Step Functions state machine could achieve the same result, but each would require custom development, scheduling, and ongoing maintenance, so the Lifecycle configuration has the least operational overhead.
Reference: Amazon S3 Lifecycle Configuration Documentation
A company uses AWS Glue for its ETL pipelines and requires automatic data quality checks during pipeline execution. The solution must integrate with the existing AWS Glue pipelines and evaluate data quality rules based on predefined thresholds.
Which solution will meet these requirements with the LEAST implementation effort?
- A . Add SQL transforms to the AWS Glue ETL pipelines to define and evaluate the data quality rules.
- B . Add Evaluate Data Quality transforms to the AWS Glue ETL pipelines. Define the data quality rules in Data Quality Definition Language (DQDL).
- C . Add custom transforms to the AWS Glue ETL pipelines. Use the PyDeequ library to evaluate the data quality rules.
- D . Add custom transforms to the AWS Glue ETL pipelines. Use the Great Expectations library to evaluate the data quality rules.
B
Explanation:
Problem Analysis:
The company uses AWS Glue for ETL pipelines and requires automatic data quality checks during pipeline execution.
The solution must integrate with existing AWS Glue pipelines and evaluate data quality rules based on predefined thresholds.
Key Considerations:
Ensure minimal implementation effort by leveraging built-in AWS Glue features.
Use a standardized approach for defining and evaluating data quality rules.
Avoid custom libraries or external frameworks unless absolutely necessary.
Solution Analysis:
Option A: SQL Transform
Adding SQL transforms to define and evaluate data quality rules is possible but requires writing complex queries for each rule.
Increases operational overhead and deviates from Glue’s declarative approach.
Option B: Evaluate Data Quality Transform with DQDL
AWS Glue provides a built-in Evaluate Data Quality transform.
Allows defining rules in Data Quality Definition Language (DQDL), a concise and declarative way to define quality checks.
Fully integrated with Glue Studio, making it the least effort solution.
Option C: Custom Transform with PyDeequ
PyDeequ is a powerful library for data quality checks but requires custom code and integration.
Increases implementation effort compared to Glue’s native capabilities.
Option D: Custom Transform with Great Expectations
Great Expectations is another powerful library for data quality but adds complexity and external dependencies.
Final Recommendation:
Use Evaluate Data Quality transform in AWS Glue.
Define rules in DQDL for checking thresholds, null values, or other quality criteria.
This approach minimizes development effort and ensures seamless integration with AWS Glue.
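A short example of what a DQDL ruleset can look like; the rule thresholds and column names are hypothetical, and the ruleset string is attached to the Evaluate Data Quality transform (for example, in AWS Glue Studio):

# Hypothetical DQDL ruleset supplied to the Evaluate Data Quality transform in an
# AWS Glue job; each rule is checked during the pipeline run, and the job can be
# configured to fail or continue based on the outcome.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "amount" >= 0,
    Completeness "customer_id" > 0.95
]
"""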
AWS Glue Data Quality Overview
DQDL Syntax and Examples
Glue Studio Transformations
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Establish WebSocket connections to Amazon Redshift.
- B . Use the Amazon Redshift Data API.
- C . Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
- D . Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.
B
Explanation:
The Amazon Redshift Data API is a built-in feature that allows you to run SQL queries on Amazon Redshift data with web services-based applications, such as AWS Lambda, Amazon SageMaker notebooks, and AWS Cloud9. The Data API does not require a persistent connection to your database, and it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections. The Data API also supports both Amazon Redshift provisioned clusters and Redshift Serverless workgroups. The Data API is the best solution for running real-time queries on the financial data from within the trading application, as it has the least operational overhead compared to the other options.
Option A is not the best solution. Amazon Redshift does not expose a WebSocket interface, so the application would need a custom intermediary layer to translate WebSocket traffic into database calls, which adds configuration and maintenance work compared to the Data API.
Option C is not the best solution. JDBC connections require the application to manage drivers, connection pools, and persistent database connections, which adds more operational overhead than calling the Data API's secure HTTP endpoint.
Option D is not the best solution. Copying frequently accessed data to Amazon S3 and querying it with Amazon S3 Select introduces an extra data-movement step and additional latency. S3 Select also operates on a single object at a time and is not designed for real-time application queries.
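A minimal boto3 sketch of how the trading application could call the Data API over HTTPS; the workgroup, database, table, and SQL are hypothetical:

import time
import boto3

client = boto3.client("redshift-data")

# Submit the query; no persistent connection or driver is needed.
response = client.execute_statement(
    WorkgroupName="trading-wg",          # or ClusterIdentifier=... for a provisioned cluster
    Database="trading",
    Sql="SELECT symbol, last_price FROM quotes WHERE symbol = :symbol",
    Parameters=[{"name": "symbol", "value": "AMZN"}],
)

# The Data API is asynchronous: poll the statement, then fetch the result set.
statement_id = response["Id"]
while True:
    status = client.describe_statement(Id=statement_id)["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(0.05)

rows = client.get_statement_result(Id=statement_id)["Records"]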
Reference: Using the Amazon Redshift Data API
Calling the Data API
Amazon Redshift Data API Reference
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
A security company stores IoT data that is in JSON format in an Amazon S3 bucket. The data structure can change when the company upgrades the IoT devices. The company wants to create a data catalog that includes the IoT data. The company’s analytics department will use the data catalog to index the data.
Which solution will meet these requirements MOST cost-effectively?
- A . Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create a new AWS Glue workload to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
- B . Create an Amazon Redshift provisioned cluster. Create an Amazon Redshift Spectrum database for the analytics department to explore the data that is in Amazon S3. Create Redshift stored procedures to load the data into Amazon Redshift.
- C . Create an Amazon Athena workgroup. Explore the data that is in Amazon S3 by using Apache Spark through Athena. Provide the Athena workgroup schema and tables to the analytics department.
- D . Create an AWS Glue Data Catalog. Configure an AWS Glue Schema Registry. Create AWS Lambda user defined functions (UDFs) by using the Amazon Redshift Data API. Create an AWS Step Functions job to orchestrate the ingestion of the data that the analytics department will use into Amazon Redshift Serverless.
C
Explanation:
The best solution to meet the requirements of creating a data catalog that includes the IoT data, and allowing the analytics department to index the data, most cost-effectively, is to create an Amazon Athena workgroup, explore the data that is in Amazon S3 by using Apache Spark through Athena, and provide the Athena workgroup schema and tables to the analytics department.
Amazon Athena is a serverless, interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL or Python. Amazon Athena also supports Apache Spark, an open-source distributed processing framework that can run large-scale data analytics applications across clusters of servers. You can use Athena to run Spark code on data in Amazon S3 without having to set up, manage, or scale any infrastructure. You can also use Athena to create and manage external tables that point to your data in Amazon S3, and store them in an external data catalog, such as AWS Glue Data Catalog, Amazon Athena Data Catalog, or your own Apache Hive metastore. You can create Athena workgroups to separate query execution and resource allocation based on different criteria, such as users, teams, or applications. You can share the schemas and tables in your Athena workgroup with other users or applications, such as Amazon QuickSight, for data visualization and analysis.
Using Athena and Spark to create a data catalog and explore the IoT data in Amazon S3 is the most cost-effective solution, as you pay only for the queries you run or the compute you use, and you pay nothing when the service is idle. You also save on the operational overhead and complexity of managing data warehouse infrastructure, as Athena and Spark are serverless and scalable. You can also benefit from the flexibility and performance of Athena and Spark, as they support various data formats, including JSON, and can handle schema changes and complex queries efficiently.
Option A is not the best solution. Creating an AWS Glue Data Catalog, configuring an AWS Glue Schema Registry, and building a new AWS Glue workload to ingest the data into Amazon Redshift Serverless adds cost and complexity. The Data Catalog and Schema Registry charge based on the number of objects stored and the number of requests made, AWS Glue charges for the compute time and data processed by its ETL jobs, and Amazon Redshift Serverless charges for the compute that your workloads use. These costs add up quickly with large volumes of IoT data and frequent schema changes, and the extra ingestion step into Redshift Serverless adds latency compared to querying the data directly in Amazon S3 with Athena and Spark.
Option B is not the best solution. Provisioning an Amazon Redshift cluster, creating an Amazon Redshift Spectrum database, and writing Redshift stored procedures to load the data likewise adds cost and complexity. Provisioned clusters are billed by node type, node count, and running time, and Redshift Spectrum is billed for the data scanned by each query. The data engineer would also have to manage the cluster, define external schemas, and maintain load procedures instead of querying the data directly in Amazon S3.
Option D is not the best solution. Combining an AWS Glue Data Catalog and Schema Registry with AWS Lambda user-defined functions (UDFs) that call the Amazon Redshift Data API, orchestrated by AWS Step Functions, introduces the most moving parts of all the options. Lambda is billed per request and per duration, Step Functions is billed per state transition, and Amazon Redshift Serverless is billed for the compute that your workloads use, on top of the Data Catalog and Schema Registry charges. Building and coordinating the Lambda functions and the Step Functions workflow to ingest the data into Amazon Redshift Serverless involves far more effort and latency than querying the data directly in Amazon S3 with Athena and Spark.
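A minimal boto3 sketch of registering the JSON data as an external table through Athena; the database, table, bucket, and workgroup names are hypothetical, and the resulting table definition is stored in the data catalog for the analytics department to query:

import boto3

athena = boto3.client("athena")

# Define an external table over the raw JSON objects in S3. Because the schema can
# change when devices are upgraded, unexpected fields can be kept in a catch-all
# column or added later with ALTER TABLE. Assumes the iot_catalog database exists.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS iot_catalog.device_events (
    device_id   string,
    event_time  string,
    payload     string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-iot-bucket/events/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "iot_catalog"},
    WorkGroup="analytics",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)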
Reference: What is Amazon Athena?
Apache Spark on Amazon Athena
Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs
Managing Athena workgroups
Using Amazon QuickSight to visualize data in Amazon Athena
AWS Glue Data Catalog
AWS Glue Schema Registry
What is AWS Glue?
Amazon Redshift Serverless
Amazon Redshift provisioned clusters
Querying external data using Amazon Redshift Spectrum
Using stored procedures in Amazon Redshift
What is AWS Lambda?
Creating and using AWS Lambda UDFs
Using the Amazon Redshift Data API
What is AWS Step Functions?
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
A data engineer is using Amazon Athena to analyze sales data that is in Amazon S3. The data engineer writes a query to retrieve sales amounts for 2023 for several products from a table named sales_data. However, the query does not return results for all of the products that are in the sales_data table. The data engineer needs to troubleshoot the query to resolve the issue.
The data engineer’s original query is as follows:
SELECT product_name, sum(sales_amount)
FROM sales_data
WHERE year = 2023
GROUP BY product_name
How should the data engineer modify the Athena query to meet these requirements?
- A . Replace sum(sales_amount) with count(*) for the aggregation.
- B . Change WHERE year = 2023 to WHERE extract(year FROM sales_data) = 2023.
- C . Add HAVING sum(sales_amount) > 0 after the GROUP BY clause.
- D . Remove the GROUP BY clause.
B
Explanation:
The original query does not return results for all of the products because the year column in the sales_data table stores timestamp values rather than integers. The comparison WHERE year = 2023 therefore does not match the rows it is intended to match, and some products are filtered out of the result. Using the extract function to pull the year out of the timestamp and compare it with 2023 makes the filter behave as intended, so the query returns results for all of the products in the sales_data table. The other options do not address the root cause: replacing sum with count changes only the aggregation, adding a HAVING clause filters groups after aggregation, and removing the GROUP BY clause breaks the per-product totals.
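A sketch of the corrected query as it could be submitted through boto3, following the explanation's premise that the year column holds timestamp values; the workgroup and output location are hypothetical:

import boto3

athena = boto3.client("athena")

# extract() pulls the year out of the timestamp value, so the comparison with the
# integer 2023 matches the intended rows for every product. The column is quoted
# because it shares its name with the extract field keyword.
corrected_query = """
SELECT product_name, sum(sales_amount)
FROM sales_data
WHERE extract(year FROM "year") = 2023
GROUP BY product_name
"""

athena.start_query_execution(
    QueryString=corrected_query,
    WorkGroup="primary",
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)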
Reference: Troubleshooting JSON queries – Amazon Athena (Section: JSON related errors)
When I query a table in Amazon Athena, the TIMESTAMP result is empty (Section: Resolution)
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide (Chapter 7, page 197)
A company runs a data pipeline that uses AWS Step Functions to orchestrate AWS Lambda functions and AWS Glue jobs. The Lambda functions and AWS Glue jobs require access to multiple Amazon RDS databases. The Lambda functions and AWS Glue jobs already have access to the VPC that hosts the RDS databases.
Which solution will meet these requirements in the MOST secure way?
- A . Use the root user of the company’s AWS account to create long-term access keys for the RDS databases. Include the access keys programmatically in the Lambda functions and AWS Glue jobs. Generate new keys every 90 days.
- B . Create an IAM role that has permissions to access the RDS databases. Create a second IAM role for the Lambda functions and AWS Glue jobs that has permissions to assume the IAM role that has access permissions for the RDS databases.
- C . Create an IAM user that can assume IAM roles that have permissions and credentials to access the RDS databases. Assign the IAM user to each of the Lambda functions and AWS Glue jobs.
- D . Create Java Database Connectivity (JDBC) connections between the Lambda functions and AWS Glue jobs and the RDS databases. In the connection string, include the necessary credentials.
B
Explanation:
AWS security best practices require avoiding long-term credentials and enforcing the principle of least privilege. Using IAM roles with role assumption is the most secure approach for granting temporary, scoped access to AWS resources.
In this solution, an IAM role is created with permissions to access the Amazon RDS databases. A second IAM role is assigned to the AWS Lambda functions and AWS Glue jobs with permission to assume the RDS-access role. This approach eliminates hard-coded credentials, enables automatic credential rotation through AWS Security Token Service (AWS STS), and provides clear separation of duties.
Using the root user or long-term access keys is explicitly discouraged by AWS. Creating IAM users for services violates best practices, as AWS services should use IAM roles, not users. Embedding credentials in JDBC connection strings exposes sensitive information and increases the risk of credential leakage.
Therefore, using IAM role assumption provides the strongest security posture, auditability, and compliance with AWS Well-Architected Framework guidance.
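A minimal sketch of the role-assumption flow from inside a Lambda function or AWS Glue job; the role ARN, session name, and database details are hypothetical:

import boto3

# The function's or job's execution role is allowed to assume the RDS-access role,
# so only short-lived STS credentials are ever used and nothing is hard-coded.
sts = boto3.client("sts")
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::111122223333:role/rds-access-role",
    RoleSessionName="glue-rds-session",
)["Credentials"]

session = boto3.Session(
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

# With the assumed role, the job can, for example, generate an IAM database
# authentication token instead of using a stored password (assumes IAM database
# authentication is enabled on the RDS instance).
rds = session.client("rds")
token = rds.generate_db_auth_token(
    DBHostname="appdb.example.rds.amazonaws.com",
    Port=5432,
    DBUsername="app_user",
)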
A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository.
Which solution will meet these requirements with the LEAST development effort?
- A . Use Amazon EMR and Apache Ranger.
- B . Use a Hive metastore on an EMR cluster.
- C . Use the AWS Glue Data Catalog.
- D . Use a metastore on an Amazon RDS for MySQL DB instance.
C
Explanation:
The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog that provides a central metadata repository for various data sources and formats. You can use the AWS Glue Data Catalog as an external Hive metastore for Amazon EMR and Amazon Athena queries, and import metadata from existing Hive metastores into the Data Catalog. This solution requires the least development effort, as you can use AWS Glue crawlers to automatically discover and catalog the metadata from Hive, and use the AWS Glue console, AWS CLI, or Amazon EMR API to configure the Data Catalog as the Hive metastore. The other options are either more complex or require additional steps, such as setting up Apache Ranger for security, managing a Hive metastore on an EMR cluster or an RDS instance, or migrating the metadata manually.
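On the EMR side, pointing Hive and Spark at the AWS Glue Data Catalog is a cluster configuration setting. A minimal boto3 sketch of that configuration block supplied at cluster creation; the cluster name, release label, instance sizes, and roles shown are hypothetical or EMR defaults:

import boto3

# Configuration that makes Hive and Spark on EMR use the AWS Glue Data Catalog as
# their metastore, so tables imported there are visible to both EMR and Athena.
glue_catalog_config = [
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
    {
        "Classification": "spark-hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    },
]

emr = boto3.client("emr")
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    Configurations=glue_catalog_config,
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)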
Reference: Using the AWS Glue Data Catalog as the metastore for Hive (Section: Specifying AWS Glue Data Catalog as the metastore)
Metadata Management: Hive Metastore vs AWS Glue (Section: AWS Glue Data Catalog)
AWS Glue Data Catalog support for Spark SQL jobs (Section: Importing metadata from an existing Hive metastore)
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide (Chapter 5, page 131)
