Practice Free Amazon DEA-C01 Exam Online Questions
A transportation company wants to track vehicle movements by capturing geolocation records. The records are 10 bytes in size. The company receives up to 10,000 records every second. Data transmission delays of a few minutes are acceptable because of unreliable network conditions.
The transportation company wants to use Amazon Kinesis Data Streams to ingest the geolocation data. The company needs a reliable mechanism to send data to Kinesis Data Streams. The company needs to maximize the throughput efficiency of the Kinesis shards.
Which solution will meet these requirements in the MOST operationally efficient way?
- A . Kinesis Agent
- B . Kinesis Producer Library (KPL)
- C . Amazon Data Firehose
- D . Kinesis SDK
B
Explanation:
Problem Analysis:
The company ingests geolocation records (10 bytes each) at 10,000 records per second into Kinesis Data Streams.
Data transmission delays are acceptable, but the solution must maximize throughput efficiency.
Key Considerations:
The Kinesis Producer Library (KPL) batches records and uses aggregation to optimize shard throughput.
The KPL handles high-throughput scenarios efficiently and adds minimal operational overhead.
Solution Analysis:
Option A: Kinesis Agent
Designed for file-based ingestion; not optimized for geolocation records.
Option B: KPL
Aggregates records into larger payloads, significantly improving shard throughput.
Suitable for applications generating small, high-frequency records.
Option C: Amazon Data Firehose
Firehose delivers data to destinations such as Amazon S3 or Amazon Redshift; it is not a producer for ingesting records into Kinesis Data Streams.
Option D: Kinesis SDK
The SDK lacks advanced features like aggregation, resulting in lower throughput efficiency.
Final Recommendation:
Use Kinesis Producer Library (KPL) for its built-in aggregation and batching capabilities.
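The effect of aggregation on shard count can be checked with a little arithmetic. The sketch below assumes the standard per-shard write limits of 1,000 records per second and 1 MB per second, and a hypothetical aggregated payload of 25 KB. The KPL chooses the real payload size, and framing overhead is ignored here.

```python
import math

# Workload from the question.
RECORD_BYTES = 10
RECORDS_PER_SECOND = 10_000

# Per-shard write limits for Kinesis Data Streams (1 MB treated as 10**6 bytes).
SHARD_MAX_RECORDS_PER_SECOND = 1_000
SHARD_MAX_BYTES_PER_SECOND = 1_000_000

# Hypothetical size of one aggregated Kinesis record (an assumption for illustration).
AGGREGATED_RECORD_BYTES = 25_000


def shards_needed(records_per_second, bytes_per_second):
    """Return the shard count required by whichever limit is hit first."""
    by_count = math.ceil(records_per_second / SHARD_MAX_RECORDS_PER_SECOND)
    by_bytes = math.ceil(bytes_per_second / SHARD_MAX_BYTES_PER_SECOND)
    return max(by_count, by_bytes, 1)


bytes_per_second = RECORD_BYTES * RECORDS_PER_SECOND

# Without aggregation, every 10-byte record counts against the record limit.
unaggregated_shards = shards_needed(RECORDS_PER_SECOND, bytes_per_second)

# With aggregation, many user records travel inside one Kinesis record.
user_records_per_aggregate = AGGREGATED_RECORD_BYTES // RECORD_BYTES
aggregated_records_per_second = math.ceil(RECORDS_PER_SECOND / user_records_per_aggregate)
aggregated_shards = shards_needed(aggregated_records_per_second, bytes_per_second)

print("Shards without aggregation:", unaggregated_shards)
print("Shards with aggregation:", aggregated_shards)
```

Under these assumptions the unaggregated workload is bound by the record-count limit rather than by bytes, which is why packing small records together improves shard efficiency.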
Kinesis Producer Library (KPL) Overview
Best Practices for Amazon Kinesis
An ecommerce company collects daily customer transaction logs in CSV format and stores the logs in Amazon S3. The company uses Amazon Athena to scan a subset of attributes from the logs on the same day the company receives each log.
Query times are increasing because of increasing transaction volume. The company wants to improve query performance.
Which solution will meet these requirements with the SHORTEST query times?
- A . Convert the CSV logs into multiple ORC files for better parallelism in Athena. Partition by date in Amazon S3. Use columnar pushdown filters.
- B . Convert the CSV logs to JSON. Partition by date in Amazon S3. Use Athena with dynamic filtering to reduce data scans.
- C . Convert the CSV logs to Avro. Partition by date in Amazon S3. Use Athena with projection-based partitioning.
- D . Convert the CSV logs to a single Apache Parquet file for each day. Partition the data by date in Amazon S3. Use Athena with predicate pushdown filters.
D
Explanation:
Amazon Athena achieves the fastest query performance when data is stored in columnar formats such as Apache Parquet and when queries can take advantage of partition pruning and predicate pushdown.
Converting CSV files to Parquet significantly reduces the amount of data scanned because Parquet stores data in a column-oriented layout. Since Athena queries only a subset of attributes, it reads only the required columns instead of scanning entire rows, which dramatically improves performance. Predicate pushdown further reduces query time by filtering data at the storage layer.
Partitioning the data by date ensures that Athena scans only the relevant partitions for same-day queries, minimizing unnecessary data reads. Storing one Parquet file per day is efficient and avoids the overhead of managing excessive small files.
ORC is also a columnar format, but AWS exam guidance more commonly recommends Parquet for Athena workloads. JSON and Avro are row-oriented formats, so Athena must read whole records even when a query needs only a few attributes, which results in larger scan sizes and slower query execution.
Therefore, Option D provides the shortest query times and aligns with Athena performance best practices.
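One way to perform the conversion is an Athena CREATE TABLE AS SELECT (CTAS) statement. The sketch below only builds the SQL text; the table names, column names, partition column `dt`, and S3 location are hypothetical, and CTAS does not by itself guarantee a single output file per day.

```python
def build_parquet_ctas(source_table, target_table, target_location, columns,
                       partition_column="dt"):
    """Return an Athena CTAS statement that rewrites a table as partitioned Parquet."""
    # Athena requires partition columns to come last in the SELECT list.
    select_list = ", ".join(list(columns) + [partition_column])
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{target_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


def build_same_day_query(table, columns, day, partition_column="dt"):
    """Return a query that reads only the needed columns from one date partition."""
    select_list = ", ".join(columns)
    return f"SELECT {select_list} FROM {table} WHERE {partition_column} = '{day}'"


# Hypothetical names used only for illustration.
ctas_sql = build_parquet_ctas(
    source_table="transactions_csv",
    target_table="transactions_parquet",
    target_location="s3://example-bucket/transactions_parquet/",
    columns=["customer_id", "amount"],
)
query_sql = build_same_day_query(
    "transactions_parquet", ["customer_id", "amount"], "2024-01-15"
)

print(ctas_sql)
print(query_sql)
```

The filter on the partition column lets Athena prune to one day's data, and the short column list lets it skip the Parquet columns the query does not need.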
A data engineer is configuring an AWS Glue job to read data from an Amazon S3 bucket. The data engineer has set up the necessary AWS Glue connection details and an associated IAM role. However, when the data engineer attempts to run the AWS Glue job, the data engineer receives an error message that indicates that there are problems with the Amazon S3 VPC gateway endpoint.
The data engineer must resolve the error and connect the AWS Glue job to the S3 bucket.
Which solution will meet this requirement?
- A . Update the AWS Glue security group to allow inbound traffic from the Amazon S3 VPC gateway endpoint.
- B . Configure an S3 bucket policy to explicitly grant the AWS Glue job permissions to access the S3 bucket.
- C . Review the AWS Glue job code to ensure that the AWS Glue connection details include a fully qualified domain name.
- D . Verify that the VPC’s route table includes inbound and outbound routes for the Amazon S3 VPC gateway endpoint.
D
Explanation:
The error message indicates that the AWS Glue job cannot reach the Amazon S3 bucket through the VPC gateway endpoint. A likely cause is that the VPC’s route table does not have the routes that direct S3 traffic to the endpoint. To fix this, the data engineer must verify that the route table associated with the job’s subnet has an entry whose destination is the prefix list for Amazon S3 (service name com.amazonaws.region.s3) and whose target is the VPC endpoint ID. This allows the AWS Glue job to use the VPC endpoint to access the S3 bucket without going through the internet or a NAT gateway. For more information, see Gateway endpoints.
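The route check can be sketched as a small function over route table data. The function assumes dictionaries shaped like the entries that the EC2 DescribeRouteTables API returns, where a gateway endpoint route carries a `DestinationPrefixListId` and a `GatewayId` that starts with `vpce-`. The IDs below are made up for illustration.

```python
def find_s3_gateway_routes(route_table, s3_prefix_list_id):
    """Return the routes that send S3 prefix-list traffic to a VPC gateway endpoint."""
    return [
        route
        for route in route_table.get("Routes", [])
        if route.get("DestinationPrefixListId") == s3_prefix_list_id
        and route.get("GatewayId", "").startswith("vpce-")
    ]


# Hypothetical route tables: one with the endpoint route, one without.
S3_PREFIX_LIST_ID = "pl-0123456789abcdef0"  # made-up ID

healthy_table = {
    "RouteTableId": "rtb-aaaa1111",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
        {"DestinationPrefixListId": S3_PREFIX_LIST_ID, "GatewayId": "vpce-0abc1234"},
    ],
}
broken_table = {
    "RouteTableId": "rtb-bbbb2222",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    ],
}

for table in (healthy_table, broken_table):
    routes = find_s3_gateway_routes(table, S3_PREFIX_LIST_ID)
    status = "has" if routes else "is missing"
    print(f"{table['RouteTableId']} {status} an S3 gateway endpoint route")
```

In practice the input would come from the route table associated with the subnet that the AWS Glue connection uses.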
Reference: Troubleshoot the AWS Glue error “VPC S3 endpoint validation failed”
Amazon VPC endpoints for Amazon S3
[AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide]
A company uses Amazon Redshift as its data warehouse service. A data engineer needs to design a physical data model.
The data engineer encounters a de-normalized table that is growing in size. The table does not have a suitable column to use as the distribution key.
Which distribution style should the data engineer use to meet these requirements with the LEAST maintenance overhead?
- A . ALL distribution
- B . EVEN distribution
- C . AUTO distribution
- D . KEY distribution
A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB.
Which solution will meet these requirements MOST cost-effectively?
- A . Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
- B . Write a PySpark ETL script. Host the script on an Amazon EMR cluster.
- C . Write an AWS Glue PySpark job. Use Apache Spark to transform the data.
- D . Write an AWS Glue Python shell job. Use pandas to transform the data.
D
Explanation:
AWS Glue is a fully managed serverless ETL service that can handle various data sources and formats, including .csv files in Amazon S3. AWS Glue provides two types of jobs: PySpark and Python shell. PySpark jobs use Apache Spark to process large-scale data in parallel, while Python shell jobs use Python scripts to process small-scale data in a single execution environment. For this requirement, a Python shell job is more suitable and cost-effective, as the size of each S3 object is less than 100 MB, which does not require distributed processing. A Python shell job can use pandas, a popular Python library for data analysis, to transform the .csv data as needed.
The other solutions are not optimal or relevant for this requirement. Writing a custom Python application and hosting it on an Amazon EKS cluster would require more effort and resources to set up and manage the Kubernetes environment, as well as to handle the data ingestion and transformation logic. Writing a PySpark ETL script and hosting it on an Amazon EMR cluster would also incur more costs and complexity to provision and configure the EMR cluster, as well as to use Apache Spark for processing small data files. Writing an AWS Glue PySpark job would also be less efficient and economical than a Python shell job, as it would involve unnecessary overhead and charges for using Apache Spark for small data files.
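The cost difference between the two AWS Glue job types can be estimated with simple arithmetic. The sketch below assumes a Python shell job at its smallest setting of 0.0625 DPU, a Spark job at a minimum of 2 DPUs, an example price of 0.44 USD per DPU-hour (prices vary by Region), and a hypothetical 5-minute runtime for both. Billing minimums are ignored.

```python
# Assumed figures; check current AWS Glue pricing for the Region in use.
PRICE_PER_DPU_HOUR_USD = 0.44
PYTHON_SHELL_DPUS = 0.0625
SPARK_MINIMUM_DPUS = 2
RUNTIME_MINUTES = 5  # hypothetical runtime, assumed equal for both job types


def job_cost(dpus, runtime_minutes, price_per_dpu_hour=PRICE_PER_DPU_HOUR_USD):
    """Return the cost in USD of one job run billed by DPU-hours."""
    return dpus * (runtime_minutes / 60) * price_per_dpu_hour


python_shell_cost = job_cost(PYTHON_SHELL_DPUS, RUNTIME_MINUTES)
spark_cost = job_cost(SPARK_MINIMUM_DPUS, RUNTIME_MINUTES)
cost_ratio = spark_cost / python_shell_cost

print(f"Python shell run: ${python_shell_cost:.4f}")
print(f"Spark run:        ${spark_cost:.4f}")
print(f"Spark costs about {cost_ratio:.0f}x more per run")
```

The ratio depends only on the DPU allocation when both runtimes are equal, so the gap holds regardless of the exact per-hour price.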
Reference: AWS Glue
Working with Python Shell Jobs
pandas
[AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide]
A company has a data processing pipeline that runs multiple SQL queries in sequence against an Amazon Redshift cluster. After a merger, a query joining two large sales tables becomes slow. Table S1 has 10 billion records, Table S2 has 900 million records.
The query performance must improve.
Which combination of steps will meet this requirement? (Choose two.)
- A . Use the KEY distribution style for both sales tables. Select a low cardinality column to use for the join.
- B . Use the KEY distribution style for both sales tables. Select a high cardinality column to use for the join.
- C . Use the EVEN distribution style for Table S1. Use the ALL distribution style for Table S2.
- D . Use the Amazon Redshift query optimizer to review and select optimizations to implement.
- E . Use Amazon Redshift Advisor to review and select optimizations to implement.
B,E
Explanation:
To optimize joins between large tables, Redshift recommends using KEY distribution on a common, high-cardinality join key to ensure co-location of data blocks. The Amazon Redshift Advisor identifies performance bottlenecks and provides recommendations for distribution and sort keys.
“Use KEY distribution when joining large tables on a common high-cardinality column. Use Redshift Advisor to identify query optimization opportunities.”
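The reason cardinality matters for KEY distribution can be illustrated with a toy simulation. The sketch below hashes a distribution key to one of 16 slices and compares a three-value column with a unique-per-row column. MD5 is a stand-in chosen for this example, not the hash function that Amazon Redshift uses, and the key values are invented.

```python
import hashlib
from collections import Counter

NUM_SLICES = 16
NUM_ROWS = 100_000


def assign_slice(key, num_slices):
    """Map a distribution key to a slice with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def slice_counts(keys, num_slices):
    """Count how many rows land on each slice."""
    counts = Counter(assign_slice(key, num_slices) for key in keys)
    return [counts.get(i, 0) for i in range(num_slices)]


def skew_ratio(counts):
    """Return rows on the fullest slice divided by the average rows per slice."""
    return max(counts) / (sum(counts) / len(counts))


# Low cardinality: only three distinct key values.
low_cardinality_keys = [("north", "south", "west")[i % 3] for i in range(NUM_ROWS)]
# High cardinality: every row has its own key value.
high_cardinality_keys = [f"order-{i}" for i in range(NUM_ROWS)]

low_counts = slice_counts(low_cardinality_keys, NUM_SLICES)
high_counts = slice_counts(high_cardinality_keys, NUM_SLICES)

print("Slices used (low cardinality): ", sum(1 for c in low_counts if c > 0))
print("Slices used (high cardinality):", sum(1 for c in high_counts if c > 0))
print(f"Skew ratio (low):  {skew_ratio(low_counts):.2f}")
print(f"Skew ratio (high): {skew_ratio(high_counts):.2f}")
```

With three distinct values, at most three slices can ever hold data, so most of the cluster sits idle during the join. A high-cardinality key spreads rows across all slices while still co-locating matching rows from both tables.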
A company’s application needs to search and analyze data in near real time. The application must handle up to 1,000 requests each second with low query latency. The company wants a solution that individual data teams can own and configure to meet each team’s cost and performance optimization requirements.
Which solution will meet these requirements?
- A . Use Amazon S3 buckets to store the data. Use Amazon Athena to query and analyze the data. Assign each data team a separate S3 bucket prefix to optimize queries.
- B . Use streams in Amazon Kinesis Data Streams and Amazon Managed Service for Apache Flink to query and analyze the data. Assign each data team a separate stream to manage and consume.
- C . Use Amazon OpenSearch Service clusters with indexing to query the data. Assign each data team a separate cluster to configure for storage and queries.
- D . Use Amazon Aurora clusters that run on Aurora I/O-Optimized instances. Assign each data team a separate Aurora cluster to configure for storage and queries.
C
Explanation:
Option C is correct because Amazon OpenSearch Service is designed for search and analytics in near real time with low-latency queries. AWS documentation states that Amazon OpenSearch Service makes it easy to deploy, secure, operate, and scale OpenSearch to search, analyze, and visualize data in real time, and that it provides real-time analytics for use cases such as log analytics, full-text search, application monitoring, and clickstream analytics. That aligns directly with the requirement for handling many requests per second with low query latency.
This option also best satisfies the requirement that individual data teams can own and configure their own environments for cost and performance optimization. By assigning each team a separate OpenSearch cluster, each team can independently tune indexing, storage, shard layout, retention, and scaling policies.
Option A is less suitable because Amazon Athena is serverless SQL over data in S3 and is not the best fit for low-latency, high-request-rate interactive search workloads.
Option B focuses on streaming data processing, not primary low-latency indexed search.
Option D uses a relational database, which is not the native AWS choice for large-scale search and analytics. Therefore, OpenSearch Service is the most appropriate and scalable solution.
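As a rough illustration of the per-team tuning mentioned above, the sketch below builds the request body for creating an OpenSearch index with team-specific shard, replica, and refresh settings. The team names, field names, and values are hypothetical; only the structure of the body follows the OpenSearch create-index API.

```python
def build_index_body(number_of_shards, number_of_replicas, refresh_interval, properties):
    """Return a create-index request body with the given settings and field mappings."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
                "refresh_interval": refresh_interval,
            }
        },
        "mappings": {"properties": properties},
    }


# Hypothetical teams with different cost and performance priorities.
team_index_bodies = {
    # Favors query throughput: more shards and an extra replica.
    "search-team": build_index_body(
        6, 2, "1s",
        {"title": {"type": "text"}, "sku": {"type": "keyword"}},
    ),
    # Favors lower cost: fewer shards, one replica, less frequent refresh.
    "reporting-team": build_index_body(
        2, 1, "30s",
        {"event_time": {"type": "date"}, "status": {"type": "keyword"}},
    ),
}

for team, body in team_index_bodies.items():
    print(team, body["settings"]["index"])
```

Because each team owns its own cluster, one team's choices here do not affect another team's latency or cost.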
A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require.
Which solution will meet these requirements with the LEAST effort?
- A . Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company’s IAM roles. Assign each user to the IAM role that matches the user’s PII access requirements.
- B . Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.
- C . Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.
- D . Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.
A
Explanation:
Amazon Athena is a serverless, interactive query service that enables you to analyze data in Amazon S3 using standard SQL. AWS Lake Formation is a service that helps you build, secure, and manage data lakes on AWS. You can use AWS Lake Formation to create data filters that define the level of access for different IAM roles based on the columns, rows, or tags of the data. By using Amazon Athena to query the data and AWS Lake Formation to create data filters, the company can meet the requirements of ensuring that user groups can access only the PII that they require with the least effort.
The solution is to use Amazon Athena to query the data in the data lake that is in Amazon S3. Then, set up AWS Lake Formation and create data filters to establish levels of access for the company’s IAM roles. For example, a data filter can allow a user group to access only the columns that contain the PII that they need, such as name and email address, and deny access to the columns that contain the PII that they do not need, such as phone number and social security number. Finally, assign each user to the IAM role that matches the user’s PII access requirements. This way, the user groups can access the data in the data lake securely and efficiently.
The other options are either not feasible or not optimal. Using Amazon QuickSight to access the data (option B) would require the company to pay for the QuickSight service and to configure the column-level security features for each user. Building a custom query builder UI that will run Athena queries in the background to access the data (option C) would require the company to develop and maintain the UI and to integrate it with Amazon Cognito. Creating IAM roles that have different levels of granular access (option D) would require the company to manage multiple IAM roles and policies and to ensure that they are aligned with the data schema.
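The name-and-email example can be expressed as a Lake Formation data cells filter. The sketch below only builds the filter definition; it is shaped like the `TableData` argument of the Lake Formation CreateDataCellsFilter API, and the account ID, database, table, and filter names are hypothetical.

```python
def build_column_filter(catalog_id, database, table, filter_name, allowed_columns):
    """Return a data cells filter that exposes all rows but only the listed columns."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": filter_name,
        # No row restriction: every row is visible.
        "RowFilter": {"AllRowsWildcard": {}},
        # Column restriction: only these columns are visible.
        "ColumnNames": list(allowed_columns),
    }


# Hypothetical identifiers used only for illustration.
marketing_filter = build_column_filter(
    catalog_id="111122223333",
    database="customer_lake",
    table="customers",
    filter_name="marketing_contact_pii",
    allowed_columns=["customer_id", "name", "email"],
)

# With boto3, this dictionary would be passed as TableData to the
# Lake Formation client's create_data_cells_filter call.
print(marketing_filter)
```

Columns such as phone number and social security number are simply left out of the allowed list, so roles granted SELECT on this filter cannot read them through Athena.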
Reference: Amazon Athena
AWS Lake Formation
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide, Chapter 4: Data Analysis and Visualization, Section 4.3: Amazon Athena
A company needs a solution to store and query product data that has variable attributes. The solution must support unpredictable and high-volume queries with single-digit millisecond latency, even during sudden traffic spikes. The solution must retrieve items by a primary identifier named Product ID. The solution must allow flexible queries by secondary attributes named Category and Brand.
Which solution will meet these requirements?
- A . Use an Amazon DynamoDB table with on-demand capacity to store product data. Store products by primary key. Use global secondary indexes (GSIs) to store secondary attributes.
- B . Use Amazon Aurora with a Multi-AZ deployment to store product data. Use read replicas. Create indexes for primary and secondary attributes.
- C . Use an Amazon OpenSearch Serverless cluster with dynamic scaling to store product data. Index product data by primary and secondary attributes.
- D . Use Amazon ElastiCache (Redis OSS) and Amazon S3 to store product data. Use Amazon Athena to run flexible secondary attribute queries.
A
Explanation:
Option A is the correct design for single-digit millisecond latency with unpredictable spikes and variable attributes. The study material describes Amazon DynamoDB as a NoSQL database “designed for highly dynamic datasets with frequent read and write operations,” providing low-latency performance at any scale, which directly matches the latency and traffic-spike requirements.
DynamoDB’s key-value and document model fits “product data that has variable attributes” because items can contain different attributes without needing schema migrations typical of relational databases. The requirement to retrieve items by Product ID maps naturally to DynamoDB’s primary key access pattern. The requirement for flexible queries on Category and Brand is met by creating global secondary indexes (GSIs) on those attributes so queries can be served efficiently without scanning the whole table.
Option B (Aurora) can scale reads, but it is not typically the best fit for sustained single-digit millisecond performance during sudden spikes without careful capacity planning.
Option C is optimized for search and text/query relevance rather than primary-key transactional access patterns.
Option D uses Athena (interactive SQL over S3) which is not designed for millisecond-latency, high-QPS query workloads.
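The design in Option A can be sketched as a table definition. The dictionary below is shaped like the parameters of the DynamoDB CreateTable API, with on-demand billing, Product ID as the partition key, and one global secondary index each for Category and Brand. The table, attribute, and index names are assumptions made for this example.

```python
def build_product_table_definition(table_name="Products"):
    """Return CreateTable-style parameters for an on-demand product table with two GSIs."""

    def gsi(index_name, attribute_name):
        # Each GSI is keyed on one secondary attribute and projects all item attributes.
        return {
            "IndexName": index_name,
            "KeySchema": [{"AttributeName": attribute_name, "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }

    return {
        "TableName": table_name,
        # On-demand capacity: no provisioned throughput to plan for traffic spikes.
        "BillingMode": "PAY_PER_REQUEST",
        # Only key attributes are declared; other item attributes can vary freely.
        "AttributeDefinitions": [
            {"AttributeName": "ProductId", "AttributeType": "S"},
            {"AttributeName": "Category", "AttributeType": "S"},
            {"AttributeName": "Brand", "AttributeType": "S"},
        ],
        "KeySchema": [{"AttributeName": "ProductId", "KeyType": "HASH"}],
        "GlobalSecondaryIndexes": [
            gsi("CategoryIndex", "Category"),
            gsi("BrandIndex", "Brand"),
        ],
    }


table_definition = build_product_table_definition()
print(table_definition["BillingMode"])
print([index["IndexName"] for index in table_definition["GlobalSecondaryIndexes"]])
```

Lookups by Product ID use the table's primary key, while queries by Category or Brand target the matching index instead of scanning the table.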
A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR.
Which solution will meet these requirements with the LEAST operational overhead?
- A . Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3.
- B . Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig.
- C . Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries.
- D . Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.
D
Explanation:
Option D is the best solution to meet the requirements with the least operational overhead because AWS Lake Formation is a fully managed service that simplifies the process of building, securing, and managing data lakes. AWS Lake Formation allows you to define granular data access policies at the row and column level for different users and groups. AWS Lake Formation also integrates with Amazon Athena, Amazon Redshift Spectrum, and Apache Hive on Amazon EMR, enabling these services to access the data in the data lake through AWS Lake Formation.
Option A is not a good solution because S3 access policies cannot restrict data access by rows and columns. S3 access policies are based on the identity and permissions of the requester, the bucket and object ownership, and the object prefix and tags. S3 access policies cannot enforce fine-grained data access control at the row and column level.
Option B is not a good solution because it involves using Apache Ranger and Apache Pig, which are not fully managed services and require additional configuration and maintenance. Apache Ranger is a framework that provides centralized security administration for data stored in Hadoop clusters, such as Amazon EMR. Apache Ranger can enforce row-level and column-level access policies for Apache Hive tables. However, Apache Ranger is not a native AWS service and requires manual installation and configuration on Amazon EMR clusters. Apache Pig is a platform that allows you to analyze large data sets using a high-level scripting language called Pig Latin. Apache Pig can access data stored in Amazon S3 and process it using Apache Hive. However, Apache Pig is not a native AWS service and requires manual installation and configuration on Amazon EMR clusters.
Option C is not a good solution because Amazon Redshift is not a suitable service for data lake storage. Amazon Redshift is a fully managed data warehouse service that allows you to run complex analytical queries using standard SQL. Amazon Redshift can enforce row-level and column-level access policies for different users and groups. However, Amazon Redshift is not designed to store and process large volumes of unstructured or semi-structured data, which are typical characteristics of data lakes. Amazon Redshift is also more expensive and less scalable than Amazon S3 for data lake storage.
AWS Certified Data Engineer – Associate DEA-C01 Complete Study Guide
What Is AWS Lake Formation? – AWS Lake Formation
Using AWS Lake Formation with Amazon Athena – AWS Lake Formation
Using AWS Lake Formation with Amazon Redshift Spectrum – AWS Lake Formation
Using AWS Lake Formation with Apache Hive on Amazon EMR – AWS Lake Formation
Using Bucket Policies and User Policies – Amazon Simple Storage Service
Apache Ranger
Apache Pig
What Is Amazon Redshift? – Amazon Redshift
