Practice Free Professional Data Engineer Exam Online Questions
You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec.
How should you migrate this data to Cloud Storage?
- A . Use Transfer Appliance to copy the data to Cloud Storage
- B . Use gsutil cp -J to compress the content being uploaded to Cloud Storage
- C . Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage
- D . Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic
You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention.
What should you do?
- A . Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
- B . Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
- C . Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
- D . Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
You are using Workflows to call an API that returns a 1 KB JSON response, apply some complex business logic on this response, wait for the logic to complete, and then perform a load from a Cloud Storage file to BigQuery. The Workflows standard library does not have sufficient capabilities to perform your complex logic, and you want to use Python’s standard library instead. You want to optimize your workflow for simplicity and speed of execution.
What should you do?
- A . Invoke a Cloud Function instance that uses Python to apply the logic on your JSON file.
- B . Invoke a subworkflow in Workflows to apply the logic on your JSON file.
- C . Create a Cloud Composer environment and run the logic in Cloud Composer.
- D . Create a Dataproc cluster, and use PySpark to apply the logic on your JSON file.
You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.
How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
- A . Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
- B . Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
- C . Scan every table in BigQuery, and mask the data it finds that has PII.
- D . Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
You are creating a data model in BigQuery that will hold retail transaction data. Your two largest tables, sales_transaction_header and sales_transaction_line, have a tightly coupled immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transaction_header and sales_transaction_line tables to improve the performance of data analytics queries.
What should you do?
- A . Create a sales_transaction table that stores the sales_transaction_header and sales_transaction_line data as a JSON data type.
- B . Create a sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields.
- C . Create a sales_transaction table that holds the sales_transaction_header and sales_transaction_line information as rows, duplicating the sales_transaction_header data for each line.
- D . Create separate sales_transaction_header and sales_transaction_line tables and, when querying, specify the sales_transaction_line table first in the WHERE clause.
B
Explanation:
BigQuery supports nested and repeated fields, which are complex data types that can represent hierarchical and one-to-many relationships within a single table. By using nested and repeated fields, you can denormalize your data model and reduce the number of joins required for your queries. This can improve the performance and efficiency of your data analytics queries, as joins can be expensive and require shuffling data across nodes. Nested and repeated fields also preserve the data integrity and avoid data duplication.
In this scenario, the sales_transaction_header and sales_transaction_line tables have a tightly coupled immutable relationship, meaning that each header row corresponds to one or more line rows, and the data is rarely modified after load. Therefore, it makes sense to create a single sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. This way, you can query the sales transaction data without joining two tables, and use dot notation or array functions to access the nested and repeated fields. For example, the sales_transaction table could have the following schema:
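A minimal sketch of such a schema (only id, line_items.quantity, and line_items.price are taken from the query below; the dataset name and the remaining fields are illustrative placeholders):
SQL
CREATE TABLE mydataset.sales_transaction (
  id STRING,                    -- header-level transaction identifier
  transaction_date TIMESTAMP,   -- placeholder header attribute
  customer_id STRING,           -- placeholder header attribute
  line_items ARRAY<STRUCT<      -- one element per sales_transaction_line row
    product_id STRING,          -- placeholder line attribute
    quantity INT64,
    price NUMERIC
  >>
);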
To query the total amount of each order, you could use the following SQL statement:
SQL
SELECT
  id,
  SUM(li.quantity * li.price) AS total_amount
FROM sales_transaction,
  UNNEST(line_items) AS li  -- flatten the repeated line_items so each line can be aggregated per header id
GROUP BY id;
Reference: Use nested and repeated fields
BigQuery explained: Working with joins, nested & repeated data
Arrays in BigQuery
How to improve query performance and optimise storage
You work for a large ecommerce company. You store your customers' order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead.
What should you do?
- A . Set the expiring values of the column families to 30 days and set the number of versions to 2.
- B . Use a timestamp range filter in the query to fetch the customer’s data for a specific range.
- C . Set the expiring values of the column families to 29 days and keep the number of versions to 1.
- D . Schedule a job daily to scan the data in the table and delete data older than 30 days.
B
Explanation:
By using a timestamp range filter in the query, you can ensure that the analysts only see the customer data that is within the desired time range, regardless of the garbage collection policy [1]. This option is the most cost-effective and simple way to avoid fetching data that is marked for deletion by garbage collection, as it does not require changing the existing policy or creating additional jobs. You can use the Bigtable client libraries or the cbt CLI to apply a timestamp range filter to your read requests [2].
Option A is not effective, as it increases the number of versions to 2, which may cause more data to be retained and increase the storage costs.
Option C is not reliable, as it reduces the expiring values to 29 days, which may not match the actual data arrival and usage patterns.
Option D is not efficient, as it requires scheduling a job daily to scan and delete the data, which may incur additional overhead and complexity. Moreover, none of these options guarantee that the data older than 30 days will be immediately deleted, as garbage collection is an asynchronous process that can take up to a week to remove the data [3].
Reference:
1: Filters | Cloud Bigtable Documentation | Google Cloud
2: Read data | Cloud Bigtable Documentation | Google Cloud
3: Garbage collection overview | Cloud Bigtable Documentation | Google Cloud
You are designing a data warehouse in BigQuery to analyze sales data for a telecommunications service provider. You need to create a data model for customers, products, and subscriptions. All customers, products, and subscriptions can be updated monthly, but you must maintain a historical record of all data. You plan to use the visualization layer for current and historical reporting. You need to ensure that the data model is simple, easy to use, and cost-effective.
What should you do?
- A . Create a normalized model with tables for each entity. Use snapshots before updates to track historical data.
- B . Create a normalized model with tables for each entity. Keep all input files in a Cloud Storage bucket to track historical data.
- C . Create a denormalized model with nested and repeated fields. Update the table and use snapshots to track historical data.
- D . Create a denormalized, append-only model with nested and repeated fields. Use the ingestion timestamp to track historical data.
D
Explanation:
- A denormalized, append-only model reduces query complexity by eliminating the need for joins.
- Adding data with an ingestion timestamp allows for easy retrieval of both current and historical states.
- Instead of updating records, new records are appended, which maintains historical information without the need to create separate snapshots.
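As an illustration of how current and historical state are read from such an append-only model (a minimal sketch; the customers table and the customer_id and ingestion_timestamp column names are assumptions, not from the question):
SQL
-- Current state: keep only the most recently ingested row per customer.
SELECT *
FROM customers
WHERE TRUE
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY ingestion_timestamp DESC) = 1;

-- Historical reporting: the same pattern restricted to rows ingested on or before a reporting date.
SELECT *
FROM customers
WHERE ingestion_timestamp <= TIMESTAMP('2024-01-01')
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY ingestion_timestamp DESC) = 1;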
