Practice Free Professional Data Engineer Exam Online Questions
You are migrating a table to BigQuery and are deciding on the data model. Your table stores information related to purchases made across several store locations and includes information like the time of the transaction, items purchased, the store ID, and the city and state in which the store is located. You frequently query this table to see how many of each item were sold over the past 30 days and to look at purchasing trends by state, city, and individual store. You want to model this table to minimize query time and cost.
What should you do?
- A . Partition by transaction time; cluster by state first, then city, then store ID.
- B . Partition by transaction time; cluster by store ID first, then city, then state.
- C . Top-level cluster by state first, then city, then store.
- D . Top-level cluster by store ID first, then city, then state.
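For context, the sketch below shows how time partitioning and multi-column clustering are configured with the google-cloud-bigquery Python client. The project, dataset, table, schema, and clustering order are illustrative assumptions for this example, not the graded answer.

```python
# Hypothetical sketch: create a table partitioned on transaction time and
# clustered on several columns. Names and the clustering order are illustrative.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("transaction_time", "TIMESTAMP"),
    bigquery.SchemaField("item", "STRING"),
    bigquery.SchemaField("store_id", "STRING"),
    bigquery.SchemaField("city", "STRING"),
    bigquery.SchemaField("state", "STRING"),
]

table = bigquery.Table("my-project.retail.purchases", schema=schema)

# Daily partitions on the transaction timestamp keep "last 30 days" scans cheap.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="transaction_time",
)

# Within each partition, data is sorted by these columns, so filters on the
# leading clustering columns prune the most data.
table.clustering_fields = ["state", "city", "store_id"]

table = client.create_table(table)
print(f"Created {table.full_table_id}")
```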
Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO. During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully.
What should you do next?
- A . Check the dashboard application to see if it is not displaying correctly.
- B . Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
- C . Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
- D . Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.
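As an illustration of the fixed-dataset test described in option B, here is a minimal, hypothetical Apache Beam sketch that pushes a known set of JSON messages through a parse step on the local DirectRunner. The message fields and parse logic are stand-ins for the real pipeline; the deliberately malformed record shows one way messages can silently disappear before reaching the dashboard.

```python
# Hypothetical sketch: run a fixed dataset through the pipeline locally and
# inspect the output instead of relying on live Pub/Sub traffic.
import json

import apache_beam as beam

FIXED_MESSAGES = [
    '{"amount": 120.50, "currency": "USD"}',
    '{"amount": 75.00, "currency": "EUR"}',
    "not valid json",  # deliberately malformed input
]

def parse(message):
    # Silently swallowing parse errors is one way messages "go missing";
    # a fixed dataset makes that easy to spot in the output.
    try:
        yield json.loads(message)
    except ValueError:
        pass

with beam.Pipeline() as pipeline:  # defaults to the local DirectRunner
    (
        pipeline
        | "CreateFixedInput" >> beam.Create(FIXED_MESSAGES)
        | "Parse" >> beam.FlatMap(parse)
        | "Print" >> beam.Map(print)
    )
```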
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?
- A . Field promotion
- B . Randomization
- C . Salting
- D . Hashing
A
Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
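A small, hypothetical Python sketch of field promotion for a time-series row key follows; the device ID, timestamp format, and delimiter are illustrative choices, not a prescribed schema.

```python
# Hypothetical sketch of field promotion: a field that would otherwise live in
# a column (here, a device ID) is promoted into the row key ahead of the
# timestamp, so writes spread across many key prefixes instead of all landing
# on the "latest timestamp" rows.
from datetime import datetime, timezone

def make_row_key(device_id: str, event_time: datetime) -> bytes:
    # A timestamp-first key (e.g. "20240101120000#device42") hotspots because
    # every new write shares the newest prefix; device-first keys do not.
    ts = event_time.strftime("%Y%m%d%H%M%S")
    return f"{device_id}#{ts}".encode("utf-8")

key = make_row_key("device42", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(key)  # b'device42#20240101120000'
```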
You are developing a model to identify the factors that lead to sales conversions for your customers. You have completed processing your data. You want to continue through the model development lifecycle.
What should you do next?
- A . Use your model to run predictions on fresh customer input data.
- B . Test and evaluate your model on your curated data to determine how well the model performs.
- C . Monitor your model performance, and make any adjustments needed.
- D . Delineate what data will be used for testing and what will be used for training the model.
B
Explanation:
After processing your data, the next step in the model development lifecycle is to test and evaluate your model on the curated data. This is crucial to determine the performance of the model and to understand how well it can predict sales conversions for your customers. The evaluation phase involves using various metrics and techniques to assess the accuracy, precision, recall, and other relevant performance indicators of the model. It helps in identifying any issues or areas for improvement before deploying the model in a production environment.
Reference: The information provided here is verified by the Google Professional Data Engineer Certification Exam Guide and related resources, which outline the steps and best practices in the model development lifecycle.
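For illustration, the hedged scikit-learn sketch below shows what the evaluation step can look like once the data has been split: hold out a test set, train a simple classifier, and report accuracy, precision, and recall. The dataset, model, and metric choices are assumptions made only for this example.

```python
# Hypothetical sketch of the test-and-evaluate step on curated data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in for the processed/curated customer data.
X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

# Delineate training vs. test data, then train and evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
predictions = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, predictions))
print("precision:", precision_score(y_test, predictions))
print("recall:", recall_score(y_test, predictions))
```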
Which of the following are feature engineering techniques? (Select 2 answers)
- A . Hidden feature layers
- B . Feature prioritization
- C . Crossed feature columns
- D . Bucketization of a continuous feature
CD
Explanation:
Selecting and crafting the right set of feature columns is key to learning an effective model.
Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into.
Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
Reference: https://www.tensorflow.org/tutorials/wide#selecting_and_engineering_features_for_the_model
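The short pandas sketch below illustrates both techniques outside of TensorFlow: bucketizing a continuous feature and crossing two categorical features. The column names and bin edges are hypothetical.

```python
# Hypothetical sketch of bucketization and a crossed feature.
import pandas as pd

df = pd.DataFrame(
    {
        "age": [23, 35, 47, 62],
        "state": ["CA", "NY", "CA", "TX"],
        "education": ["BA", "MS", "HS", "BA"],
    }
)

# Bucketization: map the continuous "age" column to categorical bucket IDs.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 120], labels=False)

# Crossed feature: combine two base features so a model can learn interactions
# (e.g. state x education) that neither feature captures on its own.
df["state_x_education"] = df["state"] + "_" + df["education"]

print(df)
```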
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters.
What should you do?
- A . Increase the cluster size with more non-preemptible workers.
- B . Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
- C . Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
- D . Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
D
Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex
