Practice Free Professional Data Engineer Exam Online Questions
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly.
How should you optimize the cluster for cost?
- A . Migrate the workload to Google Cloud Dataflow
- B . Use pre-emptible virtual machines (VMs) for the cluster
- C . Use a higher-memory node so that the job runs faster
- D . Use SSDs on the worker nodes so that the job can run faster
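If pre-emptible secondary workers (option B) are the chosen cost optimization, the weekly cluster could be created along these lines with the google-cloud-dataproc Python client. This is only a sketch: the project, region, cluster name, and the 2-primary/13-secondary split are illustrative placeholders.

```python
from google.cloud import dataproc_v1

# Placeholder values -- substitute your own project, region, and names.
PROJECT_ID = "my-project"
REGION = "us-central1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "weekly-spark-model",
    "config": {
        # Keep a small primary worker group and put most of the 15 nodes on
        # cheap pre-emptible secondary workers to cut the cost of the job.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "secondary_worker_config": {
            "num_instances": 13,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
print(operation.result().cluster_name)
```

Because the job runs for only about 30 minutes once a week, deleting the cluster after each run (or using Dataproc's scheduled deletion) compounds the savings from the pre-emptible workers.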
Your company is planning to migrate a large on-premises data warehouse to BigQuery. The data is currently stored in a proprietary, vendor-specific format. You need to perform a batch migration of this data to BigQuery.
What should you do?
- A . Use the bq command-line tool to load the data directly from the on-premises data warehouse.
- B . Use the BigQuery Data Transfer Service.
- C . Export the data to CSV files, upload the files to Cloud Storage, then load the files into BigQuery.
- D . Use Datastream to replicate the data in real time.
C
Explanation:
The challenge here is dealing with a "proprietary, vendor-specific format" for a one-time batch migration.
Option C is the correct answer because it represents the most universal and reliable pattern for migration from any source system. By first exporting the data into a standard, interoperable format like CSV (or preferably, a self-describing format like Avro or Parquet), you decouple the process from the proprietary source. These standard files can then be easily uploaded to Cloud Storage (the recommended staging area for BigQuery loads) and loaded into BigQuery in a highly performant and parallelized manner.
Option A is incorrect because the bq command-line tool cannot connect directly to an on-premises data warehouse to pull data. It loads data from files or streams.
Option B is incorrect because the BigQuery Data Transfer Service (DTS) has connectors for specific, common data sources (like Teradata, Redshift, S3). It is unlikely to have a connector for a generic "proprietary, vendor-specific format."
Option D is incorrect because Datastream is a Change Data Capture (CDC) service designed for real-time replication of databases, not for a large-scale, one-time batch migration of a data warehouse.
Reference (Google Cloud Documentation Concepts): Google Cloud’s "Data warehouse migration to BigQuery" guide outlines several migration strategies. For batch data transfer, the recommended path is to extract the data from the source into files (in formats such as CSV, Avro, or Parquet), transfer those files to Cloud Storage, and then load them into BigQuery. This approach is recommended for its reliability and its compatibility with any source system.
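A minimal sketch of the final load step described above, using the BigQuery Python client to load exported CSV files from a Cloud Storage staging bucket; the table ID and bucket path are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical identifiers -- replace with your own dataset, table, and bucket.
table_id = "my-project.warehouse_migration.customers"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # the exported files include a header row
    autodetect=True,      # let BigQuery infer the schema for this sketch
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/export/customers-*.csv",
    table_id,
    job_config=job_config,
)
load_job.result()  # wait for the batch load job to finish

print(client.get_table(table_id).num_rows, "rows loaded")
```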
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data.
What should you do?
- A . Add a SideInput that returns a Boolean if the element is corrupt.
- B . Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
- C . Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
- D . Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
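If a ParDo transform (option B) is used to discard the corrupt elements, the relevant pipeline step might look roughly like the following Apache Beam (Python SDK) sketch; is_corrupt is a hypothetical validation helper, and the in-memory Create source stands in for the real Pub/Sub read.

```python
import apache_beam as beam


def is_corrupt(element: dict) -> bool:
    """Hypothetical check: treat a reading without a device_id as corrupt."""
    return not element.get("device_id")


class DropCorrupt(beam.DoFn):
    def process(self, element):
        # Yield only valid elements; corrupt ones are simply never emitted.
        if not is_corrupt(element):
            yield element


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadSample" >> beam.Create([{"device_id": "a1", "temp": 21.5}, {"temp": 19.0}])
        | "DropCorrupt" >> beam.ParDo(DropCorrupt())
        | "Print" >> beam.Map(print)
    )
```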
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster.
What should you do?
- A . Create a Google Cloud Dataflow job to process the data.
- B . Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
- C . Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D . Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
- E . Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
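For reference, on a Dataproc cluster the Cloud Storage connector is preinstalled, so existing Spark and Hadoop jobs can read and write gs:// paths in place of HDFS, and the data persists after the cluster is deleted. A minimal PySpark sketch with a hypothetical bucket:

```python
from pyspark.sql import SparkSession

# On Dataproc the Cloud Storage connector is preinstalled, so gs:// paths work
# much like HDFS paths, but the data outlives the cluster.
spark = SparkSession.builder.appName("gcs-connector-example").getOrCreate()

# Hypothetical bucket and paths, used purely for illustration.
events = spark.read.json("gs://my-data-bucket/input/events/")
summary = events.groupBy("event_type").count()
summary.write.mode("overwrite").parquet("gs://my-data-bucket/output/event_counts/")
```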
You are developing a software application using Google’s Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline.
Which component will be used for the data processing operation?
- A . PCollection
- B . Transform
- C . Pipeline
- D . Sink API
B
Explanation:
In the Dataflow SDK, the transform component is responsible for data processing operations. You can use conditionals, for loops, and other complex programming structures to create a branching pipeline.
Reference: https://cloud.google.com/dataflow/model/programming-model
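To illustrate, in the Apache Beam Python SDK (the successor to the Dataflow SDK) ordinary loops and conditionals can decide which transforms are applied, yielding a branching pipeline from a single input PCollection. The thresholds and labels below are made up for the example.

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    readings = pipeline | "Create" >> beam.Create([3, 12, 7, 25, 1])

    # Ordinary Python control flow (a loop plus a conditional) decides which
    # transforms are built, producing a branching pipeline from one input.
    for label, limit in {"low": 5, "high": 20}.items():
        if label == "low":
            branch = readings | f"Keep_{label}" >> beam.Filter(lambda x, n=limit: x < n)
        else:
            branch = readings | f"Keep_{label}" >> beam.Filter(lambda x, n=limit: x > n)
        branch | f"Print_{label}" >> beam.Map(lambda x, tag=label: print(tag, x))
```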
Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users.
How should you design the frontend to respond to a database failure?
- A . Issue a command to restart the database servers.
- B . Retry the query with exponential backoff, up to a cap of 15 minutes.
- C . Retry the query every second until it comes back online to minimize staleness of data.
- D . Reduce the query frequency to once every hour until the database comes back online.
B
Explanation:
https://cloud.google.com/sql/docs/mysql/manage-connections#backoff
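A minimal sketch of capped exponential backoff with jitter; query_temperature is a hypothetical database call that raises while the database is unavailable.

```python
import random
import time


def query_temperature():
    """Hypothetical database call -- raises while the database is unavailable."""
    raise ConnectionError("database unavailable")


def query_with_backoff(max_wait_seconds=15 * 60):
    """Keep retrying until the query succeeds, doubling the wait each time."""
    delay = 1
    while True:
        try:
            return query_temperature()
        except ConnectionError:
            # Wait for the current delay plus a little jitter, then double it,
            # but never wait more than the 15-minute cap between attempts.
            time.sleep(delay + random.uniform(0, 1))
            delay = min(delay * 2, max_wait_seconds)
```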
How can you get a neural network to learn about relationships between categories in a categorical feature?
- A . Create a multi-hot column
- B . Create a one-hot column
- C . Create a hash bucket
- D . Create an embedding column
D
Explanation:
There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn’t encode any relationships between the categories. They are completely independent of each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let’s say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
Reference: https://cloudacademy.com/google/introduction-to-google-cloud-machine-learning-engine-course/a-wide-and-deep-model.html
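For illustration, a minimal Keras sketch that maps a categorical feature to a learned embedding vector; the vocabulary and the 5-dimensional embedding size follow the explanation above and are otherwise arbitrary.

```python
import tensorflow as tf

# Arbitrary example vocabulary for a categorical feature.
categories = ["books", "music", "movies", "games", "toys"]

# StringLookup turns category strings into integer ids; Embedding maps each id
# to a trainable 5-dimensional vector, so categories that behave similarly can
# end up with similar vectors after training.
lookup = tf.keras.layers.StringLookup(vocabulary=categories)
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(), output_dim=5
)

ids = lookup(tf.constant(["books", "games"]))
vectors = embedding(ids)
print(vectors.shape)  # (2, 5)
```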
