Practice Free Professional Data Engineer Exam Online Questions
Your company is migrating its on-premises data warehousing solution to BigQuery. The existing data warehouse uses trigger-based change data capture (CDC) to apply daily updates from transactional database sources. Your company wants to use BigQuery to improve its handling of CDC and to optimize the performance of the data warehouse. Source system changes must be available for query in near-real time using log-based CDC streams. You need to ensure that changes in the BigQuery reporting table are available with minimal latency and reduced overhead.
What should you do? Choose 2 answers
- A . Perform a DML INSERT, UPDATE, or DELETE to replicate each CDC record in the reporting table in real time.
- B . Periodically DELETE outdated records from the reporting table.
- C . Periodically use a DML MERGE to simultaneously perform DML INSERT, UPDATE, and DELETE operations in the reporting table.
- D . Insert each new CDC record and corresponding operation type into a staging table in real time.
- E . Insert each new CDC record and corresponding operation type into the reporting table in real time, and use a materialized view to expose only the current version of each unique record.
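For context on the staging-table-plus-MERGE pattern referenced in options C and D, here is a minimal, hypothetical sketch of how a periodic MERGE could fold CDC rows into a reporting table. The project, dataset, table, and column names (`cdc_staging`, `reporting`, `id`, `op_type`, `change_ts`) are assumptions for illustration only, not part of the question.
```python
# Minimal sketch (assumed schema): periodically apply CDC rows from a staging
# table to the reporting table with a single MERGE statement.
from google.cloud import bigquery

client = bigquery.Client()

merge_sql = """
MERGE `my_project.dwh.reporting` AS t
USING (
  -- keep only the latest CDC record per key from the staging table
  SELECT * EXCEPT(row_num)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY change_ts DESC) AS row_num
    FROM `my_project.dwh.cdc_staging`
  )
  WHERE row_num = 1
) AS s
ON t.id = s.id
WHEN MATCHED AND s.op_type = 'DELETE' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET col_a = s.col_a, col_b = s.col_b
WHEN NOT MATCHED AND s.op_type != 'DELETE' THEN
  INSERT (id, col_a, col_b) VALUES (s.id, s.col_a, s.col_b)
"""

client.query(merge_sql).result()  # one job performs INSERT, UPDATE, and DELETE together
```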
Topic 4, Main Questions Set B
Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than before. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You have been asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs.
What should you recommend they do?
- A . Rewrite the job in Pig.
- B . Rewrite the job in Apache Spark.
- C . Increase the size of the Hadoop cluster.
- D . Decrease the size of the Hadoop cluster but also rewrite the job in Hive.
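To illustrate what option B would involve, the hypothetical sketch below expresses a daily aggregation as a PySpark job that can run on the existing YARN cluster (for example via spark-submit), keeping intermediate data in memory rather than spilling between MapReduce stages. The input path, schema, and output path are assumptions.
```python
# Hypothetical example: a daily aggregation rewritten as a PySpark job.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-analytics").getOrCreate()

# Assumed input location and schema (JSON events with user_id and event_ts).
events = spark.read.json("hdfs:///data/events/2024-01-01/")

daily_counts = (
    events
    .groupBy("user_id", F.to_date("event_ts").alias("day"))
    .agg(F.count("*").alias("event_count"))
)

# Assumed output location for the report.
daily_counts.write.mode("overwrite").parquet("hdfs:///reports/daily_counts/")
spark.stop()
```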
You need to give new website users a globally unique identifier (GUID) using a service that takes in data points and returns a GUID. This data is sourced from both internal and external systems via HTTP calls that you will make via microservices within your pipeline. There will be tens of thousands of messages per second, the calls can be multithreaded, and you are concerned about backpressure on the system.
How should you design your pipeline to minimize that backpressure?
- A . Call out to the service via HTTP
- B . Create the pipeline statically in the class definition
- C . Create a new object in the startBundle method of DoFn
- D . Batch the job into ten-second increments
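To make the DoFn-initialization idea in option C concrete, here is a minimal sketch of the general Beam pattern of creating a reusable client once per DoFn instance rather than once per element. Option C refers to startBundle in the Java SDK; the Python equivalents are setup()/start_bundle(). The GUID service URL, the response shape, and the use of the requests library are assumptions for illustration.
```python
# Illustrative sketch: build the HTTP client once and reuse it for every
# element, instead of constructing a new connection per element.
import apache_beam as beam
import requests  # assumed HTTP client

class AttachGuidFn(beam.DoFn):
    GUID_SERVICE_URL = "https://guid.example.internal/v1/guid"  # hypothetical endpoint

    def setup(self):
        # Called once per DoFn instance (start_bundle would be once per bundle):
        # reuse one session so connections are pooled across elements.
        self.session = requests.Session()

    def process(self, element):
        resp = self.session.post(self.GUID_SERVICE_URL, json=element, timeout=5)
        resp.raise_for_status()
        # Assumed response shape: {"guid": "..."}
        yield {**element, "guid": resp.json()["guid"]}

    def teardown(self):
        self.session.close()
```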
Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face.
To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?
- A . Use K-means Clustering to detect faces in the pixels.
- B . Use feature engineering to add features for eyes, noses, and mouths to the input data.
- C . Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
- D . Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.
C
Explanation:
Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as “deep” learning. So deep is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer’s output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn’t be able to "build" these features using previous hidden layers that detect low-level features, such as lines.
Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.
Reference: https://deeplearning4j.org/neuralnet-overview
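As a rough sketch of answer C, the model below stacks several hidden layers so that successive layers can learn increasingly abstract features from raw pixels. The input size, layer widths, and training call are arbitrary illustrative choices, not values from the question.
```python
# Minimal sketch: a binary "face / no face" classifier with multiple hidden
# layers that learn features directly from pixel input.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                  # grayscale pixels (assumed size)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # low-level features (edges, lines)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),   # mid-level features (eyes, noses)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),      # high-level combinations of features
    tf.keras.layers.Dense(1, activation="sigmoid"),     # face vs. no face
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # trained on the labeled dataset
```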
You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization.
Which two actions can you take to increase performance of your pipeline? (Choose two.)
- A . Increase the number of max workers
- B . Use a larger instance type for your Cloud Dataflow workers
- C . Change the zone of your Cloud Dataflow pipeline to run in us-central1
- D . Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery
- E . Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery
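For reference on options A and B, worker count and machine size are typically set through pipeline options at job launch. The sketch below uses the Apache Beam Python flag names as I recall them (verify against your SDK version); the project, bucket, and specific values are placeholders.
```python
# Sketch: raise the autoscaling ceiling (option A) and use larger workers
# (option B) via Beam pipeline options. All values are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(flags=[
    "--runner=DataflowRunner",
    "--project=my-project",                  # placeholder
    "--region=europe-west4",
    "--temp_location=gs://my-bucket/temp",   # placeholder
    "--streaming",
    "--max_num_workers=10",                  # option A: allow more than 3 workers
    "--worker_machine_type=n1-standard-4",   # option B: larger than n1-standard-1
])

with beam.Pipeline(options=options) as p:
    # ... existing Pub/Sub -> transform -> BigQuery steps go here ...
    pass
```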
You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time.
What should you do?
- A . Specify a worker region by using the --region flag.
- B . Set the pipeline staging location as a regional Cloud Storage bucket.
- C . Submit duplicate pipelines in two different zones by using the --zone flag.
- D . Create an Eventarc trigger to resubmit the job in case of zonal failure when submitting the job.
A
Explanation:
By specifying a worker region, you can run your Dataflow pipeline in a multi-zone configuration, which provides higher availability and resilience in case of zonal failures. The --region flag specifies the regional endpoint for your pipeline, which determines the location of the Dataflow service and the default location of the Compute Engine resources. If you do not specify a zone with the --zone flag, Dataflow automatically selects an available zone within the region for your job workers. This option is recommended over submitting duplicate pipelines in two different zones, which would incur additional cost and complexity. Setting the pipeline staging location to a regional Cloud Storage bucket does not affect the availability of your pipeline, as the staging location only stores the pipeline code and dependencies. Creating an Eventarc trigger to resubmit the job in case of zonal failure is not a reliable solution, as it depends on the availability of the Eventarc service and of the zonal resources at the time of resubmission.
Reference:
Regional endpoints | Cloud Dataflow | Google Cloud
Pipeline troubleshooting and debugging | Cloud Dataflow | Google Cloud
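A minimal illustration of option A, again using the Beam Python options form (an assumption; the same --region flag exists for the other SDKs and for template submission): specify only a worker region and let Dataflow choose a healthy zone within it at submission time.
```python
# Sketch: submit the batch job with a region only; Dataflow then picks an
# available zone within the region, so a single-zone outage does not block it.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(flags=[
    "--runner=DataflowRunner",
    "--project=my-project",                 # placeholder
    "--region=us-central1",                 # region only; no --zone / --worker_zone
    "--temp_location=gs://my-bucket/temp",  # placeholder
])
```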
One of your encryption keys stored in Cloud Key Management Service (Cloud KMS) was exposed. You need to re-encrypt all of your CMEK-protected Cloud Storage data that used that key, and then delete the compromised key. You also want to reduce the risk of objects getting written without customer-managed encryption key (CMEK) protection in the future.
What should you do?
- A . Rotate the Cloud KMS key version. Continue to use the same Cloud Storage bucket.
- B . Create a new Cloud KMS key. Set the default CMEK key on the existing Cloud Storage bucket to the new one.
- C . Create a new Cloud KMS key. Create a new Cloud Storage bucket. Copy all objects from the old bucket to the new bucket while specifying the new Cloud KMS key in the copy command.
- D . Create a new Cloud KMS key. Create a new Cloud Storage bucket configured to use the new key as the default CMEK key. Copy all objects from the old bucket to the new bucket without specifying a key.
D
Explanation:
To re-encrypt all of your CMEK-protected Cloud Storage data after a key has been exposed, and to ensure future writes are protected with a new key, creating a new Cloud KMS key and a new Cloud Storage bucket is the best approach.
Here’s why option D is the best choice:
Re-encryption of Data:
By creating a new Cloud Storage bucket that uses the new key as its default CMEK and copying all objects from the old bucket to the new bucket, you ensure that all data is re-encrypted with the new key.
This process effectively re-encrypts the data, removing any dependency on the compromised key.
Ensuring CMEK Protection:
Creating a new bucket and setting the new CMEK as the default ensures that all future objects written to the bucket are automatically protected with the new key.
This reduces the risk of objects being written without CMEK protection.
Deletion of Compromised Key:
Once the data has been copied and re-encrypted, the old key can be safely deleted from Cloud KMS, eliminating the risk associated with the compromised key.
Steps to Implement:
Create a New Cloud KMS Key:
Create a new encryption key in Cloud KMS to replace the compromised key.
Create a New Cloud Storage Bucket:
Create a new Cloud Storage bucket and set the default CMEK to the new key.
Copy and Re-encrypt Data:
Use the gsutil tool to copy data from the old bucket to the new bucket. Because the new bucket's default CMEK is the new key, the copied objects are re-encrypted with it automatically:
gsutil -m cp -r gs://old-bucket/* gs://new-bucket/
Delete the Old Key:
After ensuring all data is copied and re-encrypted, delete the compromised key from Cloud KMS.
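If you prefer the client libraries over gsutil, the sketch below shows the same idea with the google-cloud-storage Python client: set the new key as the bucket's default CMEK so that later writes are protected even when no key is specified. The project, bucket, key ring, and key names are placeholders.
```python
# Sketch: configure the new bucket's default CMEK; subsequent writes and
# copies into this bucket then use the new key without specifying it per object.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("new-bucket")  # placeholder bucket name

# Key resource name format: projects/PROJECT/locations/LOC/keyRings/RING/cryptoKeys/KEY
bucket.default_kms_key_name = (
    "projects/my-project/locations/europe/keyRings/my-ring/cryptoKeys/new-key"
)
bucket.patch()  # future writes now default to the new CMEK
```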
Reference:
Cloud KMS Documentation
Cloud Storage Encryption
Re-encrypting Data in Cloud Storage
You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old.
What should you do?
- A . Disable caching by editing the report settings.
- B . Disable caching in BigQuery by editing table details.
- C . Refresh your browser tab showing the visualizations.
- D . Clear your browser history for the past hour, then reload the tab showing the visualizations.
A
Explanation:
Reference: https://support.google.com/datastudio/answer/7020039?hl=en
