Practice Free Professional Data Engineer Exam Online Questions
You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses.
What should you do?
- A . Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.
- B . Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.
- C . Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.
- D . Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.
D
Explanation:
To deploy a batch pipeline in Dataflow that adheres to the organizational constraint of using only internal IP addresses, ensuring Private Google Access is the most effective solution.
Here’s why option D is the best choice:
Private Google Access:
Private Google Access allows resources in a VPC network that do not have external IP addresses to access Google APIs and services through internal IP addresses.
This ensures compliance with the organizational constraint of using only internal IPs while allowing Dataflow to access Cloud Storage and BigQuery.
Dataflow with Internal IPs:
Dataflow can be configured to use only internal IP addresses for its worker nodes, ensuring that no external IP addresses are assigned.
This configuration ensures secure and compliant communication between Dataflow, Cloud Storage, and BigQuery.
Firewall and Network Configuration:
Enabling Private Google Access requires ensuring the correct firewall rules and network configurations to allow internal traffic to Google Cloud services.
Steps to Implement:
Enable Private Google Access:
Enable Private Google Access on the subnetwork used by the Dataflow pipeline:
gcloud compute networks subnets update [SUBNET_NAME] \
  --region [REGION] \
  --enable-private-ip-google-access
Configure Dataflow:
Configure the Dataflow job to use only internal IP addresses (a Beam pipeline-options sketch follows these steps):
gcloud dataflow jobs run [JOB_NAME] \
  --gcs-location gs://[TEMPLATE_PATH] \
  --region [REGION] \
  --network [VPC_NETWORK] \
  --subnetwork [SUBNETWORK] \
  --disable-public-ips
Verify Access:
Ensure that firewall rules allow the necessary traffic from the Dataflow workers to Cloud Storage and BigQuery using internal IPs.
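If you launch the pipeline from the Beam Python SDK instead of from a template, the same constraint can be expressed in the pipeline options. The following is a minimal sketch, not the full pipeline from the question; the project, region, bucket, subnetwork, and table names are placeholders, and the BigQuery table is assumed to already exist.

```python
# Minimal sketch: run a Beam batch pipeline on Dataflow with internal-IP-only workers.
# All resource names below are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                                     # placeholder project ID
    "--region=us-central1",                                     # placeholder region
    "--temp_location=gs://my-bucket/tmp",                       # placeholder staging bucket
    "--subnetwork=regions/us-central1/subnetworks/my-subnet",   # subnet with Private Google Access enabled
    "--no_use_public_ips",                                      # workers receive internal IP addresses only
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Transform" >> beam.Map(lambda line: {"raw": line})      # trivial transform for illustration
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.my_table",                    # placeholder table (assumed to exist)
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```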
Reference:
Private Google Access Documentation
Configuring Dataflow to Use Internal IPs
VPC Firewall Rules
You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use a Google managed service to simplify and automate the
task. You also want to accommodate Shared VPC networking considerations.
What should you do?
- A . Use Dataflow for your workflow pipelines. Use Cloud Run triggers for scheduling.
- B . Use Dataflow for your workflow pipelines. Use shell scripts to schedule workflows.
- C . Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the host project.
- D . Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.
D
Explanation:
Shared VPC requires that you designate a host project, to which the networks and subnetworks belong, and a service project, which is attached to the host project. When Cloud Composer participates in a Shared VPC, the Cloud Composer environment is created in the service project, while the network and subnetwork remain in the host project.
Reference: https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc
You need to compose visualization for operations teams with the following requirements:
Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)
The report must not be more than 3 hours delayed from live data.
The actionable report should only show suboptimal links.
Most suboptimal links should be sorted to the top.
Suboptimal links can be grouped and filtered by regional geography.
User response time to load the report must be <5 seconds.
You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month.
What should you do?
- A . Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
- B . Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
- C . Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.
- D . Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
B
Explanation:
A small set of generalized charts and tables bound to criteria filters lets viewers select the date range, geographic region, and installation type themselves, so the same visualizations always show the latest data and never have to be recreated or updated each month. Filtering the view to suboptimal links and sorting the worst links to the top keeps the report actionable, and serving everything from the prepared 6-week data source keeps the load time under 5 seconds.
Option A is incorrect because building a separate chart or table for every combination of criteria would require constantly creating and maintaining new visualizations.
Option C is incorrect because exporting to a spreadsheet cannot keep pace with 50,000 installations sampled every minute or stay within 3 hours of live data.
Option D is incorrect because writing a custom App Engine application that queries and summarizes all rows adds unnecessary development and operational effort and makes the 5-second response-time requirement harder to meet.
You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate.
What should you do?
- A . Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.
- B . Partition the inventory balance table by item to reduce the amount of data scanned with each inventory update.
- C . Use BigQuery streaming to stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
- D . Use the BigQuery bulk loader to batch load inventory changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
C
Explanation:
Streaming the inventory changes into a daily movement table makes each update visible in BigQuery within seconds, so a view that joins the movement table to the historical balance table always returns accurate, near real-time balances. The nightly job then folds the day's movements into the balance table, keeping the joined view small and the dashboard fast. Bulk loading (option D) introduces batch latency, running thousands of UPDATE statements per hour (option A) is inefficient in BigQuery, and partitioning by item (option B) does not address update latency or accuracy. A minimal sketch of the streaming step follows.
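The following is a minimal, hypothetical sketch of the streaming step using the google-cloud-bigquery client; the project, dataset, table names, and row values are placeholders, and the daily movement table and balance view are assumed to already exist.

```python
# Minimal sketch: stream inventory changes into a daily movement table.
# Project, dataset, table names, and row values are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
movement_table = "my-project.inventory.daily_movements"

rows = [
    {"item_id": "SKU-123", "location": "WH-1", "qty_delta": -5, "ts": "2024-01-01T10:00:00Z"},
    {"item_id": "SKU-456", "location": "WH-2", "qty_delta": 12, "ts": "2024-01-01T10:00:30Z"},
]

# Streamed rows become queryable within seconds, so a view that joins
# daily_movements to the historical balance table stays near real time
# without issuing UPDATE statements against the balance table.
errors = client.insert_rows_json(movement_table, rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```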
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples.
Which two characteristics support this method? (Choose two.)
- A . There are very few occurrences of mutations relative to normal samples.
- B . There are roughly equal occurrences of both normal and mutated samples in the database.
- C . You expect future mutations to have different features from the mutated samples in the database.
- D . You expect future mutations to have similar features to the mutated samples in the database.
- E . You already have labels for which samples are mutated and which are normal in the database.
AD
Explanation:
Unsupervised anomaly detection techniques detect anomalies in an unlabeled data set under the assumption that the large majority of instances are normal, by looking for the instances that fit least well with the rest of the data. This fits the scenario when mutations are rare relative to normal samples (option A) and when future mutations are expected to resemble the mutated samples already observed, so they too will stand apart from the normal majority (option D).
Reference: https://en.wikipedia.org/wiki/Anomaly_detection
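For illustration only (not part of the question), an unsupervised detector such as scikit-learn's IsolationForest can be fit on unlabeled samples and will flag the rare, dissimilar ones as anomalies. The feature values and contamination rate below are made-up placeholders.

```python
# Minimal sketch: unsupervised anomaly detection on unlabeled samples.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))   # majority of samples look "normal"
mutated = rng.normal(loc=4.0, scale=1.0, size=(10, 4))    # rare samples with different features
samples = np.vstack([normal, mutated])                    # no labels are used anywhere

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(samples)

# predict() returns -1 for anomalies and 1 for inliers.
flags = model.predict(samples)
print(f"Flagged {int((flags == -1).sum())} of {len(samples)} samples as anomalous")
```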
You are planning to use Google’s Dataflow SDK to analyze customer data such as displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?
- A . ParDo
- B . Sink API
- C . Source API
- D . Data extraction
A
Explanation:
In the Google Cloud Dataflow SDK, you can use ParDo to extract the customer name from each element in your PCollection.
Reference: https://cloud.google.com/dataflow/model/par-do
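For illustration, here is a minimal Beam Python sketch of such a ParDo; the input path is a placeholder and the final print step is only for local inspection.

```python
# Minimal sketch: extract only the customer name from "name,address" records using ParDo.
import apache_beam as beam


class ExtractCustomerName(beam.DoFn):
    def process(self, element):
        # "Tom,555 X street" -> "Tom"
        yield element.split(",", 1)[0].strip()


with beam.Pipeline() as p:   # DirectRunner by default for a local test
    (p
     | "ReadCustomers" >> beam.io.ReadFromText("gs://my-bucket/customers.csv")  # placeholder path
     | "ExtractName" >> beam.ParDo(ExtractCustomerName())
     | "Print" >> beam.Map(print))   # replace with a real sink in production
```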
