Practice Free Professional Data Engineer Exam Online Questions
After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You’ve loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are
identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?
- A . Select random samples from the tables using the RAND() function and compare the samples.
- B . Select random samples from the tables using the HASH() function and compare the samples.
- C . Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
- D . Create stratified random samples using the OVER() function and compare equivalent samples from each table.
You are using BigQuery and Data Studio to design a customer-facing dashboard that displays large quantities of aggregated data. You expect a high volume of concurrent users. You need to optimize the dashboard to provide quick visualizations with minimal latency.
What should you do?
- A . Use BigQuery BI Engine with materialized views
- B . Use BigQuery BI Engine with streaming data.
- C . Use BigQuery BI Engine with authorized views
- D . Use BigQuery BI Engine with logical views
You have a variety of files in Cloud Storage that your data science team wants to use in their models. Currently, users do not have a method to explore, cleanse, and validate the data in Cloud Storage. You are looking for a low-code solution that can be used by your data science team to quickly cleanse and explore data within Cloud Storage.
What should you do?
- A . Load the data into BigQuery and use SQL to transform the data as necessary. Provide the data science team access to staging tables to explore the raw data.
- B . Provide the data science team access to Dataflow to create a pipeline to prepare and validate the raw data and load data into BigQuery for data exploration.
- C . Provide the data science team access to Dataprep to prepare, validate, and explore the data within Cloud Storage.
- D . Create an external table in BigQuery and use SQL to transform the data as necessary. Provide the data science team access to the external tables to explore the raw data.
C
Explanation:
Dataprep is a low-code, serverless, and fully managed service that allows users to visually explore, cleanse, and validate data in Cloud Storage. It also provides features such as data profiling, data quality, data transformation, and data lineage. Dataprep is integrated with BigQuery, so users can easily export the prepared data to BigQuery for further analysis or modeling. This makes Dataprep a suitable solution for the data science team to quickly and easily work with the data in Cloud Storage, without having to write code or manage infrastructure.
The other options are not as suitable as Dataprep for this use case, because they either require more coding, more infrastructure management, or more data movement. Loading the data into BigQuery, either directly or through Dataflow, would incur additional cost and latency, and may not provide the same level of data exploration and validation as Dataprep. Creating an external table in BigQuery would allow users to query the data in Cloud Storage, but would not provide the same level of data cleansing and transformation as Dataprep.
Reference: Dataprep overview
Dataprep features
Dataprep and BigQuery integration
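For contrast with option D, the following is a minimal sketch of defining a BigQuery external table over Cloud Storage files with the Python client library; the project, dataset, table, and bucket names are hypothetical. It shows why that option only provides SQL access to the raw files, not visual cleansing.
```python
# Sketch of option D: an external table lets the team query raw Cloud Storage
# files with SQL, but offers no visual exploration or cleansing like Dataprep.
# All project, dataset, table, and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-data-bucket/raw/*.csv"]
external_config.autodetect = True  # infer the schema from the files

table = bigquery.Table("my-project.staging.raw_files")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```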
You are using Cloud Bigtable to persist and serve stock market data for each of the major indices. To serve the trading application, you need to access only the most recent stock prices that are streaming in.
How should you design your row key and tables to ensure that you can access the data with the simplest query?
- A . Create one unique table for all of the indices, and then use the index and timestamp as the row key design
- B . Create one unique table for all of the indices, and then use a reverse timestamp as the row key design.
- C . For each index, have a separate table and use a timestamp as the row key design
- D . For each index, have a separate table and use a reverse timestamp as the row key design
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks.
What should you do?
- A . Run a local version of Jupyter on the laptop.
- B . Grant the user access to Google Cloud Shell.
- C . Host a visualization tool on a VM on Google Compute Engine.
- D . Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
Which Java SDK class can you use to run your Dataflow programs locally?
- A . LocalRunner
- B . DirectPipelineRunner
- C . MachineRunner
- D . LocalPipelineRunner
B
Explanation:
DirectPipelineRunner executes the operations in the pipeline directly on the local machine, without any optimization. It is useful for small local runs and tests.
Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner
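The legacy Dataflow SDKs have since been folded into Apache Beam, where the equivalent local runner is simply called DirectRunner. As a rough illustration of the same idea, here is a minimal sketch using the Apache Beam Python SDK (the element values are arbitrary):
```python
# Minimal local run with Beam's DirectRunner, the successor to
# DirectPipelineRunner: the pipeline executes in-process, which is
# convenient for small tests before deploying to Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3])
        | "Square" >> beam.Map(lambda x: x * x)
        | "Print" >> beam.Map(print)
    )
```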
You are deploying a batch pipeline in Dataflow. This pipeline reads data from Cloud Storage, transforms the data, and then writes the data into BigQuery. The security team has enabled an organizational constraint in Google Cloud, requiring all Compute Engine instances to use only internal IP addresses and no external IP addresses.
What should you do?
- A . Ensure that the firewall rules allow access to Cloud Storage and BigQuery. Use Dataflow with only internal IPs.
- B . Ensure that your workers have network tags to access Cloud Storage and BigQuery. Use Dataflow with only internal IP addresses.
- C . Create a VPC Service Controls perimeter that contains the VPC network and add Dataflow, Cloud Storage, and BigQuery as allowed services in the perimeter. Use Dataflow with only internal IP addresses.
- D . Ensure that Private Google Access is enabled in the subnetwork. Use Dataflow with only internal IP addresses.
D
Explanation:
To deploy a batch pipeline in Dataflow that adheres to the organizational constraint of using only internal IP addresses, ensuring Private Google Access is the most effective solution.
Here’s why option D is the best choice:
Private Google Access:
Private Google Access allows resources in a VPC network that do not have external IP addresses to access Google APIs and services through internal IP addresses.
This ensures compliance with the organizational constraint of using only internal IPs while allowing Dataflow to access Cloud Storage and BigQuery.
Dataflow with Internal IPs:
Dataflow can be configured to use only internal IP addresses for its worker nodes, ensuring that no external IP addresses are assigned.
This configuration ensures secure and compliant communication between Dataflow, Cloud Storage, and BigQuery.
Firewall and Network Configuration:
Enabling Private Google Access also requires the correct firewall rules and network configuration so that internal traffic can reach Google Cloud services.
Steps to Implement:
Enable Private Google Access:
Enable Private Google Access on the subnetwork used by the Dataflow pipeline:
gcloud compute networks subnets update [SUBNET_NAME] \
  --region [REGION] \
  --enable-private-ip-google-access
Configure Dataflow:
Configure the Dataflow job to use only internal IP addresses:
gcloud dataflow jobs run [JOB_NAME] \
  --gcs-location [TEMPLATE_PATH] \
  --region [REGION] \
  --network [VPC_NETWORK] \
  --subnetwork [SUBNETWORK] \
  --disable-public-ips
Verify Access:
Ensure that firewall rules allow the necessary traffic from the Dataflow workers to Cloud Storage and BigQuery using internal IPs.
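For pipelines launched directly from the Apache Beam Python SDK rather than from a template, the same networking constraint can be expressed through pipeline options. A minimal sketch follows; all project, bucket, table, and subnetwork names are hypothetical.
```python
# Sketch: launching the batch pipeline so that workers use internal IPs only.
# Project, bucket, table, and subnetwork names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    subnetwork="regions/us-central1/subnetworks/my-subnet",
    use_public_ips=False,  # workers receive internal IP addresses only
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "ToRow" >> beam.Map(lambda line: {"raw_line": line})
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.raw_lines",
            schema="raw_line:STRING",
        )
    )
```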
Reference:
Private Google Access Documentation
Configuring Dataflow to Use Internal IPs
VPC Firewall Rules
