Practice Free Professional Data Engineer Exam Online Questions
Flowlogistic is rolling out their real-time inventory tracking system. The tracking devices will all send package-tracking messages, which will now go to a single Google Cloud Pub/Sub topic instead of the Apache Kafka cluster. A subscriber application will then process the messages for real-time reporting and store them in Google BigQuery for historical analysis. You want to ensure the package data can be analyzed over time.
Which approach should you take?
- A . Attach the timestamp to each message in the Cloud Pub/Sub subscriber application as it is received.
- B . Attach the timestamp and Package ID to the outbound message from each publisher device as it is sent to Cloud Pub/Sub.
- C . Use the NOW() function in BigQuery to record the event's time.
- D . Use the automatically generated timestamp from Cloud Pub/Sub to order the data.
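For context, the publisher-side approach described in option B can be sketched with the Pub/Sub Python client; the project ID, topic name, and message fields below are hypothetical:

```python
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-project", "package-tracking")


def publish_tracking_event(package_id: str, status: str) -> None:
    """Publish one tracking event, stamping the event time on the device side."""
    payload = json.dumps({"package_id": package_id, "status": status}).encode("utf-8")
    # The package ID and event timestamp travel as message attributes, so the
    # original event time is preserved no matter when the subscriber or the
    # BigQuery load happens downstream.
    future = publisher.publish(
        topic_path,
        payload,
        package_id=package_id,
        event_timestamp=str(time.time()),
    )
    future.result()  # block until Pub/Sub acknowledges the publish
```

Because the device sets the timestamp before publishing, delays or retries in the subscriber do not distort the event time used for historical analysis.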
Which of the following statements about Legacy SQL and Standard SQL is not true?
- A . Standard SQL is the preferred query language for BigQuery.
- B . If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
- C . One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
- D . You need to set a query language for each dataset and the default is Standard SQL.
D
Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released.
In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
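A minimal sketch of the table-name difference with the BigQuery Python client, run against the public shakespeare sample table (the client code is illustrative, not part of the exam question):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Legacy SQL: project-qualified table names use square brackets and a colon.
legacy_sql = "SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 10"
legacy_job = client.query(
    legacy_sql,
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
)

# Standard SQL (the preferred dialect): the same table uses backticks and periods.
standard_sql = "SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10"
standard_job = client.query(standard_sql)  # use_legacy_sql defaults to False here

for row in standard_job.result():
    print(row.word)
```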
As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies.
Which two steps should you take? Choose 2 answers.
- A . Use Cloud Deployment Manager to automate access provision.
- B . Introduce resource hierarchy to leverage access control policy inheritance.
- C . Create distinct groups for various teams, and specify groups in Cloud IAM policies.
- D . Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
- E . For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.
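As an illustration of the group-based sharing in option C, the sketch below grants a Google group read access to a shared BigQuery dataset through the Python client; the dataset ID and group address are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical shared dataset and team group.
dataset = client.get_dataset("my-project.shared_analytics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-team@example.com",
    )
)
dataset.access_entries = entries
# One policy entry covers the whole group instead of one entry per user.
dataset = client.update_dataset(dataset, ["access_entries"])
```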
You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems.
Which solution should you choose?
- A . Cloud Speech-to-Text API
- B . Cloud Natural Language API
- C . Dialogflow Enterprise Edition
- D . Cloud AutoML Natural Language
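For context on what the conversational options provide, the sketch below sends one customer utterance to a Dialogflow agent and reads back the matched intent and parameters; it assumes the google-cloud-dialogflow Python client, and the project and session IDs are placeholders:

```python
from google.cloud import dialogflow


def detect_order_intent(project_id: str, session_id: str, text: str) -> None:
    """Send one customer utterance to a Dialogflow agent and print the result."""
    session_client = dialogflow.SessionsClient()
    session = session_client.session_path(project_id, session_id)

    query_input = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=text, language_code="en")
    )
    response = session_client.detect_intent(
        request={"session": session, "query_input": query_input}
    )

    result = response.query_result
    # The matched intent and extracted parameters (e.g. product, quantity)
    # can then be forwarded to the order backend.
    print("Intent:", result.intent.display_name)
    print("Parameters:", result.parameters)


# Hypothetical project and session identifiers.
detect_order_intent("my-project", "session-123", "order two litres of milk")
```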
You are administering shared BigQuery datasets that contain views used by multiple teams in your organization. The marketing team is concerned about the variability of their monthly BigQuery analytics spend using the on-demand billing model. You need to help the marketing team establish a consistent BigQuery analytics spend each month.
What should you do?
- A . Create a BigQuery Standard pay-as-you-go reservation with a baseline of 0 slots and autoscaling set to 500 for the marketing team, and bill them back accordingly.
- B . Create a BigQuery reservation with a baseline of 500 slots with no autoscaling for the marketing team, and bill them back accordingly.
- C . Establish a BigQuery quota for the marketing team, and limit the maximum number of bytes scanned each day.
- D . Create a BigQuery Enterprise reservation with a baseline of 250 slots and autoscaling set to 500 for the marketing team, and bill them back accordingly.
B
Explanation:
To help the marketing team establish a consistent BigQuery analytics spend each month, you can use BigQuery reservations to allocate dedicated slots for their queries. This provides predictable costs by reserving a fixed amount of compute resources.
BigQuery Reservations:
BigQuery Reservations allow you to purchase dedicated query processing capacity in the form of slots.
By reserving slots, you can control costs and ensure that the marketing team has the necessary resources for their queries without unexpected increases in spending.
Baseline Slots:
Setting a baseline of 500 slots without autoscaling ensures a consistent allocation of resources.
This provides a predictable monthly cost, as the marketing team will be billed for the reserved slots regardless of actual usage.
Billing Back:
The marketing team’s usage can be billed back based on the fixed reservation cost, ensuring budget predictability.
This approach avoids the variability associated with on-demand billing, where costs can fluctuate based on query volume and complexity.
No Autoscaling:
By not enabling autoscaling, you prevent additional costs from being incurred due to temporary increases in query demand.
This fixed reservation ensures that the marketing team only uses the allocated 500 slots, maintaining a consistent monthly spend.
Reference: Google Cloud documentation on BigQuery Reservations, slot reservations, and managing BigQuery costs
Using a fixed reservation of 500 slots provides the marketing team with predictable costs and the necessary resources for their queries without unexpected billing variability.
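A minimal sketch of such a fixed reservation with the BigQuery Reservation API Python client follows; the admin project, location, reservation name, and assignee project are hypothetical, and edition and capacity-commitment details are omitted:

```python
from google.cloud import bigquery_reservation_v1 as reservation_api

client = reservation_api.ReservationServiceClient()
# Hypothetical administration project and location.
parent = "projects/my-admin-project/locations/US"

# A fixed baseline of 500 slots and no autoscaling keeps the monthly cost flat
# regardless of how many queries the marketing team runs.
reservation = reservation_api.Reservation(
    slot_capacity=500,
    ignore_idle_slots=False,
)
created = client.create_reservation(
    request={
        "parent": parent,
        "reservation_id": "marketing",
        "reservation": reservation,
    }
)

# Assign the marketing team's project so its queries draw from the reserved
# slots instead of on-demand (per-byte) capacity.
assignment = reservation_api.Assignment(
    job_type=reservation_api.Assignment.JobType.QUERY,
    assignee="projects/marketing-project",  # hypothetical project
)
client.create_assignment(
    request={"parent": created.name, "assignment": assignment}
)
```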
You are planning to migrate your current on-premises Apache Hadoop deployment to the cloud. You need to ensure that the deployment is as fault-tolerant and cost-effective as possible for long-running batch jobs. You want to use a managed service.
What should you do?
- A . Deploy a Cloud Dataproc cluster. Use a standard persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
- B . Deploy a Cloud Dataproc cluster. Use an SSD persistent disk and 50% preemptible workers. Store data in Cloud Storage, and change references in scripts from hdfs:// to gs://
- C . Install Hadoop and Spark on a 10-node Compute Engine instance group with standard instances. Install the Cloud Storage connector, and store the data in Cloud Storage. Change references in scripts from hdfs:// to gs://
- D . Install Hadoop and Spark on a 10-node Compute Engine instance group with preemptible instances. Store data in HDFS. Change references in scripts from hdfs:// to gs://
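For reference, the managed-Dataproc approach in options A and B might look like the sketch below, using the Dataproc Python client with standard persistent disks and preemptible secondary workers; the project, region, and machine types are placeholders:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # hypothetical region
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# Standard persistent disks keep storage cost down, and secondary workers are
# preemptible by default, giving roughly half the workers as preemptible VMs.
cluster = {
    "project_id": "my-project",  # hypothetical project
    "cluster_name": "batch-hadoop",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 500},
        },
        "secondary_worker_config": {"num_instances": 2},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready; jobs then read gs:// paths
```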
Which methods can be used to reduce the number of rows processed by BigQuery?
- A . Splitting tables into multiple tables; putting data in partitions
- B . Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
- C . Putting data in partitions; using the LIMIT clause
- D . Splitting tables into multiple tables; using the LIMIT clause
A
Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, provided your data can be separated by day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
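A quick way to see the effect of partition pruning is a dry-run query against a date-partitioned table using the BigQuery Python client; the table and column names below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Filtering on the partitioning column lets BigQuery prune partitions, so only
# the matching day is scanned. A LIMIT clause alone would not reduce the bytes
# processed.
sql = """
    SELECT package_id, status
    FROM `my-project.logistics.events`
    WHERE DATE(event_timestamp) = '2024-01-15'
"""

dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=dry_run)
print(f"Bytes that would be processed: {job.total_bytes_processed}")
```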
