Practice Free Professional Data Engineer Exam Online Questions
You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables.
What should you do?
- A . Make a call to the Stackdriver API to list all logs, and apply an advanced filter.
- B . In the Stackdriver logging admin interface, enable a log sink export to BigQuery.
- C . In the Stackdriver logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.
- D . Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.
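For context, a project sink with an advanced log filter exporting to Pub/Sub (the approach described in option D) can be created programmatically. The sketch below uses the google-cloud-logging Python client; the project, topic, dataset, and table names are hypothetical, and the exact audit-log field paths depend on the job type and log format, so they should be verified against your own BigQuery audit-log entries.

```python
# Minimal sketch (hypothetical names; legacy BigQuery AuditData field paths,
# assuming appends arrive via load jobs -- verify against your own log entries).
from google.cloud import logging

client = logging.Client(project="my-project")

# Advanced filter: BigQuery audit entries for completed jobs writing to one table.
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted" '
    'AND protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration'
    '.load.destinationTable.tableId="orders"'
)

sink = client.sink(
    "bq-orders-inserts",
    filter_=log_filter,
    destination="pubsub.googleapis.com/projects/my-project/topics/bq-insert-alerts",
)
sink.create()  # the sink's writer identity must be granted publish access on the topic
print("Created sink:", sink.name)
```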
You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest available configurations from a container registry. Your GKE nodes need to have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently provision the data processing infrastructure and manage the deployment process.
What should you do?
- A . Use Compute Engine startup scripts to pull container images, and use gcloud commands to provision the infrastructure.
- B . Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure.
- C . Use Cloud Build to schedule a job that uses Terraform to provision the infrastructure and launch with the most current container images.
- D . Use Dataflow to provision the data pipeline, and use Cloud Scheduler to run the job.
C
Explanation:
https://cloud.google.com/architecture/managing-infrastructure-as-code
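As an illustration of the Cloud Build plus Terraform pattern referenced above, the sketch below submits a one-off build whose steps run Terraform from a public container image. This is not the architecture guide's exact pipeline: the repository URL and project name are hypothetical, and a real setup would typically trigger the build from source control or Cloud Scheduler.

```python
# Minimal sketch: submit a Cloud Build job that clones hypothetical Terraform
# configs and applies them, so infrastructure is provisioned as code.
from google.cloud.devtools import cloudbuild_v1

client = cloudbuild_v1.CloudBuildClient()

build = cloudbuild_v1.Build(
    steps=[
        # Fetch the Terraform configuration (hypothetical repository).
        cloudbuild_v1.BuildStep(
            name="gcr.io/cloud-builders/git",
            args=["clone", "https://github.com/example/terraform-gke-config.git", "."],
        ),
        # Initialize and apply the configuration from the public Terraform image.
        cloudbuild_v1.BuildStep(name="hashicorp/terraform:1.7", args=["init"]),
        cloudbuild_v1.BuildStep(name="hashicorp/terraform:1.7",
                                args=["apply", "-auto-approve"]),
    ],
)

operation = client.create_build(project_id="my-project", build=build)
print("Build finished with status:", operation.result().status)
```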
You need to choose a database for a new project that has the following requirements:
Fully managed
Able to automatically scale up
Transactionally consistent
Able to scale up to 6 TB
Able to be queried using SQL
Which database do you choose?
- A . Cloud SQL
- B . Cloud Bigtable
- C . Cloud Spanner
- D . Cloud Datastore
Which of the following are examples of hyperparameters? (Select 2 answers.)
- A . Number of hidden layers
- B . Number of nodes in each hidden layer
- C . Biases
- D . Weights
AB
Explanation:
If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many "hidden" layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all. They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.
Reference: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
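As a concrete illustration of the distinction above, the short sketch below (a hypothetical scikit-learn example, not part of the exam material) sets the number of hidden layers and nodes per layer as hyperparameters before training, while the weights and biases are learned during training.

```python
# Minimal sketch: hyperparameters are chosen before training; weights and
# biases are parameters that the training process adjusts.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hyperparameters: two hidden layers with 32 and 16 nodes (answers A and B).
model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300, random_state=0)
model.fit(X, y)

# Parameters: weights (coefs_) and biases (intercepts_) were learned from the
# data during training, so they are not hyperparameters (answers C and D).
print([w.shape for w in model.coefs_])       # weight matrices per layer
print([b.shape for b in model.intercepts_])  # bias vectors per layer
```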
You want to rebuild your batch pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over twelve hours to run. To expedite development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage.
How should you build the pipeline on Google Cloud while meeting speed and processing requirements?
- A . Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
- B . Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
- C . Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
- D . Use the Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.
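To make the BigQuery-based approach described in option C concrete, here is a minimal sketch using the google-cloud-bigquery client. All project, dataset, table, bucket, and column names are hypothetical, and the transformation query is only a placeholder for the converted PySpark logic.

```python
# Minimal sketch: batch-load raw files from Cloud Storage into BigQuery,
# then run a SQL transformation that writes its result to a new table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# 1. Ingest raw CSV files from Cloud Storage into a staging table.
load_job = client.load_table_from_uri(
    "gs://my-raw-bucket/sales/*.csv",            # hypothetical bucket/path
    "my-project.staging.raw_sales",              # hypothetical staging table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()  # wait for the load to finish

# 2. Express the former PySpark transformation in SQL and materialize it.
transform_sql = """
    SELECT customer_id, DATE(order_ts) AS order_date, SUM(amount) AS daily_total
    FROM `my-project.staging.raw_sales`
    GROUP BY customer_id, order_date
"""
query_job = client.query(
    transform_sql,
    job_config=bigquery.QueryJobConfig(
        destination="my-project.analytics.daily_sales",  # the new table
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
query_job.result()
```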
You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?
- A . Restrict the Google Cloud Storage bucket so only you can see the files
- B . Grant the Project Owner role to a service account, and run the job with it
- C . Use a service account with the ability to read the batch files and to write to BigQuery
- D . Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery
You want to create a machine learning model using BigQuery ML and create an endpoint for hosting the model using Vertex AI. This will enable the processing of continuous streaming data in near real time from multiple vendors. The data may contain invalid values.
What should you do?
- A . Create a new BigQuery dataset and use streaming inserts to land the data from multiple vendors. Configure your BigQuery ML model to use the "ingestion" dataset as the training data.
- B . Use BigQuery streaming inserts to land the data from multiple vendors where your BigQuery dataset ML model is deployed.
- C . Create a Pub/Sub topic and send all vendor data to it. Connect a Cloud Function to the topic to process the data and store it in BigQuery.
- D . Create a Pub/Sub topic and send all vendor data to it. Use Dataflow to process and sanitize the Pub/Sub data and stream it to BigQuery.
D
Explanation:
Dataflow provides a scalable and flexible way to process and clean the incoming data in real-time before loading it into BigQuery.
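A minimal Apache Beam (Python SDK) sketch of this pattern is shown below. The topic, table, schema, and sanitization rules are hypothetical stand-ins for the vendor data described in the question.

```python
# Minimal sketch: read vendor messages from Pub/Sub, drop invalid values,
# and stream the cleaned rows into BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_and_sanitize(message):
    """Parse a JSON message and keep only rows with a valid, non-negative amount."""
    row = json.loads(message.decode("utf-8"))
    if row.get("vendor_id") and isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0:
        yield {"vendor_id": row["vendor_id"], "amount": row["amount"]}


options = PipelineOptions(streaming=True)  # plus project/region/runner flags as needed

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/vendor-data")
        | "Sanitize" >> beam.FlatMap(parse_and_sanitize)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:vendors.clean_events",
            schema="vendor_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```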
You are administering a BigQuery on-demand environment. Your business intelligence tool is submitting hundreds of queries each day that aggregate a large (50 TB) sales history fact table at the day and month levels. These queries have a slow response time and are exceeding cost expectations. You need to decrease response time, lower query costs, and minimize maintenance.
What should you do?
- A . Build materialized views on top of the sales table to aggregate data at the day and month level.
- B . Build authorized views on top of the sales table to aggregate data at the day and month level.
- C . Enable BI Engine and add your sales table as a preferred table.
- D . Create a scheduled query to build sales day and sales month aggregate tables on an hourly basis.
A
Explanation:
To improve response times and reduce costs for frequent queries aggregating a large sales history fact table, materialized views are a highly effective solution. Here’s why option A is the best choice:
Materialized Views:
Materialized views store the results of a query physically and update them periodically, offering faster query responses for frequently accessed data.
They are designed to improve performance for repetitive and expensive aggregation queries by precomputing the results.
Efficiency and Cost Reduction:
By building materialized views at the day and month level, you significantly reduce the computation required for each query, leading to faster response times and lower query costs.
Materialized views also reduce the need for on-demand query execution, which can be costly when dealing with large datasets.
Minimized Maintenance:
Materialized views in BigQuery are managed automatically, with updates handled by the system, reducing the maintenance burden on your team.
Steps to Implement:
Identify Aggregation Queries:
Analyze the existing queries to identify common aggregation patterns at the day and month levels.
Create Materialized Views:
Create materialized views in BigQuery for the identified aggregation patterns. For example:
CREATE MATERIALIZED VIEW `project.dataset.sales_daily_summary` AS
SELECT
  DATE(transaction_time) AS day,
  SUM(amount) AS total_sales
FROM
  `project.dataset.sales`
GROUP BY
  day;
CREATE MATERIALIZED VIEW `project.dataset.sales_monthly_summary` AS
SELECT
  EXTRACT(YEAR FROM transaction_time) AS year,
  EXTRACT(MONTH FROM transaction_time) AS month,
  SUM(amount) AS total_sales
FROM
  `project.dataset.sales`
GROUP BY
  year, month;
Query Using Materialized Views:
Update existing queries to use the materialized views instead of directly querying the base table.
Reference:
BigQuery Materialized Views
Optimizing Query Performance
Which of these statements about exporting data from BigQuery is false?
- A . To export more than 1 GB of data, you need to put a wildcard in the destination filename.
- B . The only supported export destination is Google Cloud Storage.
- C . Data can only be exported in JSON or Avro format.
- D . The only compression option available is GZIP.
C
Explanation:
Data can be exported in CSV, JSON, or Avro format. If you are exporting nested or repeated data, then CSV format is not supported.
Reference: https://cloud.google.com/bigquery/docs/exporting-data
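For context, the sketch below uses the google-cloud-bigquery Python client to export a table to Cloud Storage as GZIP-compressed CSV. The project, table, and bucket names are hypothetical, and the wildcard in the destination URI is what allows exports larger than 1 GB to be split across multiple files.

```python
# Minimal sketch: export a BigQuery table to Cloud Storage as GZIP-compressed CSV.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

extract_job = client.extract_table(
    "my-project.sales.transactions",                 # hypothetical source table
    "gs://my-export-bucket/transactions-*.csv.gz",   # wildcard lets >1 GB exports shard
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.CSV,
        compression=bigquery.Compression.GZIP,
    ),
)
extract_job.result()  # wait for the export to complete
```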
