Practice Free Professional Data Engineer Exam Online Questions
How can you get a neural network to learn about relationships between categories in a categorical feature?
- A . Create a multi-hot column
- B . Create a one-hot column
- C . Create a hash bucket
- D . Create an embedding column
D
Explanation:
There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn’t encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing the categorical feature with an embedding column. The idea is that each category gets a much smaller vector with, say, 5 values in it. Unlike a one-hot vector, where almost every value is 0, the values in an embedding vector are weights, similar to the weights used for ordinary features in a neural network; the difference is that each category has its own set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
Reference: https://cloudacademy.com/google/introduction-to-google-cloud-machine-learning-engine-course/a-wide-and-deep-model.html
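For illustration, here is a minimal Keras sketch of an embedding column for a categorical feature. The vocabulary and the 5-dimensional embedding size are hypothetical, chosen to mirror the explanation above.

```python
import tensorflow as tf

# Hypothetical vocabulary for a categorical feature.
categories = ["red", "green", "blue", "purple"]

# Map category strings to integer ids, then to learned 5-value embedding vectors.
lookup = tf.keras.layers.StringLookup(vocabulary=categories)
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(),  # = len(categories) + 1 OOV token
    output_dim=5,                        # 5 weights per category, as in the text
)

ids = lookup(tf.constant(["red", "purple"]))
vectors = embedding(ids)  # shape (2, 5); the weights are learned during training
print(vectors.numpy())
```

After training, categories that behave similarly end up with similar embedding vectors, which is exactly the relationship a one-hot encoding cannot express.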
What are two of the benefits of using denormalized data structures in BigQuery?
- A . Reduces the amount of data processed, reduces the amount of storage required
- B . Increases query speed, makes queries simpler
- C . Reduces the amount of storage required, increases query speed
- D . Reduces the amount of data processed, increases query speed
B
Explanation:
Denormalization increases query speed for tables with billions of rows because BigQuery’s performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don’t have to use JOINs, since all of the data has been combined into one table.
Denormalization also makes queries simpler because you do not have to use JOIN clauses.
Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.
Reference: https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
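As a sketch of what denormalizing looks like in practice (the dataset, table, and column names below are hypothetical), the following query materializes a pre-joined table so later queries need no JOIN, at the cost of storing the customer columns redundantly on every order row:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# Hypothetical tables: pre-join customers onto orders once, so analytical
# queries can read a single wide table instead of JOINing two large tables.
sql = """
CREATE OR REPLACE TABLE mydataset.orders_denormalized AS
SELECT
  o.order_id,
  o.order_date,
  o.amount,
  c.customer_id,
  c.name   AS customer_name,
  c.region AS customer_region
FROM mydataset.orders AS o
JOIN mydataset.customers AS c
  USING (customer_id)
"""
client.query(sql).result()  # wait for the CTAS job to finish
```

Queries against orders_denormalized are simpler (no JOIN clause) and typically faster on very large tables, but the repeated customer columns mean more bytes stored and processed, which is why the other options are wrong.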
To give a user read permission for only the first three columns of a table, which access control method would you use?
- A . Primitive role
- B . Predefined role
- C . Authorized view
- D . It’s not possible to give access to only the first three columns of a table.
C
Explanation:
An authorized view allows you to share query results with particular users and groups without giving them read access to the underlying tables. Authorized views can only be created in a dataset that does not contain the tables queried by the view.
When you create an authorized view, you use the view’s SQL query to restrict access to only the rows and columns you want the users to see.
Reference: https://cloud.google.com/bigquery/docs/views#authorized-views
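A minimal sketch with the BigQuery Python client (the project, dataset, and column names are hypothetical): create a view in a separate dataset that selects only the three permitted columns, then authorize that view against the source dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials

# Hypothetical view exposing only the first three columns of the source table.
view = bigquery.Table("my-project.shared_views.customers_limited")
view.view_query = """
SELECT customer_id, name, email   -- only the columns users may read
FROM `my-project.private_data.customers`
"""
view = client.create_table(view)

# Authorize the view against the source dataset so the view can query the
# table even though its users have no access to private_data itself.
source = client.get_dataset("my-project.private_data")
entries = list(source.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source.access_entries = entries
client.update_dataset(source, ["access_entries"])
```

Granting users the BigQuery Data Viewer role on the shared_views dataset (but not on private_data) then lets them read only those three columns.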
You have one BigQuery dataset which includes customers’ street addresses. You want to retrieve all occurrences of street addresses from the dataset.
What should you do?
- A . Create a deep inspection job on each table in your dataset with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType.
- B . Create a de-identification job in Cloud Data Loss Prevention and use the masking transformation.
- C . Write a SQL query in BigQuery by using REGEXP_CONTAINS on all tables in your dataset to find rows where the word "street" appears.
- D . Create a discovery scan configuration on your organization with Cloud Data Loss Prevention and create an inspection template that includes the STREET_ADDRESS infoType.
A
Explanation:
To retrieve all occurrences of street addresses from a BigQuery dataset, the most effective and comprehensive method is to use Cloud Data Loss Prevention (DLP). Here’s why option A is the best choice:
Cloud Data Loss Prevention (DLP): Cloud DLP is designed to discover, classify, and protect sensitive information. It includes predefined infoTypes for many kinds of sensitive data, including street addresses, and detects them using advanced pattern recognition and contextual analysis.
Deep inspection job: a deep inspection job scans entire tables for sensitive information. An inspection template that includes the STREET_ADDRESS infoType ensures that all instances of street addresses are detected across your dataset.
Scalability and accuracy: Cloud DLP handles large datasets efficiently and identifies sensitive data with high accuracy, reducing the risk of missing any occurrences.
Steps to implement:
Set up Cloud DLP: enable the Cloud DLP API in your Google Cloud project.
Create an inspection template: create an inspection template in Cloud DLP that includes the STREET_ADDRESS infoType.
Run deep inspection jobs: create and run a deep inspection job for each table in your dataset using the inspection template, then review the job results to retrieve all occurrences of street addresses (a minimal sketch of this step appears after the references below).
Reference:
Cloud DLP Documentation
Creating Inspection Jobs
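As a hedged sketch of the last step with the DLP Python client (the project, dataset, and table IDs are hypothetical), the job below inspects one BigQuery table for the STREET_ADDRESS infoType and saves the findings to another BigQuery table:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project

inspect_job = {
    "storage_config": {
        "big_query_options": {
            "table_reference": {
                "project_id": "my-project",
                "dataset_id": "customer_data",   # hypothetical dataset/table
                "table_id": "customers",
            }
        }
    },
    "inspect_config": {
        "info_types": [{"name": "STREET_ADDRESS"}],
        "include_quote": True,  # return the matched text in each finding
    },
    "actions": [
        {
            "save_findings": {
                "output_config": {
                    "table": {
                        "project_id": "my-project",
                        "dataset_id": "dlp_results",
                        "table_id": "street_address_findings",
                    }
                }
            }
        }
    ],
}

job = dlp.create_dlp_job(request={"parent": parent, "inspect_job": inspect_job})
print("Started inspection job:", job.name)
```

Once the job completes, the findings table contains one row per detected street address, including the quoted match because include_quote is enabled.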
You have a streaming pipeline that ingests data from Pub/Sub in production. You need to update this streaming pipeline with improved business logic. You need to ensure that the updated pipeline reprocesses the previous two days of delivered Pub/Sub messages.
What should you do? Choose 2 answers
- A . Use Pub/Sub Seek with a timestamp.
- B . Use the Pub/Sub subscription clear-retry-policy flag.
- C . Create a new Pub/Sub subscription two days before the deployment.
- D . Use the Pub/Sub subscription retain-acked-messages flag.
- E . Use Pub/Sub Snapshot capture two days before the deployment.
A, E
Explanation:
To update a streaming pipeline with improved business logic and reprocess the previous two days of delivered Pub/Sub messages, you should use Pub/Sub Seek with a timestamp and Pub/Sub Snapshot capture two days before the deployment. Pub/Sub Seek allows you to replay or purge messages in a subscription based on a time or a snapshot. Pub/Sub Snapshot allows you to capture the state of a subscription at a given point in time and replay messages from that point. By using these features, you can ensure that the updated pipeline can process the messages that were delivered in the past two days without losing any data.
Reference: Pub/Sub Seek
Pub/Sub Snapshot
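A sketch of both mechanisms with the Pub/Sub Python client (the project and subscription names are hypothetical); note that seeking to a timestamp additionally requires the subscription to retain acknowledged messages for at least the replay window.

```python
import datetime
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "my-sub")  # hypothetical

# Option E: before deploying, capture a snapshot of the subscription's state.
snapshot = subscriber.snapshot_path("my-project", "pre-deploy-snapshot")
subscriber.create_snapshot(request={"name": snapshot, "subscription": subscription})

# After deploying the updated pipeline, replay everything from the snapshot.
subscriber.seek(request={"subscription": subscription, "snapshot": snapshot})

# Option A: seek directly to a timestamp two days in the past
# (requires retain-acked-messages and a retention window covering two days).
two_days_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=2)
subscriber.seek(request={"subscription": subscription, "time": two_days_ago})
```

The snapshot approach (E) pins the replay point explicitly, while the timestamp approach (A) depends on the subscription's retention settings covering the last two days.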
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read.
What should you do?
- A . Specify the TableReference object in the code.
- B . Use .fromQuery operation to read specific fields from the table.
- C . Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
- D . Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
- A . Dataproc Worker
- B . Dataproc Viewer
- C . Dataproc Runner
- D . Dataproc Editor
A
Explanation:
Service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).
Reference: https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes
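For illustration, a sketch that grants this role at the project level with the Resource Manager Python client (the project and service-account names are hypothetical):

```python
from google.cloud import resourcemanager_v3
from google.iam.v1 import iam_policy_pb2, policy_pb2

client = resourcemanager_v3.ProjectsClient()
resource = "projects/my-project"  # hypothetical project

# Read-modify-write the project IAM policy to add roles/dataproc.worker
# for the service account used by the cluster's VMs.
policy = client.get_iam_policy(
    request=iam_policy_pb2.GetIamPolicyRequest(resource=resource)
)
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/dataproc.worker",
        members=["serviceAccount:dataproc-vm-sa@my-project.iam.gserviceaccount.com"],
    )
)
client.set_iam_policy(
    request=iam_policy_pb2.SetIamPolicyRequest(resource=resource, policy=policy)
)
```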
You are designing a cloud-native historical data processing system to meet the following conditions:
The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
A streaming data pipeline stores new data daily.
Performance is not a factor in the solution.
The solution design should maximize availability.
How should you design data storage for this solution?
- A . Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
- B . Store the data in BigQuery. Access the data using the BigQuery Connector or Cloud Dataproc and Compute Engine.
- C . Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
- D . Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
