Practice Free Professional Data Engineer Exam Online Questions
You have a BigQuery table that contains customer data, including sensitive information such as names and addresses. You need to share the customer data with your data analytics and consumer support teams securely. The data analytics team needs to access the data of all the customers, but must not be able to access the sensitive data. The consumer support team needs access to all data columns, but must not be able to access customers that no longer have active contracts. You enforced these requirements by using an authorized dataset and policy tags. After implementing these steps, the data analytics team reports that they still have access to the sensitive columns. You need to ensure that the data analytics team does not have access to restricted data.
What should you do? Choose 2 answers
- A . Create two separate authorized datasets; one for the data analytics team and another for the consumer support team.
- B . Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags.
- C . Enforce access control in the policy tag taxonomy.
- D . Remove the bigquery.dataViewer role from the data analytics team on the authorized datasets.
- E . Replace the authorized dataset with an authorized view. Use row-level security and apply a filter_expression to limit data access.
B, C
Explanation:
To ensure that the data analytics team does not have access to sensitive columns, you should:
B. Ensure that the data analytics team members do not have the Data Catalog Fine-Grained Reader role for the policy tags. This role is what grants a principal the ability to read column data protected by policy tags, so removing it from the data analytics team blocks their access to the sensitive columns.
C. Enforce access control in the policy tag taxonomy. Policy tags do not restrict anything until access control is enforced on their taxonomy; once enforced, only principals holding the Fine-Grained Reader role on a tag can read the columns it protects, so the sensitive columns become inaccessible to the data analytics team.
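As a rough illustration, the sketch below (using hypothetical project, dataset, table, and column names) shows what a data analytics team member should experience once the taxonomy is enforced and the Fine-Grained Reader role is removed: non-sensitive columns remain queryable, while policy-tagged columns return an access-denied error.

```python
# A minimal sketch, assuming hypothetical project/dataset/table names, of the
# behavior to expect once the policy tag taxonomy is enforced and the user
# lacks the Data Catalog Fine-Grained Reader role.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project ID

# Non-sensitive columns are still readable through the authorized dataset.
ok_sql = "SELECT customer_id, contract_status FROM `my-project.crm.customers`"
for row in client.query(ok_sql).result():
    print(row.customer_id, row.contract_status)

# Policy-tagged columns now fail with an Access Denied error that names the
# policy tag; this is the desired outcome for the data analytics team.
denied_sql = "SELECT name, address FROM `my-project.crm.customers`"
try:
    client.query(denied_sql).result()
except Exception as exc:  # google.api_core.exceptions.Forbidden in practice
    print(f"Expected failure: {exc}")
```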
Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?
- A . A sequential numeric ID
- B . A timestamp followed by a stock symbol
- C . A non-sequential numeric ID
- D . A stock symbol followed by a timestamp
A, B
Explanation:
…using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series includes a timestamp, all of your writes will target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application’s users. You might be tempted to use the user’s numeric ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes. [https://cloud.google.com/bigtable/docs/schema-design]
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
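The sketch below (hypothetical project, instance, table, and column family names) contrasts the hotspotting, timestamp-first key from answer B with the field-promotion layout from answer D, where the stock symbol comes first and writes for different symbols spread across the keyspace.

```python
# A minimal sketch, assuming hypothetical project/instance/table names,
# contrasting a hotspotting row key with a well-distributed one.
from datetime import datetime, timezone
from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # assumed project ID
table = client.instance("my-instance").table("stock-prices")  # assumed names

ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

# Anti-pattern (answer B): timestamp first. Concurrent writes share the same
# key prefix, so they all land on one node and create a hot spot.
hot_key = f"{ts}#GOOG".encode()

# Field promotion (answer D's layout): symbol first, timestamp second. Keys
# for different symbols spread across the keyspace, distributing writes.
balanced_key = f"GOOG#{ts}".encode()

row = table.direct_row(balanced_key)
row.set_cell("prices", "close", b"172.35")  # assumed column family "prices"
row.commit()
```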
Which of the following job types are supported by Cloud Dataproc (select 3 answers)?
- A . Hive
- B . Pig
- C . YARN
- D . Spark
A, B, D
Explanation:
Cloud Dataproc provides out-of-the-box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.
Reference: https://cloud.google.com/dataproc/docs/resources/faq#what_type_of_jobs_can_i_run
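As a sketch, the snippet below (hypothetical project, cluster, and bucket names) submits two of these supported job types, a Hive job and a PySpark job, through the Dataproc jobs API.

```python
# A minimal sketch, assuming hypothetical project/cluster/bucket names,
# submitting a Hive job and a PySpark job to an existing Dataproc cluster.
from google.cloud import dataproc_v1

region = "us-central1"  # assumed region
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

hive_job = {
    "placement": {"cluster_name": "my-cluster"},  # assumed cluster name
    "hive_job": {"query_list": {"queries": ["SHOW TABLES;"]}},
}
pyspark_job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/wordcount.py"},
}

for job in (hive_job, pyspark_job):
    operation = client.submit_job_as_operation(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    print("Finished job:", operation.result().reference.job_id)
```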
You work for a large real estate firm and are preparing 6 TB of home sales data to be used for machine learning. You will use SQL to transform the data and use BigQuery ML to create a machine learning model. You plan to use the model for predictions against a raw dataset that has not been transformed.
How should you set up your workflow in order to prevent skew at prediction time?
- A . When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
- B . When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. Before requesting predictions, use a saved query to transform your raw input data, and then use ML.EVALUATE.
- C . Use a BigQuery view to define your preprocessing logic. When creating your model, use the view as your model training data. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
- D . Preprocess all data using Dataflow. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any further transformations on the input data.
A
Explanation:
https://cloud.google.com/bigquery-ml/docs/bigqueryml-transform Using the TRANSFORM clause, you can specify all preprocessing during model creation. The preprocessing is automatically applied during the prediction and evaluation phases of machine learning.
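A minimal sketch of this pattern follows; the project, dataset, table, and column names are hypothetical, and the specific scaler and bucketizer functions are illustrative choices. The key point is that the transformations are stored inside the model, so raw rows can be passed straight to prediction without re-implementing the preprocessing.

```python
# A minimal sketch, assuming hypothetical dataset/table/column names, of the
# BigQuery ML TRANSFORM clause preventing training/serving skew.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project ID

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.sales.price_model`
  TRANSFORM(
    ML.STANDARD_SCALER(square_feet) OVER () AS square_feet_scaled,
    ML.QUANTILE_BUCKETIZE(year_built, 4) OVER () AS year_built_bucket,
    sale_price  -- label column passes through untransformed
  )
  OPTIONS(model_type = 'linear_reg', input_label_cols = ['sale_price'])
AS
SELECT square_feet, year_built, sale_price
FROM `my-project.sales.home_sales`
"""
client.query(create_model_sql).result()

# Raw, untransformed rows go straight into prediction; BigQuery re-applies
# the saved TRANSFORM automatically.
predict_sql = """
SELECT * FROM ML.PREDICT(
  MODEL `my-project.sales.price_model`,
  (SELECT square_feet, year_built FROM `my-project.sales.new_listings`))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```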
Why do you need to split a machine learning dataset into training data and test data?
- A . So you can try two different sets of features
- B . To make sure your model is generalized for more than just the training data
- C . To allow you to create unit tests in your code
- D . So you can use one dataset for a wide model and one for a deep model
B
Explanation:
The flaw with evaluating a predictive model on training data is that it does not inform you on how well the model has generalized to new unseen data. A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting.
Reference: https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
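The short sketch below makes this concrete with scikit-learn and synthetic data: an unconstrained decision tree scores near-perfectly on the data it was trained on, while the held-out test set reveals the lower, true generalization accuracy.

```python
# A minimal sketch showing why a held-out test set matters: accuracy on the
# training data overstates how well an overfit model generalizes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree memorizes the training data (overfitting).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # near 1.0
print("test accuracy:", model.score(X_test, y_test))     # noticeably lower
```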
