Practice Free Professional Data Engineer Exam Online Questions
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure.
What should you do?
- A . Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
- B . Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
- C . Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
- D . Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
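For context, current Cloud SQL versions express zone-level high availability as a regional instance with an automatic standby in a second zone of the same region (the successor to the legacy MySQL failover-replica setup). Below is a minimal sketch using the Cloud SQL Admin API; the project, instance name, and machine tier are hypothetical placeholders.

```python
# A minimal sketch (project, instance name, and tier are placeholders).
# Creates a Cloud SQL for MySQL instance with the REGIONAL availability
# type, which keeps a standby in a second zone of the same region.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1beta4")

body = {
    "name": "orders-mysql",                  # hypothetical instance name
    "region": "us-central1",
    "databaseVersion": "MYSQL_8_0",
    "settings": {
        "tier": "db-n1-standard-2",
        "availabilityType": "REGIONAL",      # HA: automatic standby in another zone
        "backupConfiguration": {"enabled": True, "binaryLogEnabled": True},
    },
}

operation = sqladmin.instances().insert(project="my-project", body=body).execute()
print(operation["name"])  # operation name to poll for completion
```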
You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage. The data will be used to create consumer analyses. You need to follow data privacy requirements, including protecting certain sensitive data elements, while also retaining all of the data for potential future use cases.
What should you do?
- A . Use Dataflow and the Cloud Data Loss Prevention API to mask sensitive data. Write the processed data in BigQuery.
- B . Use the Cloud Data Loss Prevention API and Dataflow to detect and remove sensitive fields from the data in Cloud Storage. Write the filtered data in BigQuery.
- C . Use Dataflow and Cloud KMS to encrypt sensitive fields and write the encrypted data in BigQuery. Share the encryption key by following the principle of least privilege.
- D . Use customer-managed encryption keys (CMEK) to directly encrypt the data in Cloud Storage. Use federated queries from BigQuery. Share the encryption key by following the principle of least privilege.
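To illustrate the masking approach in option A, here is a minimal sketch using the Cloud Data Loss Prevention API's Python client; the project ID and sample record are hypothetical, and in a real pipeline this call would run inside a Dataflow transform before the rows are written to BigQuery.

```python
# A minimal sketch of option A's masking step (project ID and the sample
# record are hypothetical).
import google.cloud.dlp_v2

dlp = google.cloud.dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"   # hypothetical project

item = {"value": "Customer email: jane.doe@example.com"}

inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "primitive_transformation": {
                    "character_mask_config": {"masking_character": "#"}
                }
            }
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)  # the email address is replaced by "#" characters
```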
You have terabytes of customer behavioral data streaming from Google Analytics into BigQuery daily. Your customers’ information, such as their preferences, is hosted on a Cloud SQL for MySQL database. Your CRM database is hosted on a Cloud SQL for PostgreSQL instance. The marketing team wants to use your customers’ information from the two databases and the customer behavioral data to create marketing campaigns for yearly active customers. You need to ensure that the marketing team can run the campaigns over 100 times a day on typical days and up to 300 times during sales. At the same time, you want to keep the load on the Cloud SQL databases to a minimum.
What should you do?
- A . Create BigQuery connections to both Cloud SQL databases. Use BigQuery federated queries on the two databases and the Google Analytics data in BigQuery to run these queries.
- B . Create streams in Datastream to replicate the required tables from both Cloud SQL databases to BigQuery for these queries.
- C . Create a Dataproc cluster with Trino to establish connections to both Cloud SQL databases and BigQuery, to execute the queries.
- D . Create a job on Apache Spark with Dataproc Serverless to query both Cloud SQL databases and the Google Analytics data on BigQuery for these queries.
B
Explanation:
Datastream is a serverless change data capture (CDC) and replication service that streams data changes from relational databases such as MySQL, PostgreSQL, and Oracle into Google Cloud destinations such as BigQuery and Cloud Storage. Datastream captures and delivers database changes in near real time, with minimal impact on the performance of the source database. It also preserves the schema and data types of the source, and automatically creates and updates the corresponding tables in BigQuery.
By using Datastream, you can replicate the required tables from both Cloud SQL databases to BigQuery, and keep them in sync with the source databases. This way, you can reduce the load on the Cloud SQL databases, as the marketing team can run their queries on the BigQuery tables instead of the Cloud SQL tables. You can also leverage the scalability and performance of BigQuery to query the customer behavioral data from Google Analytics and the customer information from the replicated tables. You can run the queries as frequently as needed, without worrying about the impact on the Cloud SQL databases.
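As a hedged sketch of what the marketing team's queries could look like once the tables are replicated, the example below joins a Datastream-replicated Cloud SQL table with the Google Analytics data in BigQuery; all project, dataset, table, and column names are hypothetical.

```python
# A minimal sketch of the querying side once Datastream has replicated the
# Cloud SQL tables into BigQuery (dataset/table/column names are placeholders).
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT c.customer_id, c.preferences, SUM(b.page_views) AS total_page_views
FROM `my-project.cloudsql_replica.customers` AS c        -- replicated via Datastream
JOIN `my-project.analytics_export.behavior_daily` AS b   -- Google Analytics data
  ON b.customer_id = c.customer_id
WHERE b.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
GROUP BY c.customer_id, c.preferences
"""

for row in client.query(sql).result():
    print(row.customer_id, row.total_page_views)
```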
Option A is not a good solution, as BigQuery federated queries allow you to query external data sources such as Cloud SQL databases, but they do not reduce the load on the source databases. In fact, federated queries may increase the load on the source databases, as they need to execute the query statements on the external data sources and return the results to BigQuery. Federated queries also have some limitations, such as data type mappings, quotas, and performance issues.
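For contrast, a federated query (option A) looks like the hedged sketch below; the connection ID and table are hypothetical. Every execution runs the inner statement on the Cloud SQL instance itself, which is why the load stays on the source database.

```python
# A sketch of a BigQuery federated query (option A); the connection ID and
# table names are hypothetical. The inner SELECT executes on Cloud SQL on
# every run of the query.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.crm_postgres_connection',
  'SELECT customer_id, preferences FROM customers'
)
"""

results = client.query(sql).result()
```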
Option C is not a good solution, as creating a Dataproc cluster with Trino would require more resources and management overhead than using Datastream. Trino is a distributed SQL query engine that can connect to multiple data sources, such as Cloud SQL and BigQuery, and execute queries across them. However, Trino requires a Dataproc cluster to run, which means you need to provision, configure, and monitor the cluster nodes. You also need to install and configure the Trino connector for Cloud SQL and BigQuery, and write the queries in Trino SQL dialect. Moreover, Trino does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would still be high.
Option D is not a good solution, as creating a job on Apache Spark with Dataproc Serverless would require more coding and processing power than using Datastream. Apache Spark is a distributed data
processing framework that can read and write data from various sources, such as Cloud SQL and BigQuery, and perform complex transformations and analytics on them. Dataproc Serverless is a serverless Spark service that allows you to run Spark jobs without managing clusters. However, Spark requires you to write code in Python, Scala, Java, or R, and use the Spark connector for Cloud SQL and BigQuery to access the data sources. Spark also does not replicate or sync the data from Cloud SQL to BigQuery, so the load on the Cloud SQL databases would still be high.
Reference: Datastream overview | Datastream | Google Cloud; Datastream concepts | Datastream | Google Cloud; Datastream quickstart | Datastream | Google Cloud; Introduction to federated queries | BigQuery | Google Cloud; Trino overview | Dataproc Documentation | Google Cloud; Dataproc Serverless overview | Dataproc Documentation | Google Cloud; Apache Spark overview | Dataproc Documentation | Google Cloud.
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform.
What should you do?
- A . Export the information to Cloud Stackdriver, and set up an Alerting policy
- B . Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
- C . Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
- D . Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs
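As a hedged illustration of option A, the sketch below creates an alerting policy on a Dataflow failure metric with the Cloud Monitoring API; the project ID, metric filter, and notification channel are hypothetical, and BigQuery or Dataproc pipelines could be covered with additional conditions or log-based metrics.

```python
# A hedged sketch of option A: an alerting policy on a Dataflow failure
# metric. Project ID, filter, and notification channel are hypothetical.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_name = "projects/my-project"
client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Dataflow job failed",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="dataflow.googleapis.com/job/is_failed" '
            'AND resource.type="dataflow_job"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0,
        duration=duration_pb2.Duration(seconds=60),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Pipeline health",
    conditions=[condition],
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    notification_channels=[
        "projects/my-project/notificationChannels/1234567890"  # hypothetical channel
    ],
)

created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)
```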
Your neural network model is taking days to train. You want to increase the training speed.
What can you do?
- A . Subsample your test dataset.
- B . Subsample your training dataset.
- C . Increase the number of input features to your model.
- D . Increase the number of layers in your neural network.
B
Explanation:
Subsampling the training dataset reduces the number of examples processed in each epoch, which directly shortens training time (at some cost in model quality). Subsampling the test set (A) does not affect training time, and adding input features (C) or layers (D) increases the computation per step and slows training further.
Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d
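A minimal sketch of option B using tf.data; the synthetic arrays and the 10% sample size are arbitrary placeholders.

```python
# A minimal sketch of subsampling the training set (option B); the synthetic
# data and sample size are placeholders.
import numpy as np
import tensorflow as tf

# Synthetic stand-in for a large training set.
features = np.random.rand(1_000_000, 20).astype("float32")
labels = np.random.randint(0, 2, size=(1_000_000,))

full_train_ds = tf.data.Dataset.from_tensor_slices((features, labels))

sampled_train_ds = (
    full_train_ds
    .shuffle(buffer_size=100_000)
    .take(100_000)                 # train on ~10% of the examples
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(sampled_train_ds, epochs=10)
```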
You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through the AI models.
What should you do?
- A . Use BigQuery to ingest, prepare, and analyze the data, and then run queries to create views
- B . Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets
- C . Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink
- D . Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery
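As a hedged sketch of option C, the pipeline below reads streaming ad events, tags outliers with a simple placeholder rule, and writes the result to BigQuery; the subscription, table, and threshold are hypothetical, and a real job would use a proper statistical or ML-based detector.

```python
# A minimal sketch of option C (placeholders: subscription, table, and the
# threshold rule standing in for real outlier/longtail detection).
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def tag_outliers(record, clicks_threshold=10_000):
    # Hypothetical rule: flag records with unusually high click counts.
    record["is_outlier"] = record.get("clicks", 0) > clicks_threshold
    return record

options = PipelineOptions(streaming=True)  # near-real-time processing

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadAds" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/ads-events")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "TagOutliers" >> beam.Map(tag_outliers)
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:ads.cleansed_events",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )
```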
You have several different unstructured data sources, within your on-premises data center as well as in the cloud. The data is in various formats, such as Apache Parquet and CSV. You want to centralize this data in Cloud Storage. You need to set up an object sink for your data that allows you to use your own encryption keys. You want to use a GUI-based solution.
What should you do?
- A . Use Cloud Data Fusion to move files into Cloud Storage.
- B . Use Storage Transfer Service to move files into Cloud Storage.
- C . Use Dataflow to move files into Cloud Storage.
- D . Use BigQuery Data Transfer Service to move files into BigQuery.
A
Explanation:
To centralize unstructured data from various sources into Cloud Storage using a GUI-based solution while allowing the use of your own encryption keys, Cloud Data Fusion is the most suitable option.
Here’s why:
Cloud Data Fusion:
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps in building and managing ETL pipelines with a visual interface.
It supports a wide range of data sources and formats, including Apache Parquet and CSV, and provides a user-friendly GUI for pipeline creation and management.
Custom Encryption Keys:
Cloud Data Fusion allows the use of customer-managed encryption keys (CMEK) for data encryption, ensuring that your data is securely stored according to your encryption policies.
Centralizing Data:
Cloud Data Fusion simplifies the process of moving data from on-premises and cloud sources into Cloud Storage, providing a centralized repository for your unstructured data.
Steps to Implement:
Set Up Cloud Data Fusion:
Deploy a Cloud Data Fusion instance and configure it to connect to your various data sources.
Create ETL Pipelines:
Use the GUI to create data pipelines that extract data from your sources and load it into Cloud Storage. Configure the pipelines to use your custom encryption keys.
Run and Monitor Pipelines:
Execute the pipelines and monitor their performance and data movement through the Cloud Data Fusion dashboard.
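The CMEK requirement can also be prepared outside the GUI; the hedged sketch below sets a default Cloud KMS key on the destination bucket so that objects written by the pipelines are encrypted with your own key. The bucket name, project, and key resource are hypothetical.

```python
# A minimal sketch (bucket name, project, and KMS key resource are
# hypothetical) that sets a default CMEK key on the destination bucket,
# complementing the GUI-driven pipeline setup described above.
from google.cloud import storage

client = storage.Client(project="my-project")

kms_key = (
    "projects/my-project/locations/us-central1/"
    "keyRings/data-lake-ring/cryptoKeys/landing-bucket-key"
)

bucket = client.bucket("central-landing-bucket")
bucket.default_kms_key_name = kms_key          # new objects default to this CMEK
new_bucket = client.create_bucket(bucket, location="us-central1")
print(new_bucket.default_kms_key_name)
```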
Reference:
Cloud Data Fusion Documentation
Using Customer-Managed Encryption Keys (CMEK)
You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud.
What should you do?
- A . Use Cloud TPUs without any additional adjustment to your code.
- B . Use Cloud TPUs after implementing GPU kernel support for your custom ops.
- C . Use Cloud GPUs after implementing GPU kernel support for your custom ops.
- D . Stay on CPUs, and increase the size of the cluster you’re training your model on.
You are migrating your data warehouse to BigQuery. You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership.
How should you set user permissions?
- A . Assign the users/groups data viewer access at the table level for each table
- B . Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views
- C . Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views
- D . Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside
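A minimal sketch of the authorized-view pattern in option D using the BigQuery Python client; the project, datasets, table, and query are hypothetical, and the per-team dataset is assumed to already exist.

```python
# A minimal sketch of the authorized-view pattern (option D); the project,
# dataset, and team names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

source_dataset = client.get_dataset("my-project.warehouse")   # raw tables
team_dataset_id = "my-project.sales_team_views"               # per-team dataset (pre-created)

# 1. Create the view in the team's dataset.
view = bigquery.Table(f"{team_dataset_id}.orders_view")
view.view_query = "SELECT order_id, amount FROM `my-project.warehouse.orders`"
view = client.create_table(view)

# 2. Authorize the view against the source dataset: data viewer on the raw
#    data is granted to the view itself, not to the users.
entries = list(source_dataset.access_entries)
entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])

# 3. Users/groups are then granted viewer access on the team dataset only,
#    so they can query the views without seeing the underlying tables.
```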
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently, and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data.
Which product should they use to store the data?
- A . Cloud Bigtable
- B . Google BigQuery
- C . Google Cloud Storage
- D . Google Cloud Datastore
A
Explanation:
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series
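As a hedged illustration of the time-series schema-design guidance referenced above, the sketch below writes one data point to Bigtable; the instance, table, column family, and row-key scheme are hypothetical.

```python
# A minimal sketch of writing a time-series point to Bigtable; the instance,
# table, column family, and row-key scheme are hypothetical illustrations.
import datetime
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("timeseries-instance").table("prices")

# Row key: entity identifier followed by a timestamp bucket, so rows for one
# series sort together while writes stay spread across entities.
ts = datetime.datetime.utcnow()
row_key = f"EURUSD#{ts:%Y%m%d%H%M}".encode()

row = table.direct_row(row_key)
row.set_cell("quote", b"bid", b"1.0834", timestamp=ts)
row.set_cell("quote", b"ask", b"1.0836", timestamp=ts)
row.commit()
```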
