Practice Free Professional Data Engineer Exam Online Questions
You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing.
Which storage solution should you use?
- A . BigQuery
- B . Cloud Bigtable
- C . Cloud Datastore
- D . Cloud SQL for PostgreSQL
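Among these options, BigQuery provides both native geospatial processing (BigQuery GIS, including GeoJSON parsing) and native model training (BigQuery ML), so both requirements can be met inside the warehouse. The sketch below illustrates that combination; the dataset, table, and column names (shipping.telemetry, is_delayed, and so on) are hypothetical and only show the shape of the SQL.

```python
# Minimal sketch: parse GeoJSON natively and train a model in BigQuery.
# Dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

sql = """
CREATE OR REPLACE MODEL `shipping.delay_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['is_delayed']) AS
SELECT
  region,
  speed_knots,
  hours_since_last_port,
  -- BigQuery GIS parses the GeoJSON location string natively
  ST_X(ST_GEOGFROMGEOJSON(location_geojson)) AS longitude,
  ST_Y(ST_GEOGFROMGEOJSON(location_geojson)) AS latitude,
  is_delayed
FROM `shipping.telemetry`
"""
client.query(sql).result()  # blocks until the training job finishes
```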
Have the central data platform team manage all lakes’ data assets.
Explanation:
To design a data mesh architecture using Dataplex to eliminate bottlenecks caused by a central data platform team, consider the following:
Data Mesh Architecture:
Data mesh promotes a decentralized approach where domain teams manage their own data pipelines and assets, increasing agility and reducing bottlenecks.
Dataplex Lakes and Zones:
Lakes in Dataplex are logical containers for managing data at scale, and zones are subdivisions within lakes for organizing data based on domains, teams, or other criteria.
Domain and Team Management:
By creating a lake for each team and zones for each domain, each team can independently manage their data assets without relying on the central data platform team.
This setup aligns with the principles of data mesh, promoting ownership and reducing delays in data processing and insights.
Implementation Steps:
Create Lakes and Zones:
Create separate lakes in Dataplex for each team (analytics and data science).
Within each lake, create zones for the different domains (airlines, hotels, ride-hailing).
Attach BigQuery Datasets:
Attach the BigQuery datasets created by the respective teams as assets to their corresponding zones.
Decentralized Management:
Allow each domain to manage their own zone’s data assets, providing them with the autonomy to update and maintain their pipelines without depending on the central team.
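A minimal sketch of these steps using the google-cloud-dataplex Python client is shown below. The project, location, lake, zone, and dataset identifiers are placeholders, and the field names follow the dataplex_v1 client library's documented pattern but should be verified against the current API reference.

```python
# Sketch: one lake per team, one zone per domain, and the team's BigQuery
# dataset attached as an asset. All resource names and IDs are hypothetical.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/us-central1"

# 1. Lake for the analytics team
lake = client.create_lake(
    parent=parent,
    lake_id="analytics-lake",
    lake=dataplex_v1.Lake(display_name="Analytics team"),
).result()

# 2. Zone for the airlines domain inside that lake
zone = client.create_zone(
    parent=lake.name,
    zone_id="airlines",
    zone=dataplex_v1.Zone(
        type_=dataplex_v1.Zone.Type.CURATED,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    ),
).result()

# 3. Attach the team's BigQuery dataset as an asset of the zone
client.create_asset(
    parent=zone.name,
    asset_id="airlines-dataset",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
            name="projects/my-project/datasets/airlines",
        )
    ),
).result()
```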
References:
Dataplex Documentation
BigQuery Documentation
Data Mesh Principles
You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure.
What should you do?
- A . Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.
- B . Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink them to the Pub/Sub topic.
- C . Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
- D . Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.
A
Explanation:
Datastream is a serverless, scalable, and reliable service that enables you to stream data changes from Oracle and MySQL databases to Google Cloud services such as BigQuery, Cloud SQL, Google Cloud Storage, and Cloud Pub/Sub. Datastream captures and streams database changes using change data capture (CDC) technology. Datastream supports private connectivity to the source and destination systems using VPC networks. Datastream also provides a connection profile to BigQuery, which simplifies the configuration and management of the data replication.
Reference: Datastream overview
Creating a Datastream stream
Using Datastream with BigQuery
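As a rough illustration of the Datastream pieces involved (private connectivity into the VPC plus a connection profile to BigQuery), the sketch below uses the google-cloud-datastream Python client. All hostnames, credentials, and resource IDs are placeholders, the field names follow the datastream_v1 client library's documented pattern but should be verified against the current API reference, and creating the stream itself is omitted.

```python
# Sketch: connection profiles for an Oracle-to-BigQuery Datastream setup.
# Hostnames, credentials, and the private connection resource are hypothetical.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/my-project/locations/us-central1"

# Source profile: Oracle reached over private connectivity into the VPC
oracle_profile = datastream_v1.ConnectionProfile(
    display_name="oracle-source",
    oracle_profile=datastream_v1.OracleProfile(
        hostname="10.0.0.5",
        port=1521,
        username="datastream",
        password="secret",
        database_service="ORCL",
    ),
    private_connectivity=datastream_v1.PrivateConnectivity(
        private_connection=f"{parent}/privateConnections/vpc-peering"
    ),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="oracle-source",
    connection_profile=oracle_profile,
).result()

# Destination profile: BigQuery needs no endpoint details
bq_profile = datastream_v1.ConnectionProfile(
    display_name="bigquery-destination",
    bigquery_profile=datastream_v1.BigQueryProfile(),
)
client.create_connection_profile(
    parent=parent,
    connection_profile_id="bigquery-destination",
    connection_profile=bq_profile,
).result()
```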
Your company’s on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration.
What should you do?
- A . Put the data into Google Cloud Storage.
- B . Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster.
- C . Tune the Cloud Dataproc cluster so that there is just enough disk for all data.
- D . Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk.
You are designing a basket abandonment system for an ecommerce company.
The system will send a message to a user based on these rules:
– No interaction by the user on the site for 1 hour
– Has added more than $30 worth of products to the basket
– Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent.
How should you design the pipeline?
- A . Use a fixed-time window with a duration of 60 minutes.
- B . Use a sliding time window with a duration of 60 minutes.
- C . Use a session window with a gap time duration of 60 minutes.
- D . Use a global window with a time-based trigger with a delay of 60 minutes.
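For context on option C, a session window with a 60-minute gap can be expressed in the Apache Beam Python SDK as in the sketch below. The in-memory events, user key, and $30 filter wiring are illustrative assumptions; a production pipeline would read timestamped events from Pub/Sub instead.

```python
# Sketch: group each user's basket events into sessions that close after
# 60 minutes of inactivity; events arriving within the gap extend the session.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as pipeline:
    (
        pipeline
        # (user_id, basket_value) pairs; all timestamps set to 0 for brevity,
        # so both events fall into a single session.
        | "CreateEvents" >> beam.Create([("user-1", 12.50), ("user-1", 25.00)])
        | "AddTimestamps" >> beam.Map(lambda kv: window.TimestampedValue(kv, 0))
        | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=60 * 60))
        | "BasketTotalPerSession" >> beam.CombinePerKey(sum)
        | "AbandonedOver30" >> beam.Filter(lambda kv: kv[1] > 30)
        | "Print" >> beam.Map(print)
    )
```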
Cloud Dataproc charges you only for what you really use with _____ billing.
- A . month-by-month
- B . minute-by-minute
- C . week-by-week
- D . hour-by-hour
B
Explanation:
One of the advantages of Cloud Dataproc is its low cost. Dataproc charges for what you really use with minute-by-minute billing and a low, ten-minute-minimum billing period.
Reference: https://cloud.google.com/dataproc/docs/concepts/overview
You are a BigQuery admin supporting a team of data consumers who run ad hoc queries and downstream reporting in tools such as Looker. All data and users are combined under a single organizational project. You recently noticed some slowness in query results and want to troubleshoot where the slowdowns are occurring. You think that there might be some job queuing or slot contention occurring as users run jobs, which slows down access to results. You need to investigate the query job information and determine where performance is being affected.
What should you do?
- A . Use Cloud Monitoring to view BigQuery metrics and set up alerts that let you know when a certain percentage of slots were used.
- B . Use slot reservations for your project to ensure that you have enough query processing capacity and are able to allocate available slots to the slower queries.
- C . Use Cloud Logging to determine if any users or downstream consumers are changing or deleting access grants on tagged resources.
- D . Use available administrative resource charts to determine how slots are being used and how jobs are performing over time. Run a query on the INFORMATION_SCHEMA to review query performance.
D
Explanation:
To troubleshoot query performance issues related to job queuing or slot contention in BigQuery, using administrative resource charts along with querying the INFORMATION_SCHEMA is the best approach.
Here’s why option D is the best choice:
Administrative Resource Charts:
BigQuery provides detailed resource charts that show slot usage and job performance over time.
These charts help identify patterns of slot contention and peak usage times.
INFORMATION_SCHEMA Queries:
The INFORMATION_SCHEMA tables in BigQuery provide detailed metadata about query jobs, including execution times, slots consumed, and other performance metrics.
Running queries on INFORMATION_SCHEMA allows you to pinpoint specific jobs causing contention and analyze their performance characteristics.
Comprehensive Analysis:
Combining administrative resource charts with detailed queries on INFORMATION_SCHEMA provides a holistic view of the system’s performance.
This approach enables you to identify and address the root causes of performance issues, whether they are due to slot contention, inefficient queries, or other factors.
Steps to Implement:
Access Administrative Resource Charts:
Use the Google Cloud Console to view BigQuery’s administrative resource charts. These charts provide insights into slot utilization and job performance metrics over time.
Run INFORMATION_SCHEMA Queries:
Execute queries on BigQuery’s INFORMATION_SCHEMA to gather detailed information about job performance. For example:
SELECT
  creation_time,
  job_id,
  user_email,
  query,
  total_slot_ms / 1000 AS slot_seconds,
  total_bytes_processed / (1024 * 1024 * 1024) AS processed_gb,
  total_bytes_billed / (1024 * 1024 * 1024) AS billed_gb
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND state = 'DONE'
ORDER BY
  slot_seconds DESC
LIMIT 100;
Analyze and Optimize:
Use the information gathered to identify bottlenecks, optimize queries, and adjust resource allocations as needed to improve performance.
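If you prefer to run this check programmatically, for example on a schedule, the same INFORMATION_SCHEMA query can be issued through the BigQuery Python client. A minimal sketch follows; the region qualifier and row limit are adjustable assumptions.

```python
# Sketch: run the slot-usage query from Python and print the top consumers.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT job_id, user_email, total_slot_ms / 1000 AS slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND state = 'DONE'
ORDER BY slot_seconds DESC
LIMIT 10
"""
for row in client.query(query).result():
    print(row.job_id, row.user_email, round(row.slot_seconds, 1))
```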
References:
Monitoring BigQuery Slots
BigQuery INFORMATION_SCHEMA
BigQuery Performance Best Practices
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure.
What should you do?
- A . Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
- B . Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
- C . Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
- D . Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
