Practice Free Associate Data Practitioner Exam Online Questions
Your company’s ecommerce website collects product reviews from customers. The reviews are loaded as CSV files daily to a Cloud Storage bucket. The reviews are in multiple languages and need to be translated to Spanish. You need to configure a pipeline that is serverless, efficient, and requires minimal maintenance.
What should you do?
- A . Load the data into BigQuery using Dataproc. Use Apache Spark to translate the reviews by invoking the Cloud Translation API. Set BigQuery as the sink.
- B . Use a Dataflow templates pipeline to translate the reviews using the Cloud Translation API. Set BigQuery as the sink.
- C . Load the data into BigQuery using a Cloud Run function. Use the BigQuery ML create model statement to train a translation model. Use the model to translate the product reviews within BigQuery.
- D . Load the data into BigQuery using a Cloud Run function. Create a BigQuery remote function that invokes the Cloud Translation API. Use a scheduled query to translate new reviews.
D
Explanation:
Loading the data into BigQuery using a Cloud Run function and creating a BigQuery remote function that invokes the Cloud Translation API is a serverless and efficient approach. With this setup, you can use a scheduled query in BigQuery to invoke the remote function and translate new product reviews on a regular basis. This solution requires minimal maintenance, as BigQuery handles storage and querying, and the Cloud Translation API provides accurate translations without the need for custom ML model development.
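For illustration only, here is a minimal sketch of the Cloud Run function that such a remote function could invoke. The handler name and argument order are assumptions, not part of the question; BigQuery's remote function contract sends batched rows in a "calls" array and expects a "replies" array of the same length.

# Hypothetical Cloud Run handler backing a BigQuery remote function.
# BigQuery POSTs {"calls": [[arg1, ...], ...]} and expects {"replies": [...]}.
import functions_framework
from google.cloud import translate_v2 as translate

translate_client = translate.Client()

@functions_framework.http
def translate_reviews(request):
    calls = request.get_json()["calls"]          # one inner list per input row
    texts = [call[0] for call in calls]          # first argument: the review text
    results = translate_client.translate(texts, target_language="es")
    return {"replies": [r["translatedText"] for r in results]}

On the BigQuery side, a CREATE FUNCTION ... REMOTE WITH CONNECTION statement wraps this endpoint, and the scheduled query calls that function over newly loaded review rows.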
You are constructing a data pipeline to process sensitive customer data stored in a Cloud Storage bucket. You need to ensure that this data remains accessible, even in the event of a single-zone outage.
What should you do?
- A . Set up a Cloud CDN in front of the bucket.
- B . Enable Object Versioning on the bucket.
- C . Store the data in a multi-region bucket.
- D . Store the data in Nearline storage.
C
Explanation:
Storing the data in a multi-region bucket ensures high availability and durability, even in the event of a single-zone outage. Multi-region buckets replicate data across multiple regions within a large geographic area (such as US or EU), providing resilience against zone-level failures and ensuring that the data remains accessible. This approach is particularly suitable for sensitive customer data that must remain available without interruptions.
Surviving a single-zone outage requires redundancy across multiple zones or regions.
Cloud Storage offers location-based redundancy options:
Option A: Cloud CDN caches content for web delivery but doesn't protect against underlying storage outages; it addresses performance, not availability of the source data.
Option B: Object Versioning retains old versions of objects, protecting against overwrites or deletions, but doesn’t ensure availability during a zone failure (still tied to one location).
Option C: Multi-region buckets (e.g., us or eu) replicate data across multiple regions, ensuring accessibility even if a single zone or region fails. This provides the highest availability for sensitive data in a pipeline.
Option D: Nearline is a low-cost storage class for infrequently accessed data; a storage class controls access cost, not location redundancy, so it does not by itself protect against a zone outage.
Why C is Best: Multi-region storage offers geo-redundancy, guaranteeing data access during outages with no additional configuration. It’s the simplest, most reliable choice for pipeline continuity.
Extract from Google Documentation: From "Cloud Storage Bucket Locations"
(https://cloud.google.com/storage/docs/locations): "Multi-region buckets replicate data across
multiple regions within a geography, providing high availability and durability, ensuring access even if a single zone or region experiences an outage."
Reference: Google Cloud Documentation – "Cloud Storage Locations"
(https://cloud.google.com/storage/docs/locations).
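As a minimal sketch (the bucket name is a placeholder), creating a multi-region bucket only requires choosing a multi-region location such as US or EU:

# Minimal sketch: create a bucket in the US multi-region (name is a placeholder).
from google.cloud import storage

client = storage.Client()
bucket = client.create_bucket("my-sensitive-data-bucket", location="US")
print(bucket.location)  # "US"

No further replication configuration is needed; Cloud Storage handles the geo-redundancy automatically.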
You need to transfer approximately 300 TB of data from your company’s on-premises data center to Cloud Storage. You have 100 Mbps internet bandwidth, and the transfer needs to be completed as quickly as possible.
What should you do?
- A . Use Cloud Client Libraries to transfer the data over the internet.
- B . Use the gcloud storage command to transfer the data over the internet.
- C . Compress the data, upload it to multiple cloud storage providers, and then transfer the data to Cloud Storage.
- D . Request a Transfer Appliance, copy the data to the appliance, and ship it back to Google.
D
Explanation:
Transferring 300 TB over a 100 Mbps connection would take an impractical amount of time: roughly 278 days at the theoretical line rate, ignoring real-world constraints like latency and contention. Google Cloud provides the Transfer Appliance for large-scale, time-sensitive transfers.
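A quick back-of-the-envelope check of that estimate, assuming the theoretical line rate and no protocol overhead:

# Rough transfer-time estimate at the theoretical line rate (no overhead assumed).
data_bits = 300e12 * 8            # 300 TB expressed in bits (decimal terabytes)
line_rate_bps = 100e6             # 100 Mbps
seconds = data_bits / line_rate_bps
print(f"{seconds / 86400:.0f} days")   # prints ~278 days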
Option A: Cloud Client Libraries over the internet would be slow and unreliable for 300 TB due to bandwidth limitations.
Option B: The gcloud storage command is similarly constrained by internet speed and not designed for such large transfers.
Option C: Compressing and splitting across multiple providers adds complexity and isn’t a Google-supported method for Cloud Storage ingestion.
Option D: The Transfer Appliance is a physical device shipped to your location, capable of ingesting hundreds of terabytes via high-speed local copy (e.g., 300 TB in days), then shipped back to Google for upload to Cloud Storage.
Extract from Google Documentation: From "Transferring Data to Google Cloud Storage"
(https://cloud.google.com/storage/docs/transferring-data): "For transferring large amounts of data (hundreds of terabytes or more), consider using the Transfer Appliance, a high-capacity storage server that you load with data and ship to a Google data center. This is ideal when transferring over the internet would take too long."
Reference: Google Cloud Documentation – "Transfer Appliance Overview"
(https://cloud.google.com/transfer-appliance).
Your organization’s ecommerce website collects user activity logs using a Pub/Sub topic. Your organization’s leadership team wants a dashboard that contains aggregated user engagement metrics. You need to create a solution that transforms the user activity logs into aggregated metrics, while ensuring that the raw data can be easily queried.
What should you do?
- A . Create a Dataflow subscription to the Pub/Sub topic, and transform the activity logs. Load the transformed data into a BigQuery table for reporting.
- B . Create an event-driven Cloud Run function to trigger a data transformation pipeline to run. Load the transformed activity logs into a BigQuery table for reporting.
- C . Create a Cloud Storage subscription to the Pub/Sub topic. Load the activity logs into a bucket using the Avro file format. Use Dataflow to transform the data, and load it into a BigQuery table for reporting.
- D . Create a BigQuery subscription to the Pub/Sub topic, and load the activity logs into the table. Create a materialized view in BigQuery using SQL to transform the data for reporting.
A
Explanation:
Using Dataflow to subscribe to the Pub/Sub topic and transform the activity logs is the best approach for this scenario. Dataflow is a managed service designed for processing and transforming streaming data in real time. It allows you to aggregate metrics from the raw activity logs efficiently and load the transformed data into a BigQuery table for reporting. This solution ensures scalability, supports real-time processing, and enables querying of both raw and aggregated data in BigQuery, providing the flexibility and insights needed for the dashboard.
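As an illustrative sketch only (the topic, schema, and metric names are assumptions), an Apache Beam pipeline run on Dataflow could read the topic, window the events, and write per-minute engagement counts to BigQuery; the raw parsed events can be written to a second table in the same way to keep them queryable.

# Illustrative Apache Beam sketch; project, topic, and table names are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/user-activity")
            | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "Window" >> beam.WindowInto(window.FixedWindows(60))
            | "KeyByEvent" >> beam.Map(lambda event: (event["event_type"], 1))
            | "Count" >> beam.CombinePerKey(sum)
            | "ToRow" >> beam.Map(lambda kv: {"event_type": kv[0], "event_count": kv[1]})
            | "WriteAggregates" >> beam.io.WriteToBigQuery(
                "my-project:analytics.engagement_metrics",
                schema="event_type:STRING,event_count:INTEGER",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()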
Your organization uses Dataflow pipelines to process real-time financial transactions. You discover that one of your Dataflow jobs has failed. You need to troubleshoot the issue as quickly as possible.
What should you do?
- A . Set up a Cloud Monitoring dashboard to track key Dataflow metrics, such as data throughput, error rates, and resource utilization.
- B . Create a custom script to periodically poll the Dataflow API for job status updates, and send email alerts if any errors are identified.
- C . Navigate to the Dataflow Jobs page in the Google Cloud console. Use the job logs and worker logs to identify the error.
- D . Use the gcloud CLI tool to retrieve job metrics and logs, and analyze them for errors and performance bottlenecks.
C
Explanation:
To troubleshoot a failed Dataflow job as quickly as possible, you should navigate to the Dataflow Jobs page in the Google Cloud console. The console provides access to detailed job logs and worker logs, which can help you identify the cause of the failure. The graphical interface also allows you to visualize pipeline stages, monitor performance metrics, and pinpoint where the error occurred, making it the most efficient way to diagnose and resolve the issue promptly.
Extract from Google Documentation: From "Monitoring Dataflow Jobs"
(https://cloud.google.com/dataflow/docs/guides/monitoring-jobs): "To troubleshoot a failed Dataflow job quickly, go to the Dataflow Jobs page in the Google Cloud Console, where you can view job logs and worker logs to identify errors and their root causes."
Reference: Google Cloud Documentation – "Dataflow Monitoring" (https://cloud.google.com/dataflow/docs/guides/monitoring-jobs).
BigQuery
Explanation:
To build a serverless data pipeline that processes data in real-time from Pub/Sub, transforms it, and stores it for SQL-based analysis using Looker, the best solution is to use Dataflow and BigQuery. Dataflow is a fully managed service for real-time data processing and transformation, while BigQuery is a serverless data warehouse that supports SQL-based querying and integrates seamlessly with Looker for data analysis and visualization. This combination meets the requirements for real-time streaming, transformation, and efficient storage for analytical queries.
Online transactions: BigQuery; Customer feedback: Cloud Storage; Social media activity: BigQuery
Explanation:
Online transactions: Storing the transactional data in BigQuery is ideal because BigQuery is a serverless data warehouse optimized for querying and analyzing structured data at scale. It supports SQL queries and is suitable for structured transactional data.
Customer feedback: Storing customer feedback in Cloud Storage is appropriate as it allows you to store unstructured text files reliably and at a low cost. Cloud Storage also integrates well with data processing and ML tools for further analysis.
Social media activity: Storing real-time social media activity in BigQuery is optimal because BigQuery supports streaming inserts, enabling real-time ingestion and analysis of data. This allows immediate analysis and integration into dashboards or ML pipelines.
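For example, a minimal streaming-insert sketch (the table and field names are placeholders):

# Minimal sketch of a BigQuery streaming insert; table and fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
rows = [{"user_id": "u123", "platform": "x", "event": "like", "event_time": "2024-01-01T00:00:00Z"}]
errors = client.insert_rows_json("my-project.social.activity", rows)
if errors:
    print("Insert errors:", errors)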
Your company uses Looker as its primary business intelligence platform. You want to use LookML to visualize the profit margin for each of your company’s products in your Looker Explores and dashboards. You need to implement a solution quickly and efficiently.
What should you do?
- A . Create a derived table that pre-calculates the profit margin for each product, and include it in the Looker model.
- B . Define a new measure that calculates the profit margin by using the existing revenue and cost fields.
- C . Create a new dimension that categorizes products based on their profit margin ranges (e.g., high, medium, low).
- D . Apply a filter to only show products with a positive profit margin.
B
Explanation:
Defining a new measure in LookML to calculate the profit margin using the existing revenue and cost fields is the most efficient and straightforward solution. This approach allows you to dynamically compute the profit margin directly within your Looker Explores and dashboards without needing to pre-calculate or create additional tables. The measure can be defined using LookML syntax, such as:
measure: profit_margin {
  type: number
  sql: (${revenue} - ${cost}) / ${revenue} ;;
  value_format: "0.0%"
}
This method is quick to implement and integrates seamlessly into your existing Looker model, enabling accurate visualization of profit margins across your products.
Your team is building several data pipelines that contain a collection of complex tasks and dependencies that you want to execute on a schedule, in a specific order. The tasks and dependencies consist of files in Cloud Storage, Apache Spark jobs, and data in BigQuery. You need to design a system that can schedule and automate these data processing tasks using a fully managed approach.
What should you do?
- A . Use Cloud Scheduler to schedule the jobs to run.
- B . Use Cloud Tasks to schedule and run the jobs asynchronously.
- C . Create directed acyclic graphs (DAGs) in Cloud Composer. Use the appropriate operators to connect to Cloud Storage, Spark, and BigQuery.
- D . Create directed acyclic graphs (DAGs) in Apache Airflow deployed on Google Kubernetes Engine. Use the appropriate operators to connect to Cloud Storage, Spark, and BigQuery.
C
Explanation:
Using Cloud Composer to create Directed Acyclic Graphs (DAGs) is the best solution because it is a fully managed, scalable workflow orchestration service based on Apache Airflow. Cloud Composer allows you to define complex task dependencies and schedules while integrating seamlessly with Google Cloud services such as Cloud Storage, BigQuery, and Dataproc for Apache Spark jobs. This approach minimizes operational overhead, supports scheduling and automation, and provides an efficient and fully managed way to orchestrate your data pipelines.
Extract from Google Documentation: From "Cloud Composer Overview"
(https://cloud.google.com/composer/docs): "Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow, enabling you to schedule and automate complex data pipelines with dependencies across Google Cloud services like Cloud Storage, Dataproc, and BigQuery."
Reference: Google Cloud Documentation – "Cloud Composer" (https://cloud.google.com/composer).
Your company is building a near real-time streaming pipeline to process JSON telemetry data from small appliances. You need to process messages arriving at a Pub/Sub topic, capitalize letters in the serial number field, and write results to BigQuery. You want to use a managed service and write a minimal amount of code for underlying transformations.
What should you do?
- A . Use a Pub/Sub to BigQuery subscription, write results directly to BigQuery, and schedule a transformation query to run every five minutes.
- B . Use a Pub/Sub to Cloud Storage subscription, write a Cloud Run service that is triggered when objects arrive in the bucket, performs the transformations, and writes the results to BigQuery.
- C . Use the “Pub/Sub to BigQuery” Dataflow template with a UDF, and write the results to BigQuery.
- D . Use a Pub/Sub push subscription, write a Cloud Run service that accepts the messages, performs the transformations, and writes the results to BigQuery.
C
Explanation:
Using the "Pub/Sub to BigQuery" Dataflow template with a UDF (User-Defined Function) is the optimal choice because it combines near real-time processing, minimal code for transformations, and scalability. The UDF allows for efficient implementation of custom transformations, such as capitalizing letters in the serial number field, while Dataflow handles the rest of the managed pipeline seamlessly.