Free Associate Data Practitioner Practice Exam Questions
You are predicting customer churn for a subscription-based service. You have a 50 PB historical customer dataset in BigQuery that includes demographics, subscription information, and engagement metrics. You want to build a churn prediction model with minimal overhead. You want to follow the Google-recommended approach.
What should you do?
- A . Export the data from BigQuery to a local machine. Use scikit-learn in a Jupyter notebook to build the churn prediction model.
- B . Use Dataproc to create a Spark cluster. Use the Spark MLlib within the cluster to build the churn prediction model.
- C . Create a Looker dashboard that is connected to BigQuery. Use LookML to predict churn.
- D . Use the BigQuery Python client library in a Jupyter notebook to query and preprocess the data in BigQuery. Use the CREATE MODEL statement in BigQuery ML to train the churn prediction model.
D
Explanation:
Using the BigQuery Python client library to query and preprocess data directly in BigQuery and then leveraging BigQuery ML to train the churn prediction model is the Google-recommended approach for this scenario. BigQuery ML allows you to build machine learning models directly within BigQuery using SQL, eliminating the need to export data or manage additional infrastructure. This minimizes overhead, scales to a dataset as large as 50 PB, and simplifies the end-to-end process of building and training the churn prediction model.
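As a rough illustration of this pattern, the sketch below uses the BigQuery Python client library to run a BigQuery ML CREATE MODEL statement and then score customers with ML.PREDICT. The dataset, table, and column names (my_dataset.customer_features, churned, customer_id) are hypothetical placeholders, not part of the question.

```python
# Minimal sketch: train a churn model with BigQuery ML from Python.
# Dataset, table, and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT *
FROM `my_dataset.customer_features`
"""

# The training job runs entirely inside BigQuery; no data leaves the warehouse.
client.query(create_model_sql).result()

# Batch predictions can then be produced with ML.PREDICT.
predict_sql = """
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                (SELECT * FROM `my_dataset.customer_features`))
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```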
You want to process and load a daily sales CSV file stored in Cloud Storage into BigQuery for downstream reporting. You need to quickly build a scalable data pipeline that transforms the data while providing insights into data quality issues.
What should you do?
- A . Create a batch pipeline in Cloud Data Fusion by using a Cloud Storage source and a BigQuery sink.
- B . Load the CSV file as a table in BigQuery, and use scheduled queries to run SQL transformation scripts.
- C . Load the CSV file as a table in BigQuery. Create a batch pipeline in Cloud Data Fusion by using a BigQuery source and sink.
- D . Create a batch pipeline in Dataflow by using the Cloud Storage CSV file to BigQuery batch template.
A
Explanation:
Using Cloud Data Fusion to create a batch pipeline with a Cloud Storage source and a BigQuery sink is the best solution because:
Scalability: Cloud Data Fusion is a scalable, fully managed data integration service.
Data transformation: It provides a visual interface to design pipelines, enabling quick transformation of data.
Data quality insights: Cloud Data Fusion includes built-in tools for monitoring and addressing data quality issues during the pipeline creation and execution process.
You are storing data in Cloud Storage for a machine learning project. The data is frequently accessed
during the model training phase, minimally accessed after 30 days, and unlikely to be accessed after 90 days. You need to choose the appropriate storage class for the different stages of the project to minimize cost.
What should you do?
- A . Store the data in Nearline storage during the model training phase. Transition the data to Coldline storage 30 days after model deployment, and to Archive storage 90 days after model deployment.
- B . Store the data in Standard storage during the model training phase. Transition the data to Nearline storage 30 days after model deployment, and to Coldline storage 90 days after model deployment.
- C . Store the data in Nearline storage during the model training phase. Transition the data to Archive storage 30 days after model deployment, and to Coldline storage 90 days after model deployment.
- D . Store the data in Standard storage during the model training phase. Transition the data to Durable Reduced Availability (DRA) storage 30 days after model deployment, and to Coldline storage 90 days after model deployment.
B
Explanation:
Cost minimization requires matching storage classes to access patterns using lifecycle rules.
Let’s assess:
Option A: Nearline during training (frequent access) incurs high retrieval costs and latency, unsuitable for ML workloads. Coldline after 30 days and Archive after 90 days are reasonable but misaligned initially.
Option B: Standard storage (no retrieval fees, low latency) is ideal for frequent access during training. Transitioning to Nearline (30-day minimum, low access) after 30 days and Coldline (90-day minimum, rare access) after 90 days matches the pattern and minimizes costs effectively.
Option C: Nearline during training is costly for frequent access, and Archive to Coldline is illogical (Archive is cheaper than Coldline).
Option D: Durable Reduced Availability (DRA) is a legacy Cloud Storage class that has long been deprecated and should not be used for new data; the progression should be Standard -> Nearline -> Coldline.
Why B is Best: Standard ensures training efficiency, while Nearline and Coldline reduce costs as access drops, all manageable via lifecycle rules (e.g., SetStorageClass actions). This is Google’s recommended tiering strategy.
Extract from Google Documentation: From "Cloud Storage Classes"
(https://cloud.google.com/storage/docs/storage-classes): "Use Standard storage for frequently accessed data, such as during active ML training. Transition to Nearline after 30 days for infrequent
access, and Coldline after 90 days for rare access, optimizing costs with lifecycle management."
Reference: Google Cloud Documentation – "Storage Classes" (https://cloud.google.com/storage/docs/storage-classes).
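For illustration, the sketch below applies the Standard -> Nearline -> Coldline tiering with Object Lifecycle Management through the Cloud Storage Python client. The bucket name is a hypothetical placeholder; the same rules can also be defined in a JSON lifecycle configuration or in the console.

```python
# Minimal sketch: apply Standard -> Nearline -> Coldline tiering with
# Object Lifecycle Management. The bucket name is a hypothetical placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("ml-training-data-bucket")

# Objects start in Standard (the bucket's default class) and transition
# automatically as access drops off.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()
```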
Your team uses Google Sheets to track budget data that is updated daily. The team wants to compare budget data against actual cost data, which is stored in a BigQuery table. You need to create a solution that calculates the difference between each day’s budget and actual costs. You want to ensure that your team has access to daily-updated results in Google Sheets.
What should you do?
- A . Create a BigQuery external table by using the Drive URI of the Google sheet, and join the actual cost table with it. Save the joined table as a CSV file and open the file in Google Sheets.
- B . Download the budget data as a CSV file and upload the CSV file to a Cloud Storage bucket. Create a new BigQuery table from Cloud Storage, and join the actual cost table with it. Open the joined BigQuery table by using Connected Sheets.
- C . Download the budget data as a CSV file, and upload the CSV file to create a new BigQuery table. Join the actual cost table with the new BigQuery table, and save the results as a CSV file. Open the CSV file in Google Sheets.
- D . Create a BigQuery external table by using the Drive URI of the Google sheet, and join the actual cost table with it. Save the joined table, and open it by using Connected Sheets.
D
Explanation:
Why D is correct: Creating a BigQuery external table directly from the Google Sheet keeps the budget data live, so daily edits in Sheets are reflected automatically.
Joining the external table with the actual cost table in BigQuery performs the daily budget-versus-actual calculation.
Connected Sheets lets the team access and analyze the results directly in Google Sheets, with the data refreshed on demand or on a schedule.
Why the other options are incorrect:
A: Saving the result as a CSV file loses the live connection and daily updates.
B: Downloading and re-uploading the budget data as a CSV file adds unnecessary steps and loses the live connection.
C: Same issue as B; the live connection is lost.
Reference: BigQuery External Tables: https://cloud.google.com/bigquery/docs/external-tables
Connected Sheets: https://support.google.com/sheets/answer/9054368?hl=en
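A minimal sketch of option D using BigQuery DDL from the Python client is shown below. The dataset, table, and column names and the spreadsheet ID are hypothetical placeholders, and you may need to declare the sheet's schema explicitly rather than rely on auto-detection. Querying Drive-backed tables also requires credentials that include the Drive scope.

```python
# Minimal sketch: expose the Google Sheet as a BigQuery external table and
# materialize the daily budget-vs-actual comparison. All names and the
# SPREADSHEET_ID are hypothetical placeholders.
import google.auth
from google.cloud import bigquery

# Drive scope is needed to read a Sheets-backed external table.
credentials, project = google.auth.default(
    scopes=[
        "https://www.googleapis.com/auth/bigquery",
        "https://www.googleapis.com/auth/drive",
    ]
)
client = bigquery.Client(credentials=credentials, project=project)

client.query("""
CREATE OR REPLACE EXTERNAL TABLE `finance.budget_sheet`
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/SPREADSHEET_ID'],
  skip_leading_rows = 1
)
""").result()

client.query("""
CREATE OR REPLACE TABLE `finance.budget_vs_actual` AS
SELECT
  a.usage_date,
  b.budget_amount,
  a.actual_cost,
  b.budget_amount - a.actual_cost AS variance
FROM `finance.actual_costs` AS a
JOIN `finance.budget_sheet` AS b
  USING (usage_date)
""").result()
```

Once the results table exists, the team can open it from Google Sheets via Data > Data connectors > Connect to BigQuery (Connected Sheets) and schedule refreshes so the comparison stays current.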
Your company is adopting BigQuery as their data warehouse platform. Your team has experienced Python developers. You need to recommend a fully-managed tool to build batch ETL processes that extract data from various source systems, transform the data using a variety of Google Cloud services, and load the transformed data into BigQuery. You want this tool to leverage your team’s Python skills.
What should you do?
- A . Use Dataform with assertions.
- B . Deploy Cloud Data Fusion and included plugins.
- C . Use Cloud Composer with pre-built operators.
- D . Use Dataflow and pre-built templates.
C
Explanation:
The tool must be fully managed, support batch ETL, integrate with multiple Google Cloud services, and leverage Python skills.
Option A: Dataform is SQL-focused for ELT within BigQuery, not Python-centric, and lacks broad service integration for extraction.
Option B: Cloud Data Fusion is a visual ETL tool, not Python-focused, and requires more UI-based configuration than coding.
Option C: Cloud Composer (managed Apache Airflow) is fully managed, supports batch ETL via DAGs, integrates with various Google Cloud services (e.g., BigQuery, GCS) through operators, and allows custom Python code in tasks. It’s ideal for Python developers per the "Cloud Composer" documentation.
Option D: Dataflow excels at streaming and batch processing but focuses on Apache Beam (Python SDK available), not broad service orchestration. Pre-built templates limit customization.
Reference: Google Cloud Documentation – "Cloud Composer Overview" (https://cloud.google.com/composer/docs).
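As an illustrative sketch (not part of the question), a Cloud Composer DAG for this kind of batch ETL might look like the following; the bucket, dataset, and table names are hypothetical placeholders.

```python
# Minimal Cloud Composer (Airflow) sketch of a batch ETL DAG: load a file from
# Cloud Storage into BigQuery, then run a SQL transformation.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Extract/load: copy the day's CSV from Cloud Storage into a staging table.
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_sales",
        bucket="sales-landing-bucket",
        source_objects=["sales/{{ ds }}.csv"],
        destination_project_dataset_table="analytics.raw_sales",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    # Transform: aggregate the staging table into a reporting table.
    transform = BigQueryInsertJobOperator(
        task_id="transform_sales",
        configuration={
            "query": {
                "query": """
                    CREATE OR REPLACE TABLE analytics.daily_sales AS
                    SELECT region, SUM(amount) AS total_amount
                    FROM analytics.raw_sales
                    GROUP BY region
                """,
                "useLegacySql": False,
            }
        },
    )

    load_raw >> transform
```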
Your organization uses a BigQuery table that is partitioned by ingestion time. You need to remove data that is older than one year to reduce your organization’s storage costs. You want to use the most efficient approach while minimizing cost.
What should you do?
- A . Create a scheduled query that periodically runs an update statement in SQL that sets the “deleted" column to “yes” for data that is more than one year old. Create a view that filters out rows that have been marked deleted.
- B . Create a view that filters out rows that are older than one year.
- C . Require users to specify a partition filter using the alter table statement in SQL.
- D . Set the table partition expiration period to one year using the ALTER TABLE statement in SQL.
D
Explanation:
Setting the table partition expiration period to one year using the ALTER TABLE statement is the most efficient and cost-effective approach. This automatically deletes data in partitions older than one year, reducing storage costs without requiring manual intervention or additional queries. It minimizes administrative overhead and ensures compliance with your data retention policy while optimizing storage usage in BigQuery.
Extract from Google Documentation: From "Managing Partitioned Tables in BigQuery"
(https://cloud.google.com/bigquery/docs/partitioned-tables#expiration): "Set a partition expiration time using ALTER TABLE to automatically remove partitions older than a specified duration, reducing storage costs efficiently for ingestion-time partitioned tables."
Reference: Google Cloud Documentation – "BigQuery Partitioned Tables"
(https://cloud.google.com/bigquery/docs/partitioned-tables).
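For example, a one-year partition expiration can be set with a single DDL statement, here issued from the Python client; the dataset and table names are hypothetical placeholders.

```python
# Minimal sketch: set a one-year partition expiration on an ingestion-time
# partitioned table. Partitions older than 365 days are deleted automatically.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
ALTER TABLE `analytics.events`
SET OPTIONS (partition_expiration_days = 365)
""").result()
```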
You are migrating data from a legacy on-premises MySQL database to Google Cloud. The database contains various tables with different data types and sizes, including large tables with millions of rows and transactional data. You need to migrate this data while maintaining data integrity, and minimizing downtime and cost.
What should you do?
- A . Set up a Cloud Composer environment to orchestrate a custom data pipeline. Use a Python script to extract data from the MySQL database and load it to MySQL on Compute Engine.
- B . Export the MySQL database to CSV files, transfer the files to Cloud Storage by using Storage Transfer Service, and load the files into a Cloud SQL for MySQL instance.
- C . Use Database Migration Service to replicate the MySQL database to a Cloud SQL for MySQL instance.
- D . Use Cloud Data Fusion to migrate the MySQL database to MySQL on Compute Engine.
C
Explanation:
Using Database Migration Service (DMS) to replicate the MySQL database to a Cloud SQL for MySQL instance is the best approach. DMS is a fully managed service designed for migrating databases to Google Cloud with minimal downtime and cost. It supports continuous data replication, ensuring data integrity during the migration process, and handles schema and data transfer efficiently. This solution is particularly suited for large tables and transactional data, as it maintains real-time synchronization between the source and target databases, minimizing downtime for the migration.
Your organization has several datasets in their data warehouse in BigQuery. Several analyst teams in different departments use the datasets to run queries. Your organization is concerned about the variability of their monthly BigQuery costs. You need to identify a solution that creates a fixed budget for costs associated with the queries run by each department.
What should you do?
- A . Create a custom quota for each analyst in BigQuery.
- B . Create a single reservation by using BigQuery editions. Assign all analysts to the reservation.
- C . Assign each analyst to a separate project associated with their department. Create a single reservation by using BigQuery editions. Assign all projects to the reservation.
- D . Assign each analyst to a separate project associated with their department. Create a single reservation for each department by using BigQuery editions. Create assignments for each project in the appropriate reservation.
D
Explanation:
Assigning each analyst to a separate project associated with their department and creating a single reservation for each department using BigQuery editions allows for precise cost management. By assigning each project to its department’s reservation, you can allocate fixed compute resources and budgets for each department, ensuring that their query costs are predictable and controlled. This approach aligns with your organization’s goal of creating a fixed budget for query costs while maintaining departmental separation and accountability.
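A hedged sketch of this setup using BigQuery's reservation DDL is shown below. The administration project, region, reservation and assignment IDs, department project, and slot count are all hypothetical placeholders, and the exact options available depend on the edition you choose; the statements must run in the reservation's administration project and location.

```python
# Minimal sketch: create a per-department reservation with BigQuery editions
# and assign a department project to it. All identifiers are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="bq-admin-project")

client.query("""
CREATE RESERVATION `bq-admin-project.region-us.marketing-reservation`
OPTIONS (edition = 'ENTERPRISE', slot_capacity = 100)
""").result()

client.query("""
CREATE ASSIGNMENT `bq-admin-project.region-us.marketing-reservation.marketing-assignment`
OPTIONS (assignee = 'projects/marketing-analytics', job_type = 'QUERY')
""").result()
```

Repeating this pattern per department gives each one a fixed slot commitment, which caps its query spend.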
You are working with a small dataset in Cloud Storage that needs to be transformed and loaded into BigQuery for analysis. The transformation involves simple filtering and aggregation operations. You want to use the most efficient and cost-effective data manipulation approach.
What should you do?
- A . Use Dataproc to create an Apache Hadoop cluster, perform the ETL process using Apache Spark, and load the results into BigQuery.
- B . Use BigQuery’s SQL capabilities to load the data from Cloud Storage, transform it, and store the results in a new BigQuery table.
- C . Create a Cloud Data Fusion instance and visually design an ETL pipeline that reads data from Cloud Storage, transforms it using built-in transformations, and loads the results into BigQuery.
- D . Use Dataflow to perform the ETL process that reads the data from Cloud Storage, transforms it using Apache Beam, and writes the results to BigQuery.
B
Explanation:
For a small dataset with simple transformations (filtering, aggregation), Google recommends leveraging BigQuery’s native SQL capabilities to minimize cost and complexity.
Option A: Dataproc with Spark is overkill for a small dataset, incurring cluster management costs and setup time.
Option B: BigQuery can load data directly from Cloud Storage (e.g., CSV, JSON) and perform transformations using SQL in a serverless manner, avoiding additional service costs. This is the most efficient and cost-effective approach.
Option C: Cloud Data Fusion is suited for complex ETL but adds overhead (instance setup, UI design) unnecessary for simple tasks.
Option D: Dataflow is powerful for large-scale or streaming ETL but introduces unnecessary complexity and cost for a small, simple batch job.
Extract from Google Documentation: From "Loading Data into BigQuery from Cloud Storage"
(https://cloud.google.com/bigquery/docs/loading-data-cloud-storage): "You can load data directly from Cloud Storage into BigQuery and use SQL queries to transform it without needing additional processing tools, making it cost-effective for simple transformations."
Reference: Google Cloud Documentation – "BigQuery Data Loading"
(https://cloud.google.com/bigquery/docs/loading-data).
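As a minimal sketch of option B, the snippet below loads the CSV from Cloud Storage and performs the filter-and-aggregate step in SQL, all through the BigQuery Python client. The bucket, dataset, table, and column names are hypothetical placeholders.

```python
# Minimal sketch: load a small CSV from Cloud Storage into BigQuery, then run
# a simple filter-and-aggregate transformation in SQL.
from google.cloud import bigquery

client = bigquery.Client()

# Load the raw CSV into a staging table, letting BigQuery detect the schema.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://example-bucket/raw/orders.csv",
    "analytics.raw_orders",
    job_config=load_config,
).result()

# Transform with SQL: filter and aggregate into the reporting table.
client.query("""
CREATE OR REPLACE TABLE analytics.orders_by_region AS
SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
FROM analytics.raw_orders
WHERE status = 'COMPLETE'
GROUP BY region
""").result()
```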