[Google] GCP-ADP - Associate Data Practitioner Exam Dumps & Study Guide
## Exam Scope and Overview
The Google Associate Data Practitioner (ADP) examination is an associate-level certification for data professionals who want to demonstrate their expertise in designing and implementing data-driven solutions on Google Cloud. The exam validates a candidate's knowledge of core data engineering concepts, including data ingestion, transformation, and storage. Candidates will explore the role of a data practitioner, the processes for building and deploying data solutions, and the tools used in a modern data-driven environment on Google Cloud. Mastering these concepts is a vital step for any data professional building a career in the Google Cloud ecosystem.
## Target Audience
This exam is primarily designed for data practitioners, data scientists, and solution architects who build and maintain data-driven solutions on Google Cloud. It is highly beneficial for professionals responsible for managing and optimizing data pipelines, and those working in data analytics, business intelligence, and big data will also find the content invaluable for enhancing their knowledge and credibility.
## Key Topics and Domain Areas
The ADP curriculum covers a broad spectrum of associate-level data practitioner topics, including:
* **Data Ingestion and Storage on Google Cloud:** Techniques for ingesting data from various sources into Google Cloud services such as Cloud Storage, Pub/Sub, and BigQuery.
* **Data Transformation and Processing:** Building data pipelines with Google Cloud transformation and processing services such as Dataflow, Dataproc, and Cloud Data Fusion.
* **Data Analysis and Visualization:** Fundamental concepts of analyzing data in BigQuery and visualizing it with services such as Looker Studio.
* **Data Security and Privacy on Google Cloud:** Data security measures and privacy considerations, such as IAM and encryption, in a data-driven environment on Google Cloud.
* **Data Governance and Compliance:** The importance of data governance and meeting compliance requirements in an enterprise environment on Google Cloud.
* **Monitoring and Troubleshooting Data Solutions:** How to monitor and troubleshoot common data pipeline issues on Google Cloud.
## Why Prepare with NotJustExam?
Preparing for the ADP exam requires more than just memorizing definitions; it requires an understanding of how to manage and optimize data-driven solutions on the Google Cloud platform. NotJustExam offers a unique interactive learning platform that goes beyond traditional practice tests.
* **Data Practitioner Simulations:** Our questions are designed to mirror the logic used in Google Cloud data tools, helping you think like a data practitioner.
* **Detailed Explanations:** Every practice question comes with a comprehensive breakdown of the correct answer, ensuring you understand the "why" behind every data configuration and troubleshooting task on Google Cloud.
* **Targeted Study:** Focus your efforts on the areas where you need the most improvement with our intuitive performance tracking.
* **Confidence Building:** Familiarize yourself with the exam format and question style to reduce test-day anxiety and ensure you are fully prepared to succeed.
Achieve your Google Cloud data certification goals with the most effective and engaging study tool available. Visit NotJustExam today to start your journey toward becoming a certified Google Associate Data Practitioner.
## Free [Google] GCP-ADP - Associate Data Practitioner Practice Questions Preview
-
Question 1
Your retail company wants to predict customer churn using historical purchase data stored in BigQuery. The dataset includes customer demographics, purchase history, and a label indicating whether the customer churned or not. You want to build a machine learning model to identify customers at risk of churning. You need to create and train a logistic regression model for predicting customer churn, using the customer_data table with the churned column as the target label. Which BigQuery ML query should you use?
- A. -------------------------
- B. -------------------------
- C. -------------------------
- D. -------------------------
Correct Answer:
B
Explanation:
The suggested answer B is correct.
Reasoning: Option B correctly creates a logistic regression model and uses all columns from the `customer_data` table as features, except for the `churned` column, which is correctly specified as the label. This is the correct approach for training a model to predict churn based on other customer data features. The query uses `CREATE OR REPLACE MODEL` to create or replace the model if it already exists, `OPTIONS(model_type='logistic_reg')` to specify the logistic regression model type, and the `SELECT` statement to define the features and label. This will create the model needed to predict customer churn.
Reasons for Incorrect Answers:
- Option A is incorrect because `SELECT *` includes the `churned` column as a feature, which is not what we want because it will cause data leakage. We want the model to predict the `churned` column, not use it as an input. It does not specify which column is the label.
- Option C is incorrect because `SELECT * EXCEPT(churned)` excludes the `churned` column, which is our target label. A model cannot be trained without a target label.
- Option D is incorrect because `SELECT churned as label` only selects the label and does not select any features (customer demographics, purchase history, etc.) needed for training the model. The model needs features to learn the relationship between customer data and churn.
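Since the answer choices are redacted in this preview, the following is an illustrative sketch of the kind of query the explanation attributes to option B; the model and table paths are placeholders, not the exam's actual text:

```sql
-- Illustrative only: model and dataset names are placeholders.
CREATE OR REPLACE MODEL `mydataset.churn_model`
OPTIONS (model_type = 'logistic_reg') AS
SELECT
  * EXCEPT (churned),  -- all other columns (demographics, purchase history) become features
  churned AS label     -- the target column, renamed to BigQuery ML's default label name
FROM
  `mydataset.customer_data`;
```

Renaming the target column to `label` is one standard way to designate it in BigQuery ML; the `input_label_cols` option achieves the same thing without the rename.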
- BigQuery ML Documentation, https://cloud.google.com/bigquery-ml/docs
-
Question 2
Your company has several retail locations. Your company tracks the total number of sales made at each location each day. You want to use SQL to calculate the weekly moving average of sales by location to identify trends for each store. Which query should you use?
- A. -------------------------
- B. -------------------------
- C. -------------------------
- D. -------------------------
Correct Answer:
C
Explanation:
The suggested answer C is correct.
Reasoning:
The correct SQL query for calculating the weekly moving average of sales by location must include three key elements:
- Partitioning by location: the moving average must be calculated independently for each retail location (store), using a `PARTITION BY store_id` clause.
- Ordering by date: the moving average must be calculated chronologically, so the sales data is ordered by date within each store partition using an `ORDER BY date` clause.
- Defining the moving average window: a 7-day window (the weekly moving average, with one row per day) is specified using a `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` clause.
Option C is the only query that satisfies all of these requirements: it partitions the data by `store_id`, orders it by `date`, and defines a 7-day moving average window.
Why the other options are incorrect:
- Option A orders by `total_sales` instead of `date`, so the moving average would not be calculated chronologically.
- Option B partitions by `date` instead of `store_id`, which would calculate a moving average across all stores for each date instead of a per-store moving average.
- Option D partitions by `total_sales`, which is nonsensical for analyzing trends by location, even though it orders by `date` within that illogical partition.
Therefore, based on the requirement to calculate the weekly moving average of sales *by location* to identify trends for each store, option C is the most appropriate query.
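Putting those three elements together, a query of the kind option C describes might look like the following sketch (the table and column names are assumptions, with one row per store per day):

```sql
-- 7-day moving average of daily sales per store.
-- `daily_sales`, `store_id`, `date`, and `total_sales` are placeholder names.
SELECT
  store_id,
  date,
  AVG(total_sales) OVER (
    PARTITION BY store_id                     -- compute independently for each store
    ORDER BY date                             -- chronological order within each store
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- the current day plus the six prior days
  ) AS weekly_moving_avg
FROM daily_sales;
```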
Citations:
- SQL Window Functions, https://learnsql.com/blog/sql-window-functions/
- Moving Average in SQL, https://www.sqlservercentral.com/articles/moving-averages-in-sql-server
-
Question 3
Your company is building a near real-time streaming pipeline to process JSON telemetry data from small appliances. You need to process messages arriving at a Pub/Sub topic, capitalize letters in the serial number field, and write results to BigQuery. You want to use a managed service and write a minimal amount of code for underlying transformations. What should you do?
- A. Use a Pub/Sub to BigQuery subscription, write results directly to BigQuery, and schedule a transformation query to run every five minutes.
- B. Use a Pub/Sub to Cloud Storage subscription, write a Cloud Run service that is triggered when objects arrive in the bucket, performs the transformations, and writes the results to BigQuery.
- C. Use the “Pub/Sub to BigQuery” Dataflow template with a UDF, and write the results to BigQuery.
- D. Use a Pub/Sub push subscription, write a Cloud Run service that accepts the messages, performs the transformations, and writes the results to BigQuery.
Correct Answer:
C
Explanation:
The suggested answer C is correct.
The most efficient solution for processing near real-time streaming data from Pub/Sub, applying a simple transformation (capitalizing serial numbers), and loading it into BigQuery is to use the "Pub/Sub to BigQuery" Dataflow template with a UDF (User-Defined Function). This approach leverages a managed service (Dataflow) designed for streaming data pipelines, minimizing the amount of custom code required. The UDF allows for simple transformations within the Dataflow pipeline itself.
Option A is incorrect because scheduling a transformation query to run every five minutes does not provide a near real-time solution. It introduces latency, and data will not be immediately available in BigQuery after it arrives in Pub/Sub.
Option B is incorrect because writing to Cloud Storage first and then triggering a Cloud Run service adds unnecessary complexity. This approach involves managing storage and triggering mechanisms, which are not required when using Dataflow. It also increases the overall latency of the pipeline.
Option D is incorrect because using a Pub/Sub push subscription to Cloud Run requires writing more code to handle the message ingestion, transformation, and loading into BigQuery. The Dataflow template with a UDF provides a more streamlined and managed solution with less code.
Therefore, Option C is the most suitable solution as it meets the requirements of using a managed service, minimizing code, and achieving near real-time processing.
Citations:
- Dataflow templates, https://cloud.google.com/dataflow/docs/templates/provided-streaming
- Pub/Sub to BigQuery template, https://cloud.google.com/dataflow/docs/templates/pubsub-to-bigquery
-
Question 4
You want to process and load a daily sales CSV file stored in Cloud Storage into BigQuery for downstream reporting. You need to quickly build a scalable data pipeline that transforms the data while providing insights into data quality issues. What should you do?
- A. Create a batch pipeline in Cloud Data Fusion by using a Cloud Storage source and a BigQuery sink.
- B. Load the CSV file as a table in BigQuery, and use scheduled queries to run SQL transformation scripts.
- C. Load the CSV file as a table in BigQuery. Create a batch pipeline in Cloud Data Fusion by using a BigQuery source and sink.
- D. Create a batch pipeline in Dataflow by using the Cloud Storage CSV file to BigQuery batch template.
Correct Answer:
A
Explanation:
Option A is the most suitable answer.
Reasoning:
Option A suggests using Cloud Data Fusion to create a batch pipeline with Cloud Storage as the source and BigQuery as the sink. This is the most efficient and scalable approach for the following reasons:
- Scalability and Transformation: Cloud Data Fusion is designed for building scalable data pipelines and offers robust transformation capabilities.
- Data Quality Insights: It provides built-in features for data quality monitoring and validation, allowing for insights into data quality issues directly within the pipeline.
- Visual Interface: Cloud Data Fusion's visual interface allows for quick pipeline development, making it faster to implement compared to writing custom code.
Reasons for Not Choosing Other Options:
- Option B (Load to BigQuery, use Scheduled Queries): While this approach is feasible, scheduled queries in BigQuery are less ideal for a full-fledged data pipeline. They offer fewer built-in data quality features and are less scalable and manageable than a dedicated pipeline tool like Cloud Data Fusion.
- Option C (Load to BigQuery, Data Fusion BQ to BQ): This option introduces redundancy. Loading data into BigQuery and then using Data Fusion to move data from BigQuery to BigQuery is less efficient than directly loading from Cloud Storage to BigQuery via Data Fusion.
- Option D (Dataflow template): While Dataflow is a powerful tool, using a Dataflow template for a simple CSV-to-BigQuery transformation is overkill here. Cloud Data Fusion provides a simpler, visually oriented way to achieve the same result more quickly, especially when data quality insights are needed within the pipeline. Dataflow is more code-driven, less visual, and may require more development effort for a simple use case.
Therefore, Cloud Data Fusion offers the best balance of scalability, data quality monitoring, and ease of development for this scenario.
Citations:
- Cloud Data Fusion, https://cloud.google.com/data-fusion/docs
-
Question 5
You manage a Cloud Storage bucket that stores temporary files created during data processing. These temporary files are only needed for seven days, after which they are no longer needed. To reduce storage costs and keep your bucket organized, you want to automatically delete these files once they are older than seven days. What should you do?
- A. Set up a Cloud Scheduler job that invokes a weekly Cloud Run function to delete files older than seven days.
- B. Configure a Cloud Storage lifecycle rule that automatically deletes objects older than seven days.
- C. Develop a batch process using Dataflow that runs weekly and deletes files based on their age.
- D. Create a Cloud Run function that runs daily and deletes files older than seven days.
Correct Answer:
B
Explanation:
The suggested answer B is correct.
Reasoning:
The most efficient and cost-effective way to automatically delete files older than seven days in a Cloud Storage bucket is to use Cloud Storage lifecycle rules. Lifecycle rules are specifically designed for managing objects in Cloud Storage based on their age, storage class, and other attributes. This approach avoids the need for custom scripting, scheduling, or additional services like Cloud Run or Dataflow, simplifying the solution and reducing operational overhead.
Reasons for not choosing other options:
- Option A (Cloud Scheduler + Cloud Run): While this setup would work, it introduces unnecessary complexity and overhead. It requires writing and deploying a Cloud Run function, configuring a Cloud Scheduler job, and managing the associated infrastructure. Cloud Storage lifecycle rules provide a built-in, more direct solution.
- Option C (Dataflow): Dataflow is designed for large-scale data processing and is overkill for simply deleting files based on age. Using Dataflow would be significantly more complex and expensive than using lifecycle rules.
- Option D (Cloud Run): Similar to Option A, using a Cloud Run function adds unnecessary complexity compared to the built-in lifecycle rule functionality. While it avoids Cloud Scheduler, it's still more involved than simply configuring a lifecycle rule. Furthermore, running it daily might incur higher costs than necessary.
Therefore, configuring a Cloud Storage lifecycle rule is the most appropriate solution for this scenario due to its simplicity, efficiency, and cost-effectiveness.
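As an illustration, a lifecycle configuration implementing option B could look like the following JSON, which can be applied with `gcloud storage buckets update gs://BUCKET_NAME --lifecycle-file=lifecycle.json` (the bucket name is a placeholder):

```json
{
  "rule": [
    {
      "action": {"type": "Delete"},
      "condition": {"age": 7}
    }
  ]
}
```

The `age` condition is measured in days since the object was created, so this rule deletes any object once it is more than seven days old, with no scheduler or custom code involved.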
Citations:
- Lifecycle Management, Google Cloud Documentation, https://cloud.google.com/storage/docs/lifecycle
-
Question 6
You work for a healthcare company that has a large on-premises data system containing patient records with personally identifiable information (PII) such as names, addresses, and medical diagnoses. You need a standardized managed solution that de-identifies PII across all your data feeds prior to ingestion to Google Cloud. What should you do?
- A. Use Cloud Run functions to create a serverless data cleaning pipeline. Store the cleaned data in BigQuery.
- B. Use Cloud Data Fusion to transform the data. Store the cleaned data in BigQuery.
- C. Load the data into BigQuery, and inspect the data by using SQL queries. Use Dataflow to transform the data and remove any errors.
- D. Use Apache Beam to read the data and perform the necessary cleaning and transformation operations. Store the cleaned data in BigQuery.
Correct Answer:
B
Explanation:
The suggested answer B is correct.
Reasoning: The question requires a standardized, managed solution for de-identifying PII *before* ingestion into Google Cloud. Cloud Data Fusion is specifically designed for data integration and transformation tasks through a visual interface, making it easier to build and manage data pipelines for de-identification. It is a fully managed service, reducing the operational overhead.
Reasons for not choosing other options:
- A: Cloud Run is suitable for serverless applications but not ideal for large-scale data transformation and de-identification pipelines requiring a managed ETL service.
- C: Loading data into BigQuery before de-identification violates the core requirement of de-identifying PII *prior* to ingestion.
- D: While Apache Beam is a powerful data processing framework, it requires more coding and infrastructure management than Cloud Data Fusion, which offers a visual, fully managed environment. Apache Beam itself is a framework rather than a managed service, and the question explicitly asks for a *managed* solution.
Therefore, Cloud Data Fusion is the most appropriate choice given the requirements.
Citations:
- Cloud Data Fusion, https://cloud.google.com/data-fusion/docs/overview
-
Question 7
You manage a large amount of data in Cloud Storage, including raw data, processed data, and backups. Your organization is subject to strict compliance regulations that mandate data immutability for specific data types. You want to use an efficient process to reduce storage costs while ensuring that your storage strategy meets retention requirements. What should you do?
- A. Configure lifecycle management rules to transition objects to appropriate storage classes based on access patterns. Set up Object Versioning for all objects to meet immutability requirements.
- B. Move objects to different storage classes based on their age and access patterns. Use Cloud Key Management Service (Cloud KMS) to encrypt specific objects with customer-managed encryption keys (CMEK) to meet immutability requirements.
- C. Create a Cloud Run function to periodically check object metadata, and move objects to the appropriate storage class based on age and access patterns. Use object holds to enforce immutability for specific objects.
- D. Use object holds to enforce immutability for specific objects, and configure lifecycle management rules to transition objects to appropriate storage classes based on age and access patterns.
Correct Answer:
D
Explanation:
The suggested answer D is correct: use object holds to enforce immutability for specific objects, and configure lifecycle management rules to transition objects to appropriate storage classes based on age and access patterns.
Reasoning:
- Object holds provide a mechanism to enforce immutability for specific objects, which directly addresses the compliance requirements. Object holds prevent objects from being deleted or overwritten for a specified duration or indefinitely, ensuring data immutability.
- Lifecycle management rules allow you to automatically transition objects between different storage classes (e.g., Standard, Nearline, Coldline, Archive) based on their age and access patterns. This helps reduce storage costs by moving infrequently accessed data to cheaper storage classes while maintaining availability. This efficiently balances cost optimization and retention needs.
Why the other options are not ideal:
- Option A: Object Versioning primarily focuses on data recovery and maintaining historical versions of objects. While it provides a form of data protection, it does not guarantee strict immutability the way object holds do, because noncurrent versions can still be deleted by users with sufficient permissions. The lifecycle management portion of this option is appropriate for cost reduction, but the immutability mechanism is wrong.
- Option B: Cloud KMS and CMEK are related to encryption, not immutability. Encryption protects data confidentiality, but it does not prevent data modification or deletion. Moving objects to different storage classes is appropriate, but it doesn't fulfill the immutability requirement.
- Option C: Using a Cloud Run function to manage storage class transitions is less efficient and more complex than using Cloud Storage lifecycle management rules. Object holds are the correct mechanism for enforcing immutability, but Cloud Run is not the most efficient way to manage storage classes.
Therefore, the combination in option D of object holds (immutability) and lifecycle management rules (cost optimization) provides the most efficient and compliant solution for managing data in Cloud Storage under strict retention requirements.
Citations:
- Cloud Storage Object Holds, https://cloud.google.com/storage/docs/object-holds
- Cloud Storage Object Lifecycle Management, https://cloud.google.com/storage/docs/lifecycle
-
Question 8
You work for an ecommerce company that has a BigQuery dataset that contains customer purchase history, demographics, and website interactions. You need to build a machine learning (ML) model to predict which customers are most likely to make a purchase in the next month. You have limited engineering resources and need to minimize the ML expertise required for the solution. What should you do?
- A. Use BigQuery ML to create a logistic regression model for purchase prediction.
- B. Use Vertex AI Workbench to develop a custom model for purchase prediction.
- C. Use Colab Enterprise to develop a custom model for purchase prediction.
- D. Export the data to Cloud Storage, and use AutoML Tables to build a classification model for purchase prediction.
Correct Answer:
A
Explanation:
The suggested answer A is correct.
The question emphasizes minimizing ML expertise and leveraging existing data within BigQuery. BigQuery ML is the most suitable option because it allows building and deploying ML models directly within BigQuery using SQL, which reduces the need for extensive ML knowledge or specialized tools. It directly addresses the need for minimizing ML expertise. This is especially useful when the data is already in BigQuery, as it avoids data export and simplifies the overall workflow.
Here's why the other options are less suitable:
- B. Vertex AI Workbench: While powerful, Vertex AI Workbench is geared towards more complex, custom model development, requiring significant ML expertise. This contradicts the requirement to minimize ML expertise.
- C. Colab Enterprise: Similar to Vertex AI Workbench, Colab Enterprise is a development environment that requires coding and ML knowledge, making it unsuitable for users aiming to minimize ML expertise.
- D. AutoML Tables: AutoML Tables can build a classification model with little ML expertise, but exporting the data to Cloud Storage adds unnecessary complexity and time. BigQuery ML offers a more streamlined approach when the data already resides in BigQuery, and AutoML Tables also requires additional setup steps.
Therefore, BigQuery ML offers the simplest and most direct path to building a purchase prediction model with limited ML expertise, directly aligning with the prompt's requirements.
Citations:
- BigQuery ML Documentation, https://cloud.google.com/bigquery-ml/docs
-
Question 9
You are designing a pipeline to process data files that arrive in Cloud Storage by 3:00 am each day. Data processing is performed in stages, where the output of one stage becomes the input of the next. Each stage takes a long time to run. Occasionally a stage fails, and you have to address the problem. You need to ensure that the final output is generated as quickly as possible. What should you do?
- A. Design a Spark program that runs under Dataproc. Code the program to wait for user input when an error is detected. Rerun the last action after correcting any stage output data errors.
- B. Design the pipeline as a set of PTransforms in Dataflow. Restart the pipeline after correcting any stage output data errors.
- C. Design the workflow as a Cloud Workflow instance. Code the workflow to jump to a given stage based on an input parameter. Rerun the workflow after correcting any stage output data errors.
- D. Design the processing as a directed acyclic graph (DAG) in Cloud Composer. Clear the state of the failed task after correcting any stage output data errors.
Correct Answer:
D
Explanation:
The suggested answer D is correct: design the processing as a directed acyclic graph (DAG) in Cloud Composer, and clear the state of the failed task after correcting any stage output data errors.
Reasoning:
Cloud Composer, based on Apache Airflow, is designed explicitly for orchestrating complex workflows, including data processing pipelines. It provides a way to define dependencies between tasks in a DAG, allowing for efficient scheduling and execution. When a task fails, Cloud Composer allows you to clear the state of the failed task and rerun it after addressing the underlying issue, without needing to rerun the entire pipeline. This targeted error recovery ensures that the final output is generated as quickly as possible.
- Efficient error recovery: Cloud Composer's DAG structure allows for rerunning only the failed tasks, saving significant time compared to restarting the entire pipeline.
- Orchestration: It is purpose-built for workflow orchestration, so schedules (such as a run triggered after the daily 3:00 am file arrival) can be defined directly.
- Task dependencies: It allows defining dependencies between tasks, ensuring proper execution order.
Reasons for not choosing other options:
- Option A: Design a Spark program that runs under Dataproc. Code the program to wait for user input when an error is detected. Rerun the last action after correcting any stage output data errors.
- Reason: This approach requires manual intervention and is less efficient. Waiting for user input introduces delays and is not suitable for automated pipelines. Also, it does not inherently support the concept of a DAG for managing task dependencies.
- Option B: Design the pipeline as a set of PTransforms in Dataflow. Restart the pipeline after correcting any stage output data errors.
- Reason: While Dataflow is a good choice for data processing, restarting the entire pipeline after an error is inefficient, especially for long-running pipelines. Dataflow is more focused on the data processing aspect rather than workflow orchestration and dependency management.
- Option C: Design the workflow as a Cloud Workflow instance. Code the workflow to jump to a given stage based on an input parameter. Rerun the workflow after correcting any stage output data errors.
- Reason: Cloud Workflows is suitable for orchestrating serverless functions and API calls, but it is less optimized for data processing pipelines than Cloud Composer. Manually coding the workflow to jump to a specific stage based on an input parameter adds complexity and potential for errors, and long-running data pipeline orchestration is not Cloud Workflows' primary use case.
Citations:
- Cloud Composer, https://cloud.google.com/composer
- Apache Airflow, https://airflow.apache.org/
-
Question 10
Another team in your organization is requesting access to a BigQuery dataset. You need to share the dataset with the team while minimizing the risk of unauthorized copying of data. You also want to create a reusable framework in case you need to share this data with other teams in the future. What should you do?
- A. Create authorized views in the team’s Google Cloud project that is only accessible by the team.
- B. Create a private exchange using Analytics Hub with data egress restriction, and grant access to the team members.
- C. Enable domain restricted sharing on the project. Grant the team members the BigQuery Data Viewer IAM role on the dataset.
- D. Export the dataset to a Cloud Storage bucket in the team’s Google Cloud project that is only accessible by the team.
Correct Answer:
A
Explanation:
The suggested answer A is correct: create authorized views in the team's Google Cloud project that are only accessible by the team.
Reasoning:
Authorized views are a suitable solution for securely sharing BigQuery datasets within an organization. They grant access to query results without exposing the underlying tables, which minimizes the risk of unauthorized copying. Creating authorized views can also be easily automated or templatized, satisfying the requirement for a reusable framework, and they are a simple, effective method for internal data sharing.
Why other options are not the best:
- Option B: Create a private exchange using Analytics Hub with data egress restriction, and grant access to the team members. While Analytics Hub can provide a secure framework for sharing data, it is generally better suited for sharing data externally with customers or partners, and is more complex and costly than authorized views for internal use cases.
- Option C: Enable domain restricted sharing on the project. Grant the team members the BigQuery Data Viewer IAM role on the dataset. Domain restricted sharing does not prevent copying the data; it only restricts sharing outside the specified domain. Furthermore, granting the BigQuery Data Viewer IAM role directly on the dataset would allow the team members to query the underlying tables directly, which goes against the requirement to minimize the risk of unauthorized copying.
- Option D: Export the dataset to a Cloud Storage bucket in the team’s Google Cloud project that is only accessible by the team. Exporting the data creates a copy, which increases storage costs and data management overhead. Also, it does not align with the requirement to minimize the risk of unauthorized copying.
Therefore, creating authorized views is the most appropriate approach to sharing the BigQuery dataset securely and efficiently, while also creating a reusable framework for future sharing.
Citations:
- Authorized views, https://cloud.google.com/bigquery/docs/authorized-views