Question 1
You are building an ML model to detect anomalies in real-time sensor data. You will use Pub/Sub to handle incoming requests. You want to store the results for analytics and visualization. How should you configure the pipeline?
Based on the question and the discussion, I agree with the suggested answer A.
Here's a breakdown of the reasoning:
- Data ingestion and processing (1): Dataflow is a fully managed, serverless stream and batch data processing service. It's ideal for reading data from Pub/Sub, performing any necessary transformations, and writing the data to a storage solution. This makes it well-suited for real-time anomaly detection pipelines where you need to process streaming sensor data.
- Anomaly detection (2): AI Platform (now Vertex AI) provides the tools and infrastructure to train and deploy machine learning models. You can deploy your anomaly detection model to Vertex AI and have the Dataflow pipeline call it to score incoming data in real time.
- Storage for analytics and visualization (3): BigQuery is a fully managed, serverless data warehouse that is well-suited for storing and analyzing large datasets. It's a good choice for storing the results of your anomaly detection model so that you can perform analytics and create visualizations.
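The three stages above can be sketched in plain Python. This is a minimal local stand-in, not the Dataflow, Vertex AI, or BigQuery APIs: the message schema (`sensor_id`, `value`, `ts`), the threshold rule in `score_reading`, and the output row layout are all assumptions made for illustration. In a real pipeline, stage 1 would be a Dataflow step reading from Pub/Sub, stage 2 a call to a deployed model, and stage 3 a write to a BigQuery table.

```python
import json
from typing import Dict, Iterable, List

# Hypothetical threshold standing in for a deployed model's decision rule.
ANOMALY_THRESHOLD = 100.0


def parse_message(payload: bytes) -> Dict:
    """Stage 1 (ingest): decode a Pub/Sub-style JSON payload into a dict."""
    return json.loads(payload.decode("utf-8"))


def score_reading(reading: Dict) -> float:
    """Stage 2 (detect): stand-in for a call to a hosted model.

    Returns an anomaly score; here it is simply the absolute sensor value.
    """
    return abs(float(reading["value"]))


def to_row(reading: Dict, score: float) -> Dict:
    """Stage 3 (store): shape the result as a flat row for a warehouse table."""
    return {
        "sensor_id": reading["sensor_id"],
        "ts": reading["ts"],
        "value": reading["value"],
        "anomaly_score": score,
        "is_anomaly": score > ANOMALY_THRESHOLD,
    }


def run_pipeline(payloads: Iterable[bytes]) -> List[Dict]:
    """Chain the three stages over a sequence of raw message payloads."""
    rows = []
    for payload in payloads:
        reading = parse_message(payload)
        rows.append(to_row(reading, score_reading(reading)))
    return rows


if __name__ == "__main__":
    # Illustrative messages only; not real sensor data.
    messages = [
        b'{"sensor_id": "s1", "value": 20.5, "ts": "t0"}',
        b'{"sensor_id": "s1", "value": 250.0, "ts": "t1"}',
    ]
    for row in run_pipeline(messages):
        print(row)
```

Keeping each stage as its own function mirrors the separation in answer A: the scoring step can be swapped for a real model call without touching ingestion or storage.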
Why the other options are not as suitable:
- Option B: Dataproc runs Hadoop/Spark clusters and is geared mainly toward batch workloads, so it isn't the best fit for real-time streaming data. AutoML, while useful for some ML tasks, isn't a direct replacement for deploying a custom anomaly detection model, and Cloud Bigtable is a NoSQL database that is less suited to analytics than BigQuery.
- Option C: BigQuery is a good choice for storing the results, and AutoML could potentially be used to train the model. However, Cloud Functions is better suited to lightweight, event-driven tasks than to the continuous, high-throughput stream processing needed in this scenario.
- Option D: While AI Platform is appropriate for hosting the ML model, using Cloud Storage as the primary storage for the results is less suitable for analytics and visualization purposes compared to BigQuery. Cloud Storage is better suited for storing raw data or files.
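One way to see why a stream processor is preferred over per-event functions in Option C: anomaly scoring often depends on each sensor's recent history, which means keeping state across messages. The sketch below illustrates that idea in plain Python with a rolling z-score. The window size, the z-score rule, and the `RollingDetector` class are assumptions for illustration and do not come from the question; in Dataflow this kind of state would be managed by the pipeline itself (for example through windowing) rather than by an in-memory dict.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class RollingDetector:
    """Hypothetical per-sensor detector that keeps a short history in memory."""

    def __init__(self, window: int = 5, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        # One bounded history per sensor; old readings fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, sensor_id: str, value: float) -> bool:
        """Return True if value deviates strongly from the sensor's recent history."""
        past = self.history[sensor_id]
        is_anomaly = False
        # Need at least two past readings to estimate a spread.
        if len(past) >= 2:
            sigma = pstdev(past)
            if sigma > 0:
                is_anomaly = abs(value - mean(past)) / sigma > self.z_threshold
        past.append(value)
        return is_anomaly


if __name__ == "__main__":
    detector = RollingDetector()
    # Illustrative readings only; not real sensor data.
    for v in [10.0, 11.0, 10.0, 11.0, 10.0, 50.0]:
        print(v, detector.observe("s1", v))
```

A stateless function invoked once per message would have to fetch and update this history from an external store on every call, which is the kind of bookkeeping a streaming pipeline handles for you.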
Therefore, Dataflow -> AI Platform -> BigQuery provides the most complete and efficient solution for this real-time anomaly detection pipeline.