GCP Data Engineer Streaming and Dataflow Guide 2026 : Cert-Pass Blog

Official source note

This page focuses on streaming and Dataflow for the GCP Professional Data Engineer exam. The safest way to study the topic is to keep the exam hub open while you work through the official facts and the service selection patterns. Google positions the Professional Data Engineer certification as validating the ability to design, build, operationalize, secure, and monitor data processing systems on Google Cloud. The main Cert Pass hub remains /exams/google-gcp-professional-data-engineer.

Exam facts

Exam name: GCP Professional Data Engineer
Exam slug: google-gcp-professional-data-engineer
Vendor: Google
Cert Pass landing page: /exams/google-gcp-professional-data-engineer
Study hub: /exams/google-gcp-professional-data-engineer
Official vendor page: Google Cloud Professional Data Engineer

Why this article exists

The goal here is not to collect trivia. The goal is to build the habit of reading a scenario, identifying the category, and choosing the simplest service that directly fits the requirement.

Fast study map

Use the exam hub, /exams/google-gcp-professional-data-engineer, throughout review. That internal link should act as the stable anchor for practice, revision, and final review.

GCP Data Engineer Streaming and Dataflow Guide 2026

Streaming is one of the highest value topics on the GCP Professional Data Engineer exam because it combines ingestion, processing, reliability, and storage choices in one flow. The exam often describes a live event source, a transformation requirement, and a serving destination. The candidate must know which service handles each step and which Dataflow or Pub/Sub concept solves the hidden failure mode.

The core pattern is straightforward. Pub/Sub receives events. Dataflow processes them with Apache Beam. BigQuery stores the analytical output or Bigtable serves low latency reads. The difficult part is the detail inside the stream: event time, windows, triggers, watermarks, late data, idempotency, and failure handling.

Exam Facts

Detail	Value
Exam	GCP Professional Data Engineer
Exam slug	google-gcp-professional-data-engineer
Vendor	Google
Questions	40-50 (multiple choice and multiple select)
Time limit	2 hours
Passing score	Not published by Google
Retirement date	None published
Replacement exam	None published

Domain Breakdown

Domain	Weight	What to focus on
Ingesting and processing the data	25.0 percent	Pub/Sub, Dataflow, stream design, and event time reasoning
Designing data processing systems	22.0 percent	Architecture choice, managed services, and scaling decisions
Storing the data	20.0 percent	BigQuery versus Bigtable versus Cloud Storage destinations
Maintaining and automating data workloads	18.0 percent	Reliability, dead letter handling, monitoring, and replay
Preparing and using data for analysis	15.0 percent	Analytics sinks, reporting, and downstream consumption

The default streaming architecture

The default Google Cloud streaming path is one of the most repeated patterns on the exam.

Producers publish events to Pub/Sub. A Dataflow pipeline reads those events, transforms them, handles duplicates or late arrivals, and then writes validated output to BigQuery or Bigtable. Cloud Monitoring watches backlog, error rates, and latency. A dead letter topic or quarantine path captures records that cannot be processed.

That pattern is powerful because it separates concerns. Pub/Sub handles buffering and decoupling. Dataflow handles transformation and stream semantics. BigQuery handles analytics. Bigtable handles serving when the access pattern needs low latency key based lookups. The exam often gives a business problem and expects the candidate to reconstruct this pattern without being told directly.

Pub/Sub fundamentals

Pub/Sub is the ingestion layer for event driven systems. It accepts messages from publishers and delivers them to subscribers. It is useful when producers and consumers must be decoupled or when traffic is bursty.

The exam may test the difference between a topic and a subscription. A topic is what publishers write to. A subscription is how subscribers receive messages. Pull subscriptions are common in processing systems such as Dataflow. Push subscriptions are more useful when Pub/Sub should deliver events to an HTTPS endpoint.

Pub/Sub also appears in reliability questions. Messages may be delivered more than once, so downstream processing should be idempotent. The exam may ask how to handle repeated delivery or how to avoid duplicate effects. The answer usually involves deduplication logic, unique event identifiers, or an idempotent sink strategy.
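
The deduplication idea can be sketched in a few lines of plain Python. This is a minimal illustration, not Pub/Sub or Dataflow code: the event_id field and the message shape are assumptions made for the example, and a real streaming job would need to bound the set of seen identifiers (for example with a time limit) rather than keep it forever.

```python
from typing import Dict, Iterable, Iterator


def deduplicate(messages: Iterable[Dict], id_field: str = "event_id") -> Iterator[Dict]:
    """Yield each message once, keyed on a unique event identifier."""
    seen = set()  # hypothetical in-memory store; unbounded in this sketch
    for message in messages:
        key = message[id_field]
        if key in seen:
            continue  # redelivery of a message that was already handled
        seen.add(key)
        yield message


deliveries = [
    {"event_id": "a1", "amount": 10},
    {"event_id": "b2", "amount": 5},
    {"event_id": "a1", "amount": 10},  # redelivered copy of the first message
]
unique = list(deduplicate(deliveries))
```

The redelivered copy is skipped because its identifier has already been seen, so downstream totals are computed from two events rather than three.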

Dataflow and Apache Beam

Dataflow is the managed Apache Beam service on Google Cloud. It is the expected answer when the prompt requires stream processing, windowing, or event time aware transformations. Dataflow can run both batch and streaming pipelines, which makes it even more useful in exam scenarios where one code path should work for both bounded and unbounded data.

Apache Beam concepts are worth understanding because the exam often uses the language of PCollections, PTransforms, and runners even when the service is Dataflow. The important point is not the terminology alone. The important point is that Dataflow provides managed scaling, parallel processing, and stream semantics without requiring the team to manage clusters manually.

Windows, triggers, and event time

Windowing is one of the most testable streaming concepts. A continuous stream must be grouped into finite chunks before analytics can be computed. Fixed windows are used when the prompt asks for counts per minute or per hour. Sliding windows are used when the prompt asks for rolling summaries. Session windows are used when activity should be grouped by periods of user interaction.
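
To make the three window types concrete, here is a small plain-Python sketch of how a timestamp could be assigned to fixed, sliding, and session windows. It mirrors the concepts only and is not the Apache Beam API; the function names, the use of integer seconds, and the sample values are assumptions for illustration.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in seconds of event time


def fixed_window(ts: int, size: int) -> Window:
    """The single non-overlapping window that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    """Every window of length size, started every period seconds, that contains ts."""
    start = ts - (ts % period)  # latest window start at or before ts
    windows = []
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions separated by at least gap seconds of inactivity."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


minute = fixed_window(125, size=60)
rolling = sliding_windows(125, size=60, period=30)
sessions = session_windows([0, 10, 15, 100, 105], gap=30)
```

An event at second 125 falls into exactly one fixed window but into two overlapping sliding windows, which is why sliding windows suit rolling summaries. The session example splits into two groups because of the 85 second pause between 15 and 100.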

Event time is more important than processing time for many real world streams. If records can arrive out of order, the pipeline should reason about when events happened rather than when they were processed. That is why allowed lateness, triggers, and watermarks matter.

A watermark is the pipeline’s estimate of how complete the data is up to a given point in event time. If a record arrives after the watermark has passed the end of the window it belongs to, it is late. The correct handling depends on the business requirement. Some pipelines can ignore late data. Others must accept it for a limited time, the allowed lateness, and update prior results.
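
The relationship between the watermark, the window end, and allowed lateness can be sketched as a small classification function. This is a simplified model under stated assumptions: fixed windows, integer seconds, and a single watermark value passed in by the caller. The function name and labels are hypothetical.

```python
def classify(event_time: int, watermark: int, window_size: int, allowed_lateness: int) -> str:
    """Classify a record against its fixed window, given the current watermark."""
    window_end = event_time - (event_time % window_size) + window_size
    if watermark < window_end:
        return "on_time"  # the window is not yet considered complete
    if watermark < window_end + allowed_lateness:
        return "late_accepted"  # late, but still inside the allowed lateness
    return "late_dropped"  # too late; the window has been finalized


on_time = classify(event_time=50, watermark=55, window_size=60, allowed_lateness=120)
accepted = classify(event_time=50, watermark=100, window_size=60, allowed_lateness=120)
dropped = classify(event_time=50, watermark=200, window_size=60, allowed_lateness=120)
```

The same record is treated three different ways depending only on how far the watermark has advanced when it arrives, which is the reasoning the exam expects for delayed telemetry scenarios.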

Late data and recovery patterns

The exam often hides late data behind a business narrative. A prompt may say that mobile devices send telemetry minutes after the fact, or that network outages cause delayed delivery. If the analytics must still be accurate, allowed lateness and event time windows become central.

Triggers define when the window emits results. A watermark trigger emits when the system believes the window is complete. A processing time trigger emits after a time delay. A count based trigger emits after a number of elements. The correct answer depends on whether the prompt emphasizes completeness, timeliness, or early partial output.

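Late data handling is often combined with accumulation mode. In accumulating mode, each new result for a window includes everything the window has seen so far, so the result is refined as late records arrive. In discarding mode, each result contains only the elements that arrived since the previous firing, so the consumer must combine the results itself. The exam may not require memorizing every option, but it does require knowing that these controls exist and are used for stream correctness.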
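
A short sketch of what each firing emits under the two accumulation modes, assuming Beam-style semantics in which accumulating panes carry the running result and discarding panes carry only the new elements. The panes function and the sample firings are hypothetical.

```python
from typing import List


def panes(firings: List[List[int]], accumulating: bool) -> List[int]:
    """Return the sum emitted at each trigger firing for one window."""
    emitted = []
    running = 0
    for new_elements in firings:
        batch = sum(new_elements)
        running += batch
        emitted.append(running if accumulating else batch)
    return emitted


# One on-time firing followed by two firings caused by late records.
firings = [[1, 2], [3], [4]]
accumulated = panes(firings, accumulating=True)
discarded = panes(firings, accumulating=False)
```

With accumulating panes the last value is already the full window total, so a sink can overwrite the previous row. With discarding panes the sink has to add the values together to reach the same total.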

Exactly once and duplicate prevention

Streaming systems must be designed for duplication. Pub/Sub delivery is at least once by default, and workers can restart. That means the downstream design should tolerate repeated records.

The exam may ask for exactly once semantics. In practice, that means a combination of the right source, the right sink, and correct pipeline configuration. A strong answer often includes unique event identifiers and a sink that can deduplicate or merge records safely. If the destination is BigQuery, the design should not rely on blind append alone when duplicates would create incorrect results.

The important reasoning step is this. A message pipeline does not become correct simply because the transport is managed. Correctness comes from the full design, including source, transformation, and sink behavior.
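
The difference between a blind append and an idempotent sink can be shown with an in-memory stand-in for the destination table. This is only a sketch: the IdempotentSink class and the event_id key are assumptions, and a real sink would rely on a merge or upsert capability of the destination.

```python
from typing import Dict, List


class IdempotentSink:
    """Stores one row per event_id, so replaying a record has no extra effect."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict] = {}

    def write(self, record: Dict) -> None:
        self.rows[record["event_id"]] = record  # upsert keyed on the event identifier


records = [
    {"event_id": "a1", "amount": 10},
    {"event_id": "b2", "amount": 5},
    {"event_id": "a1", "amount": 10},  # replayed after a worker restart
]

append_only: List[Dict] = []
sink = IdempotentSink()
for record in records:
    append_only.append(record)  # blind append keeps the duplicate
    sink.write(record)          # keyed write absorbs it

append_total = sum(r["amount"] for r in append_only)
idempotent_total = sum(r["amount"] for r in sink.rows.values())
```

The append-only total is inflated by the replayed record, while the keyed sink reports the correct figure regardless of how many times the record is delivered.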

Dead letter handling and quarantine

A healthy streaming system does not let bad records block good ones. That is why dead letter topics and quarantine tables matter. If some records fail validation or parsing, they should be isolated for later review rather than causing the full pipeline to stop.

The exam may describe malformed events, schema drift, or records that fail downstream processing. The right design usually routes those failures to a separate topic or table, then alerts operations so the team can investigate. This pattern keeps the main path moving while preserving the ability to reprocess failed records later.
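
A minimal sketch of the routing decision, assuming JSON payloads and a required event_id field (both are assumptions for the example). Valid records continue on the main path, and failures are kept with their original payload and an error message so they can be reprocessed later.

```python
import json
from typing import Dict, List, Tuple


def route(raw_messages: List[str]) -> Tuple[List[Dict], List[Dict]]:
    """Split raw messages into parsed events and dead letter entries."""
    valid, dead_letter = [], []
    for raw in raw_messages:
        try:
            event = json.loads(raw)  # raises a ValueError subclass on bad JSON
            if not isinstance(event, dict) or "event_id" not in event:
                raise ValueError("missing event_id")
            valid.append(event)
        except ValueError as exc:
            # Preserve the original payload so the record can be replayed after a fix.
            dead_letter.append({"payload": raw, "error": str(exc)})
    return valid, dead_letter


valid, dead_letter = route(['{"event_id": "a1"}', "not json", '{"user": 1}'])
```

One bad record never stops the loop, which is the property the dead letter pattern is meant to guarantee.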

Sink choice: BigQuery or Bigtable

BigQuery is the right sink when the output is for analytics, reporting, or SQL based exploration. Bigtable is the better choice when the output must be read at very low latency by applications or services. The exam often uses this distinction directly.

If the prompt mentions dashboards, trend analysis, or business reporting, BigQuery is usually correct. If the prompt mentions user profile lookups, recommendation serving, or operational key value access, Bigtable is often the better answer. Choosing the sink correctly is just as important as choosing Pub/Sub and Dataflow correctly.

Monitoring and operational control

Streaming systems need visibility. Cloud Monitoring and logging are important when the prompt mentions backlog, lag, errors, latency, or alerting. A mature streaming architecture tracks whether messages are building up in Pub/Sub, whether Dataflow is falling behind, whether errors are increasing, and whether the sink is healthy.

The exam often tests operational reasoning through failure stories. A team may discover that the pipeline is slow, that records are dropping, or that duplicates are appearing. The correct response is usually to improve observability, idempotency, or scaling, not to hide the symptom with a manual cleanup script.
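
As a rough illustration of backlog reasoning, the check below flags a subscription whose backlog rises at every sample and grows by more than a threshold. The function, the sample values, and the threshold are hypothetical; in practice this logic would normally live in a Cloud Monitoring alerting policy on a Pub/Sub backlog metric such as num_undelivered_messages rather than in custom code.

```python
from typing import Sequence


def backlog_is_growing(samples: Sequence[int], min_increase: int) -> bool:
    """True when the backlog rises at every sample and by at least min_increase overall."""
    if len(samples) < 2:
        return False
    rising = all(later > earlier for earlier, later in zip(samples, samples[1:]))
    return rising and samples[-1] - samples[0] >= min_increase


falling_behind = backlog_is_growing([1_000, 4_000, 9_000, 15_000], min_increase=10_000)
bursty_but_fine = backlog_is_growing([1_000, 6_000, 2_000, 500], min_increase=10_000)
```

A steadily rising backlog suggests the consumer cannot keep up and needs scaling or tuning, while a spike that drains on its own is normal bursty traffic.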

Common streaming service patterns

Requirement	Usually best fit	Why
Ingest many event sources	Pub/Sub	Decoupling and elastic buffering
Transform streaming data	Dataflow	Managed Apache Beam and event time support
Count events per minute	Dataflow with fixed windows	Windowed aggregation over streaming data
Track rolling activity	Dataflow with sliding windows	Overlapping time windows for ongoing metrics
Group user sessions	Dataflow with session windows	Gap based event grouping
Handle late events	Event time, allowed lateness, triggers	Maintains accuracy when events arrive out of order
Capture bad records	Dead letter topic or quarantine table	Keeps the main pipeline moving
Serve low latency lookups	Bigtable	Fast serving access pattern
Run analytics	BigQuery	SQL reporting and warehouse use case

Common exam traps

The exam uses several predictable traps. One is suggesting Cloud Scheduler or a cron job for a live event problem. Another is suggesting Cloud SQL as the central streaming hub. Another is suggesting that a batch load is acceptable when the scenario clearly needs live or near real time processing.

Another trap is confusing ingestion with processing. Pub/Sub can receive data, but it does not transform it. Dataflow can transform data, but it is not the best storage layer. BigQuery can analyze data, but it is not the first choice for stream transformation. The correct answer usually comes from respecting those boundaries.

A simple mental model for stream questions

Identify the producer and the event type.
Decide whether the system needs buffering, transformation, or both.
Choose Pub/Sub for ingestion when decoupling is needed.
Choose Dataflow for stream processing and event time logic.
Choose BigQuery for analytics or Bigtable for serving.
Add monitoring, deduplication, and dead letter handling if reliability matters.

That model solves many of the scenario prompts on the exam because the structure of the question often mirrors the structure of the platform.
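
The steps above can be written down as a small decision function. It is a study aid under simplifying assumptions, not an architecture tool: the function name, the two access pattern labels, and the boolean flags are all invented for the example.

```python
from typing import List


def plan_pipeline(needs_decoupling: bool, needs_processing: bool,
                  access_pattern: str, reliability_matters: bool) -> List[str]:
    """Map scenario requirements to the services named in the mental model."""
    if access_pattern not in ("analytics", "serving"):
        raise ValueError("access_pattern must be 'analytics' or 'serving'")
    plan = []
    if needs_decoupling:
        plan.append("Pub/Sub")  # ingestion and buffering
    if needs_processing:
        plan.append("Dataflow")  # stream processing and event time logic
    plan.append("BigQuery" if access_pattern == "analytics" else "Bigtable")
    if reliability_matters:
        plan += ["monitoring", "deduplication", "dead letter handling"]
    return plan


dashboard_plan = plan_pipeline(True, True, "analytics", True)
lookup_plan = plan_pipeline(True, True, "serving", False)
```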

Study priorities

Learn Pub/Sub topic and subscription behavior.
Learn how Dataflow uses windows, triggers, watermarks, and lateness.
Practice choosing between BigQuery and Bigtable as sinks.
Review idempotency and duplicate handling.
Review dead letter and quarantine patterns.
Practice monitoring and backlog scenarios.

Final guidance

Streaming questions look complex, but they usually reduce to a few clean decisions. The candidate who can identify the ingestion layer, the processing layer, the sink, and the failure handling pattern will recognize the best answer quickly.

For a focused next step, start with the GCP Professional Data Engineer landing page.

Extended official revision notes

Google Cloud Professional Data Engineer Exam Course

1. Exam Overview

What the exam is testing

The Google Cloud Professional Data Engineer exam validates whether you can design, build, operationalize, secure, monitor, optimize, and troubleshoot data processing systems on Google Cloud. The exam is not mainly a memorization test. It tests whether you can read a business scenario, identify the real constraint, eliminate tempting but wrong services, and choose the most managed, secure, reliable, scalable, and cost-effective Google Cloud architecture.

The current official standard exam guide organizes the exam into five domains:

Designing data processing systems
Ingesting and processing the data
Storing the data
Preparing and using data for analysis
Maintaining and automating data workloads

The standard exam is 2 hours, contains 40-50 multiple-choice and multiple-select questions, and is available in English and Japanese. The certification is valid for 2 years. Google also offers a shorter renewal exam for active certificate holders.

How to think like the exam

In most questions, the correct answer is the option that best balances the following priorities:

Meet the business requirement first. Do not optimize cost if the scenario says the system is mission critical and needs low latency or high availability.
Use managed services when possible. Prefer BigQuery, Dataflow, Pub/Sub, Dataplex, Cloud Composer, Cloud Data Fusion, or managed databases over self-managed infrastructure unless the scenario explicitly requires custom frameworks or legacy compatibility.
Use least privilege and governance by design. Correct answers often include IAM, service accounts, policy tags, authorized views, row-level security, Cloud KMS, VPC Service Controls, Dataplex, and audit logging.
Select the service by access pattern. Storage questions are usually about reads, writes, latency, consistency, scale, relational requirements, and analytics requirements.
Avoid operational burden. If two options work, the exam usually favors the one with less manual administration and fewer custom scripts.
Watch the words. Terms such as streaming, near real time, ACID, global consistency, time-series, low latency, petabyte analytics, serverless, batch, orchestration, CDC, data residency, and PII usually point to specific services.

How to use this course

Use this file as a compressed revision guide. Start with the exam domains, then study the service-selection tables, then practice the architecture patterns and traps. The question bank behind this course repeatedly emphasizes BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, and monitoring/optimization patterns. Those are the highest-yield services and decisions for this exam.

2. Exam Domains

Domain	Official Weight	Priority	What matters most
Designing data processing systems	~22%	Very high	Security, compliance, governance, reliability, portability, migrations, project/dataset/table architecture
Ingesting and processing the data	~25%	Highest	Dataflow, Pub/Sub, Beam, Dataproc, Data Fusion, pipeline design, batch vs streaming, orchestration, CI/CD
Storing the data	~20%	Very high	BigQuery, Cloud Storage, BigLake, Bigtable, Spanner, Cloud SQL, Firestore, Memorystore, data lake/warehouse design
Preparing and using data for analysis	~15%	Medium	BI patterns, BigQuery performance, BigQuery ML, data sharing, Analytics Hub, masking, DLP, reports
Maintaining and automating data workloads	~18%	High	Cost, reservations, Composer DAGs, monitoring, troubleshooting, fault tolerance, quotas, restarts, replication

Priority notes

The highest-value preparation sequence is:

Ingesting and processing because it is the largest domain and appears in many scenario questions.
Designing systems because it drives architecture, security, compliance, and migration decisions.
Storing data because service selection is one of the most common exam traps.
Maintaining and automating because the exam frequently asks how to reduce cost, automate pipelines, monitor failures, and recover safely.
Preparing and using data for analysis because it often appears as BigQuery performance, BI, sharing, masking, and ML-readiness scenarios.

3. Start-to-Finish Study Path

Phase 1: Foundation

Learn the core Google Cloud data platform map:

Cloud Storage: durable object storage, landing zones, data lakes, lifecycle policies, raw files.
BigQuery: serverless data warehouse for analytics, SQL, partitioning, clustering, BI, sharing, ML.
Pub/Sub: asynchronous messaging and streaming ingestion backbone.
Dataflow: Apache Beam-based managed batch and streaming processing.
Dataproc: managed Spark/Hadoop for existing Spark, Hadoop, Hive, or custom ecosystem workloads.
Cloud Composer: managed Apache Airflow for DAG orchestration.
Dataform: SQL-based transformations and dependency management in BigQuery.
Cloud Data Fusion: visual/low-code ETL and integration.
Dataplex: data lake governance, cataloging, discovery, zones, policy-based governance.
Cloud SQL / AlloyDB / Spanner / Bigtable / Firestore / Memorystore: operational databases selected by workload pattern.

Foundation goal: when you see a scenario, quickly map it to the correct managed service.

Phase 2: Intermediate

Study decision patterns:

Batch vs streaming.
Data warehouse vs data lake vs lakehouse.
Relational vs NoSQL vs object storage.
OLTP vs OLAP.
Low latency serving vs analytical scanning.
Event ingestion vs workflow orchestration.
Transformation vs orchestration.
Governance enforcement vs naming conventions.
Regional data residency vs global access.
Persistent clusters vs ephemeral job clusters.

Intermediate goal: eliminate wrong answers based on access pattern, operational burden, security, and cost.

Phase 3: Advanced

Focus on architecture and tradeoffs:

End-to-end ingestion: source → Pub/Sub/Storage/Datastream → Dataflow/Data Fusion/Dataproc → BigQuery/BigLake/Cloud Storage.
Change data capture: Datastream into Cloud Storage or BigQuery-oriented pipelines.
Real-time analytics: Pub/Sub + Dataflow + BigQuery.
Historical migration: Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance.
Data governance: Dataplex + IAM + policy tags + row-level security + DLP + audit logs.
Cost and performance: partitioning, clustering, materialized views, BI Engine, reservations, query optimization.
Reliability: idempotent pipelines, dead-letter topics, retries, checkpoints, validation, monitoring, multi-region strategy.

Advanced goal: choose the best architecture under multiple constraints.

Phase 4: Final Review

During the last week:

Memorize service-selection rules.
Review BigQuery partitioning, clustering, authorized views, policy tags, BI Engine, reservations, and materialized views.
Review Dataflow streaming concepts: windows, triggers, late data, watermarks, exactly-once-style processing behavior, dead-letter handling, and idempotent sinks.
Review storage selection: BigQuery vs Bigtable vs Spanner vs Cloud SQL vs Firestore vs Cloud Storage.
Review security patterns: least privilege, service accounts, CMEK, DLP, row/column security, audit logging, VPC Service Controls.
Review operational patterns: Cloud Monitoring, Cloud Logging, Composer DAG retries, quotas, BigQuery job history, cost controls.

4. Core Concepts by Domain

Domain 1: Designing data processing systems

Concepts

This domain tests whether you can design secure, compliant, reliable, flexible, portable, and migration-ready data systems.

Key ideas:

Map business requirements to architecture before choosing tools.
Separate environments such as development, test, and production using projects, datasets, service accounts, and IAM boundaries.
Enforce least privilege at the narrowest practical level.
Govern sensitive data using policy tags, row-level security, authorized views, Cloud DLP, and Cloud KMS.
Design for data residency by choosing correct BigQuery dataset locations, Cloud Storage bucket locations, and regional services.
Build validation and reconciliation into migration plans.
Prefer managed, repeatable, auditable patterns over manual scripts.

Services

Requirement	Service or feature to think about
Fine-grained access in BigQuery	IAM, authorized views, row-level security, column-level security, policy tags
PII discovery or masking	Cloud DLP / Sensitive Data Protection, BigQuery masking policies
Encryption key control	Cloud KMS, CMEK
Governance and catalog	Dataplex, Dataplex Catalog
Data residency	Region-specific datasets and buckets
Bulk data transfer	Storage Transfer Service, Transfer Appliance
Database migration	Database Migration Service
CDC from databases	Datastream
Warehouse migration	BigQuery Data Transfer Service, staged loads, validation queries

Patterns

Secure analytics pattern

Store raw data in restricted datasets.
Apply least privilege IAM.
Use policy tags for sensitive columns.
Use row-level security for business unit or geography restrictions.
Expose curated datasets or authorized views to analysts.
Audit access with Cloud Audit Logs and BigQuery job history.

Data residency pattern

Keep raw data in the required region.
Avoid copying sensitive data across regions unless explicitly allowed.
Share only aggregated, anonymized, or policy-approved data when global reporting is needed.
Validate that BigQuery dataset location, Cloud Storage bucket location, and processing region align.

Migration pattern

Analyze current state and stakeholder requirements.
Choose migration tool based on source and volume.
Perform staged loads.
Reconcile row counts, checksums, and business aggregates.
Run parallel validation before cutover.
Switch consumers only after validation passes.
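
The reconciliation step can be sketched as comparing a row count and an order-independent checksum for the source and target. This is an illustrative approach with in-memory rows; the table_fingerprint function and the sample data are assumptions, and a real migration would usually compute equivalent aggregates with queries on each side.

```python
import hashlib
from typing import Iterable, Tuple


def table_fingerprint(rows: Iterable[tuple]) -> Tuple[int, int]:
    """Return (row count, checksum) where the checksum ignores row order."""
    count = 0
    checksum = 0
    for row in rows:
        count += 1
        digest = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
        # Addition is order independent and, unlike XOR, does not cancel duplicate rows.
        checksum = (checksum + int(digest, 16)) % (2 ** 256)
    return count, checksum


source = [(1, "a", 10.0), (2, "b", 5.5), (2, "b", 5.5)]
target_ok = list(reversed(source))                      # same rows, different order
target_bad = [(1, "a", 10.0), (2, "b", 5.5), (2, "b", 9.9)]  # one value changed

matches = table_fingerprint(source) == table_fingerprint(target_ok)
mismatch = table_fingerprint(source) == table_fingerprint(target_bad)
```

A matching count with a different checksum is the case that a row count alone would miss, which is why the pattern lists checksums and business aggregates alongside counts.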

Traps

Choosing naming conventions instead of IAM or policy enforcement.
Granting BigQuery Admin to analysts for convenience.
Copying regulated data to another region just to simplify reporting.
Migrating with a one-time copy and no validation.
Using Pub/Sub for historical bulk migration when Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, or Transfer Appliance fits better.
Building custom governance scripts instead of using Dataplex, IAM, policy tags, DLP, and audit logs.

Domain 2: Ingesting and processing the data

Concepts

This is the largest exam domain. It tests whether you can plan, build, deploy, and operationalize batch and streaming pipelines.

Key ideas:

Pub/Sub ingests events. It is not a transformation engine or scheduler.
Dataflow transforms batch and streaming data. It is the default managed service for Apache Beam pipelines.
Dataproc runs Spark/Hadoop workloads. Use it for existing Spark/Hadoop/Hive ecosystems or custom cluster-level dependencies.
Cloud Data Fusion provides visual ETL. Use it when low-code integration and connectors matter.
Cloud Composer orchestrates workflows. It coordinates jobs; it should not perform heavy transformations inside DAG code.
Dataform manages SQL transformations in BigQuery. Use it for SQL modeling, dependencies, testing, and repeatable BigQuery transformations.

Services

Scenario	Best fit	Why
Streaming events from apps or devices	Pub/Sub	Decoupled, durable event ingestion
Real-time transformation and enrichment	Dataflow	Managed Beam streaming with windows, triggers, state, late data handling
Existing Spark jobs with custom libraries	Dataproc	Managed Spark/Hadoop compatibility
Low-code ETL with connectors	Cloud Data Fusion	Visual pipeline design and integration
SQL transformations in BigQuery	Dataform	Versioned SQL workflows and dependency management
Scheduling complex multi-step workflows	Cloud Composer	Airflow DAG orchestration
Lightweight service orchestration	Workflows	Serverless orchestration of APIs and services
CDC from operational databases	Datastream	Change streams for replication and analytics ingestion
Kafka integration	Pub/Sub Kafka connector or managed integration pattern	Avoid self-managing unless required

Patterns

Streaming analytics pattern

Producers publish events to Pub/Sub.
Dataflow reads Pub/Sub messages.
Dataflow validates, enriches, windows, and handles late data.
Invalid records go to a dead-letter topic or error table.
Clean results are written to BigQuery or Bigtable depending on access pattern.
Monitoring alerts on backlog, errors, latency, and failed workers.

Batch file ingestion pattern

Files land in Cloud Storage.
Cloud Composer or Eventarc triggers processing.
Dataflow, Dataproc, Data Fusion, or BigQuery loads transform the data.
BigQuery stores curated analytics tables.
Dataform manages SQL models and tests.

CDC analytics pattern

Datastream captures database changes.
Changes land in Cloud Storage or feed downstream pipelines.
Dataflow or BigQuery transformations merge updates into curated tables.
Validate latency, ordering, duplicates, and schema evolution.
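
The merge step of this pattern can be sketched as applying change events to a keyed table while tolerating duplicates and out-of-order arrival. The event shape (key, op, version, row) is a hypothetical format chosen for the example and is not the Datastream record layout.

```python
from typing import Dict, List


def apply_changes(table: Dict[int, Dict], changes: List[Dict]) -> Dict[int, Dict]:
    """Apply change events by version, skipping stale or duplicate events."""
    for change in sorted(changes, key=lambda c: c["version"]):
        current = table.get(change["key"])
        if current is not None and current["version"] >= change["version"]:
            continue  # older or repeated change; applying it would corrupt the row
        row = None if change["op"] == "delete" else change["row"]
        # Deletes are kept as tombstones so an older update cannot resurrect the row.
        table[change["key"]] = {"version": change["version"], "row": row}
    return table


changes = [
    {"key": 1, "op": "update", "version": 3, "row": {"name": "Ann B."}},
    {"key": 1, "op": "insert", "version": 1, "row": {"name": "Ann"}},
    {"key": 2, "op": "insert", "version": 2, "row": {"name": "Bob"}},
    {"key": 2, "op": "delete", "version": 4, "row": None},
    {"key": 1, "op": "update", "version": 3, "row": {"name": "Ann B."}},  # duplicate
]
table = apply_changes({}, changes)
live_rows = {key: entry["row"] for key, entry in table.items() if entry["row"] is not None}
```

Ordering by a source-provided version and ignoring anything not newer than the stored row is what makes the merge safe to rerun.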

Traps

Using Cloud Composer as the data processing engine instead of orchestrator.
Using Pub/Sub as a database or long-term store.
Choosing Dataproc for new simple serverless pipelines when Dataflow is more managed.
Choosing Dataflow for a legacy Spark job that must preserve Spark APIs and dependencies; Dataproc is usually better.
Ignoring late-arriving data in streaming questions.
Writing custom retry scripts instead of using managed retries, dead-letter topics, idempotent processing, and monitoring.
Loading malformed records directly into production tables instead of quarantine/error tables.

Domain 3: Storing the data

Concepts

This domain tests service selection and data platform design. Most storage questions are solved by identifying the access pattern.

Ask these questions:

Is it analytics or transactions?
Is it structured, semi-structured, unstructured, or file/object data?
Is the workload read-heavy, write-heavy, or mixed?
Is strong global consistency required?
Is millisecond key-value access required?
Is SQL relational modeling required?
Is horizontal scale more important than joins?
Is the primary access pattern large scans or point lookups?

Services

Service	Use when	Avoid when
BigQuery	Petabyte-scale SQL analytics, BI, ELT, warehouse, federated analytics	OLTP transactions, low-latency point updates, application serving database
Cloud Storage	Raw files, landing zone, data lake objects, backups, archives	Relational queries, high-frequency row updates, transactional workloads
BigLake	Governed lakehouse access over data lakes and BigQuery	Simple object storage without governance needs
Bigtable	Massive scale, low-latency key-value/wide-column, time-series, IoT, high write throughput	Ad hoc SQL analytics, joins, transactions across rows
Spanner	Globally scalable relational database with strong consistency and high availability	Simple single-region relational apps where Cloud SQL is enough
Cloud SQL	Managed MySQL/PostgreSQL/SQL Server for traditional relational apps	Global horizontal relational scale or massive analytics
AlloyDB	High-performance PostgreSQL-compatible operational workloads	Non-PostgreSQL workloads or analytical warehouse use cases
Firestore	Serverless document database for mobile/web apps with flexible documents	Analytical scans, relational joins, warehouse workloads
Memorystore	Managed Redis/Memcached caching, session state, low-latency cache	Durable source of truth or analytics
Dataplex	Governed data lake/platform management, cataloging, zones	Replacing storage or processing engines

Patterns

Warehouse pattern

Use BigQuery for curated analytical tables.
Partition by time or ingestion date for large time-based data.
Cluster by frequently filtered/joined columns.
Use materialized views or summary tables for repeated expensive aggregations.
Use authorized views, row-level security, and policy tags for controlled access.
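
A quick arithmetic sketch shows why date partitioning matters for cost and speed. The partition sizes and dates are invented for the example, and the model ignores column pruning and clustering, which reduce scanned bytes further in practice.

```python
from datetime import date, timedelta
from typing import Dict, Set

GIB = 1024 ** 3


def scanned_bytes(partition_sizes: Dict[date, int], wanted_days: Set[date]) -> int:
    """Bytes read when only the partitions for wanted_days are scanned."""
    return sum(size for day, size in partition_sizes.items() if day in wanted_days)


start = date(2026, 1, 1)
partitions = {start + timedelta(days=i): 2 * GIB for i in range(30)}  # 30 daily partitions
last_week = {start + timedelta(days=i) for i in range(23, 30)}        # final 7 days

full_scan_gib = sum(partitions.values()) // GIB
pruned_scan_gib = scanned_bytes(partitions, last_week) // GIB
```

A query that filters on the partitioning column for the last seven days reads 14 GiB instead of 60 GiB in this example, which is the effect the exam expects you to reach for before adding capacity.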

Lakehouse pattern

Land raw data in Cloud Storage.
Govern discovery and access with Dataplex and BigLake.
Process raw to curated zones using Dataflow, Dataproc, Data Fusion, or BigQuery.
Serve analytics in BigQuery.

Operational serving pattern

Use Cloud SQL or AlloyDB for traditional relational application databases.
Use Spanner for global relational scale with strong consistency.
Use Bigtable for extremely high-throughput key-value/time-series workloads.
Use Firestore for serverless document-based mobile/web apps.
Use Memorystore for caching, not durable storage.

Traps

Choosing BigQuery for user-facing low-latency transactional workloads.
Choosing Cloud Storage when users need SQL analytics and BI without defining BigQuery or BigLake access.
Choosing Cloud SQL for global scale and multi-region strong consistency when Spanner is the better fit.
Choosing Bigtable when the question requires joins, SQL, or multi-row transactions.
Choosing Firestore for analytical reporting.
Treating Memorystore as durable storage.
Ignoring lifecycle policies and storage class cost optimization for Cloud Storage.

Domain 4: Preparing and using data for analysis

Concepts

This domain tests whether you can prepare data for BI, ML, sharing, visualization, and secure analysis.

Key ideas:

BigQuery is central for analytical preparation.
Performance tuning often involves partitioning, clustering, pruning, query rewrite, materialized views, BI Engine, and avoiding SELECT *.
Security for analysis often uses row-level security, column-level security, policy tags, authorized views, masking, IAM, and DLP.
BigQuery ML is useful when the model can be trained and used directly in BigQuery with SQL.
Vertex AI is more appropriate for advanced custom ML workflows, feature stores, training pipelines, deployment, and MLOps.
Analytics Hub is used for controlled data sharing and publishing datasets.

Services

Requirement	Best fit
Fast BI dashboard over BigQuery	BI Engine, materialized views, aggregated tables, partitioning/clustering
Repeated expensive aggregations	Materialized views or scheduled summary tables
Controlled dataset sharing	Analytics Hub, authorized views, dataset access controls
Mask or classify PII	Cloud DLP / Sensitive Data Protection, policy tags, masking
SQL-based ML directly on warehouse data	BigQuery ML
Advanced custom ML lifecycle	Vertex AI
Prepare unstructured text for RAG	Embeddings, vector search patterns, preprocessing pipelines, governed storage

Patterns

BI performance pattern

Partition large fact tables by date.
Cluster by high-cardinality filter or join columns where useful.
Precompute common aggregates.
Use materialized views for repeated deterministic aggregations.
Enable BI Engine for interactive dashboards.
Avoid scanning unnecessary columns and partitions.

Secure sharing pattern

Publish curated datasets rather than raw sensitive data.
Use Analytics Hub for managed sharing.
Use authorized views to expose limited views.
Use policy tags and masking for sensitive columns.
Use row-level security for tenant, region, or department filtering.

ML preparation pattern

Use BigQuery to clean, join, and prepare structured features.
Use BigQuery ML for SQL-native models and simple forecasting/classification/regression.
Use Vertex AI for custom training, feature management, pipelines, endpoints, and MLOps.
For unstructured data and RAG, plan the chunking strategy, metadata, embeddings, access controls, and retrieval-quality validation.
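
The chunking and metadata step for unstructured text might look like the sketch below. It assumes simple character-based chunks with a fixed overlap; the function name chunk_text, the field names, and the sizes are hypothetical, and production pipelines often split on sentences or tokens instead.

```python
def chunk_text(text, doc_id, access_tag, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks with governance metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "doc_id": doc_id,          # lineage back to the source document
            "chunk_id": len(chunks),   # position of the chunk in the document
            "access_tag": access_tag,  # carried along so retrieval can filter
            "text": text[start:start + chunk_size],
        })
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny example so the overlap is easy to see: each chunk repeats one character.
chunks = chunk_text("abcdefghij", "doc-1", "public", chunk_size=4, overlap=1)
print([c["text"] for c in chunks])
```

Keeping doc_id and access_tag on every chunk is what lets the retrieval layer enforce the same access controls as the source data.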

Traps

Solving every slow dashboard with more slots before optimizing table design and query patterns.
Exposing raw datasets when curated views or shared listings are safer.
Using BigQuery ML for complex custom ML lifecycle requirements better served by Vertex AI.
Forgetting PII masking and column-level controls in analytics environments.
Using CSV exports as the primary sharing mechanism when Analytics Hub or BigQuery sharing is better.

Domain 5: Maintaining and automating data workloads

Concepts

This domain tests operational excellence: automation, cost, monitoring, troubleshooting, capacity, fault tolerance, and repeatability.

Key ideas:

Use Cloud Composer for DAG-based orchestration.
Use Dataform for repeatable SQL transformations in BigQuery.
Use Cloud Monitoring and Cloud Logging for observability.
Use BigQuery admin tools, job history, INFORMATION_SCHEMA, audit logs, reservations, and slot metrics for BigQuery troubleshooting.
Use reservations and Editions for predictable BigQuery capacity management.
Use partitioning, clustering, query optimization, and lifecycle policies before blindly scaling resources.
Use retries, idempotency, checkpoints, dead-letter topics, and alerting for failure management.

Services

Operational need	Service or feature
DAG scheduling and dependencies	Cloud Composer
SQL transformation dependencies	Dataform
API/service orchestration	Workflows
Pipeline metrics and alerts	Cloud Monitoring
Logs and error analysis	Cloud Logging
BigQuery troubleshooting	BigQuery admin tools, job history, INFORMATION_SCHEMA, audit logs, slot metrics
Capacity management	BigQuery Editions, reservations, slots
Cost controls	Budgets, labels, partitioning, clustering, lifecycle policies, reservations
Fault tolerance	Retries, checkpoints, dead-letter topics, idempotent writes, regional design

Patterns

Reliable DAG pattern

Keep DAG tasks small and idempotent.
Use retries with backoff.
Store secrets in Secret Manager, not in code.
Use service accounts with least privilege.
Monitor SLA misses, retries, and failures.
Do not run large transformations inside the scheduler process.
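
The retry-with-backoff idea can be sketched with the standard library alone. This is not Airflow code: run_with_retries and flaky_load are hypothetical names, and in Cloud Composer the same behavior is normally configured on the task (for example through its retries and retry_delay settings) rather than hand-written.

```python
import time


def run_with_retries(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run task(); after a failure wait base_delay * 2**(attempt - 1) seconds.

    The last failure is re-raised so the scheduler can mark the task failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


# Hypothetical task that fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_load():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"


# Record the delays instead of sleeping so the example runs instantly.
delays = []
result = run_with_retries(flaky_load, sleep=delays.append)
print(result, delays)
```

Retries are only safe when the task is idempotent, which is why the two rules sit next to each other in the pattern.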

BigQuery cost pattern

Partition large tables by date or ingestion time.
Cluster when filters repeatedly use specific columns.
Avoid SELECT *.
Use dry runs and query estimates.
Use materialized views or summary tables for repeated calculations.
Use reservations for predictable capacity or workload isolation.
Use labels for chargeback and monitoring.
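
Two of these steps, capping a query by its estimated bytes and using labels for chargeback, can be sketched as follows. The job records, the team label, and the price of 5.0 USD per TiB are placeholders, not real pricing or real job metadata; actual figures come from a dry run, job history, and the current price list.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_cost_usd(bytes_processed, price_per_tib_usd):
    """Convert processed bytes to an on-demand style cost estimate."""
    return bytes_processed / TIB * price_per_tib_usd


def within_byte_cap(estimated_bytes, max_bytes):
    """Mimic a maximum-bytes guard: refuse queries above the cap."""
    return estimated_bytes <= max_bytes


def chargeback_by_label(jobs, price_per_tib_usd):
    """Sum processed bytes per team label and price each total."""
    totals = {}
    for job in jobs:
        team = job["labels"].get("team", "unlabeled")
        totals[team] = totals.get(team, 0) + job["bytes_processed"]
    return {
        team: estimate_cost_usd(total, price_per_tib_usd)
        for team, total in totals.items()
    }


# Hypothetical job history; the third job has no team label.
JOBS = [
    {"labels": {"team": "marketing"}, "bytes_processed": 2 * TIB},
    {"labels": {"team": "finance"}, "bytes_processed": TIB // 2},
    {"labels": {}, "bytes_processed": TIB},
]

costs = chargeback_by_label(JOBS, price_per_tib_usd=5.0)  # placeholder price
allowed = within_byte_cap(estimated_bytes=3 * TIB, max_bytes=TIB)
print(costs, allowed)
```

The "unlabeled" bucket is the useful part in practice: it shows how much spend cannot be attributed until labels are enforced.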

Failure recovery pattern

Detect with Cloud Monitoring and Logging.
Quarantine bad data.
Use idempotent reprocessing.
Use checkpoints or replayable sources.
Validate output completeness and quality.
Alert owners and maintain runbooks.
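
The quarantine and idempotent reprocessing steps can be sketched together. This is a simplified in-memory model: process_batch, the dictionary sink, and the list used as a dead-letter store are hypothetical stand-ins for a keyed table write and a dead-letter topic or quarantine table.

```python
def process_batch(records, sink, dead_letter):
    """Validate records, quarantine bad ones, upsert good ones by event_id."""
    for record in records:
        amount = record.get("amount")
        if "event_id" not in record or not isinstance(amount, (int, float)):
            dead_letter.append(record)  # quarantine instead of failing the batch
            continue
        # Upsert keyed by event_id: replaying the same record overwrites
        # the same key, so the result does not change.
        sink[record["event_id"]] = amount


# Hypothetical batch with one malformed record.
batch = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": "oops"},
    {"event_id": "e3", "amount": 5},
]

sink, dead_letter = {}, []
process_batch(batch, sink, dead_letter)
first_pass = dict(sink)

# Replay the whole batch, as a restart from a replayable source would.
process_batch(batch, sink, [])

print(sink == first_pass, sum(sink.values()), dead_letter)
```

An append-only sink would have doubled the total after the replay, which is the failure mode that idempotent writes prevent.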

Traps

Using cron on a VM instead of Cloud Composer or managed scheduling for critical pipelines.
Scaling BigQuery slots without checking partition pruning, clustering, and query design.
Using persistent Dataproc clusters for infrequent jobs when ephemeral clusters reduce cost.
Ignoring quotas and billing alerts.
Not designing pipelines to restart safely.
Monitoring only infrastructure metrics while ignoring data quality and business-level pipeline metrics.

5. Service Selection Guide

BigQuery vs Cloud Storage vs BigLake

Need	Choose	Why
SQL analytics over structured data	BigQuery	Serverless warehouse optimized for analytical SQL
Raw files and low-cost durable object storage	Cloud Storage	Best landing zone and data lake object store
Governed analytics over lake data	BigLake	Provides lakehouse-style governance and BigQuery integration
BI dashboards with interactive SQL	BigQuery + BI Engine	Optimized for analytics and BI acceleration
Archive historical files cheaply	Cloud Storage Nearline, Coldline, or Archive classes with lifecycle rules	Cost-effective long-term object storage

Dataflow vs Dataproc vs Cloud Data Fusion vs Dataform

Need	Choose	Do not choose
New batch/stream processing with managed Beam	Dataflow	Dataproc, unless Spark/Hadoop compatibility is required
Existing Spark/Hadoop/Hive jobs	Dataproc	Dataflow, if rewriting would add risk
Visual ETL with connectors and low-code development	Cloud Data Fusion	Hand-coded Dataflow, unless custom code is required
SQL transformations in BigQuery	Dataform	Composer DAG code for SQL dependency modeling
Heavy transformation logic triggered by a scheduler	Dataflow/Dataproc/BigQuery/Dataform	Cloud Composer alone (it should orchestrate, not process)
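
The table above can be read as a short decision rule. The sketch below is one possible encoding: the function name, the flag names, and especially the order in which the conditions are checked are assumptions made for illustration, not an official selection algorithm.

```python
def choose_processing_service(
    existing_spark_or_hadoop=False,
    sql_only_in_bigquery=False,
    wants_low_code=False,
):
    """Pick a processing service from the requirement flags in the table."""
    if existing_spark_or_hadoop:
        return "Dataproc"           # reuse existing jobs instead of rewriting
    if sql_only_in_bigquery:
        return "Dataform"           # SQL transformations with dependencies
    if wants_low_code:
        return "Cloud Data Fusion"  # visual ETL with connectors
    return "Dataflow"               # default for new batch/stream pipelines


print(choose_processing_service())
print(choose_processing_service(existing_spark_or_hadoop=True))
print(choose_processing_service(sql_only_in_bigquery=True))
```

Cloud Composer is deliberately absent from the return values: it schedules these services and does not replace them.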

Pub/Sub vs Cloud Composer vs Workflows

Need	Choose	Why
Event ingestion and decoupling	Pub/Sub	Messaging backbone for asynchronous events
Scheduled DAGs with dependencies	Cloud Composer	Managed Apache Airflow
Lightweight API orchestration	Workflows	Serverless orchestration without Airflow overhead
Streaming transformation	Dataflow reading Pub/Sub	Pub/Sub alone does not transform

Cloud SQL vs AlloyDB vs Spanner

Need	Choose	Why
Traditional managed relational database	Cloud SQL	Simple managed MySQL/PostgreSQL/SQL Server
High-performance PostgreSQL-compatible workload	AlloyDB	Better performance and availability for PostgreSQL-compatible apps
Globally distributed relational database with strong consistency	Spanner	Horizontal scale, global availability, strong consistency
Analytical warehouse	BigQuery	Operational databases are not optimized for petabyte analytics

Bigtable vs Firestore vs Memorystore

Need	Choose	Why
Massive key-value/wide-column reads/writes, time-series, IoT	Bigtable	Low-latency high-throughput NoSQL at scale
Mobile/web document database	Firestore	Serverless document model and app synchronization patterns
Cache/session store	Memorystore	Managed Redis/Memcached low-latency cache
SQL analytics	BigQuery	NoSQL/caches are wrong for warehouse analytics

BigQuery security features

Requirement	Feature
Hide sensitive columns	Policy tags, column-level security, masking
Restrict rows by user/tenant/region	Row-level security
Share only a curated subset	Authorized views
Detect/classify sensitive data	Cloud DLP / Sensitive Data Protection
Control encryption keys	Cloud KMS / CMEK
Audit usage	Cloud Audit Logs, BigQuery job history

BigQuery performance and cost features

Requirement	Feature
Reduce scanned data by date	Partitioning
Improve repeated filters/joins	Clustering
Accelerate repeated aggregations	Materialized views or summary tables
Improve dashboard performance	BI Engine
Predictable capacity and isolation	Reservations, slots, BigQuery Editions
Avoid accidental large scans	Dry runs, maximum bytes billed, query review

6. Architecture Patterns

Pattern 1: Real-time analytics pipeline

Scenario: App events must appear in dashboards within seconds or minutes.

Recommended solution: Pub/Sub → Dataflow → BigQuery → Looker/BI tool.

Why: Pub/Sub handles ingestion and decoupling. Dataflow handles streaming transformation, windows, late data, and enrichment. BigQuery serves analytics.
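
What "windows and late data" means in the Dataflow step can be sketched without the Beam SDK. This is a deliberately simplified model, not Beam's API: it assumes fixed windows, takes the highest event time seen so far as the watermark, and drops an event once its window end plus the allowed lateness has passed. Real runners estimate watermarks from the source and offer triggers that this sketch ignores.

```python
from collections import defaultdict


def count_by_window(event_times, window_size=60, allowed_lateness=30):
    """Count events per fixed event-time window, dropping events that are too late.

    event_times: event timestamps in seconds, listed in ARRIVAL order.
    Returns (counts keyed by window start, list of dropped event times).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")  # naive heuristic: max event time seen so far
    for event_time in event_times:
        start = event_time - event_time % window_size
        window_end = start + window_size
        if window_end + allowed_lateness <= watermark:
            dropped.append(event_time)  # window already closed for good
        else:
            counts[start] += 1          # on time, or late but still accepted
        watermark = max(watermark, event_time)
    return dict(counts), dropped


# Hypothetical arrival order: 58 arrives late but within the allowed lateness,
# 59 arrives after the watermark has moved far past its window.
arrivals = [5, 42, 61, 58, 130, 59, 125]
counts, dropped = count_by_window(arrivals)
print(counts, dropped)
```

The two late events belong to the same window but meet different fates, which is the distinction exam scenarios usually probe when they mention out-of-order mobile or IoT events.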