GCP Professional Data Engineer Exam Questions and Answers 2026 : Cert-Pass Blog

Official source note

gcp professional data engineer exam questions is the main focus of this page, and the safest way to study it is to keep the exam hub open while you work through the official facts and the service selection patterns. Google describes GCP Professional Data Engineer as a certification that validates practical cloud literacy, service selection, and scenario thinking. The main Cert Pass hub remains /exams/google-gcp-professional-data-engineer.

Exam facts

Exam name: GCP Professional Data Engineer
Exam slug: google-gcp-professional-data-engineer
Vendor: Google
Cert Pass landing page: /exams/google-gcp-professional-data-engineer
Study hub: /exams/google-gcp-professional-data-engineer
Official vendor page: Google Cloud Professional Data Engineer

Why this article exists

The goal here is not to collect trivia. The goal is to build the habit of reading a scenario, identifying the category, and choosing the simplest service that directly fits the requirement.

Fast study map

Use the exam hub twice during review: /exams/google-gcp-professional-data-engineer and /exams/google-gcp-professional-data-engineer. Those internal links should act as the stable anchor for practice, revision, and final review.

GCP Professional Data Engineer Exam Questions and Answers 2026

This guide presents exam style scenarios for the GCP Professional Data Engineer certification. The goal is not to memorize a question bank. The goal is to train pattern recognition so the right Google Cloud service becomes obvious when a workload description appears under time pressure.

Each scenario below emphasizes a common exam decision point: ingestion, processing, storage, governance, automation, or analytics. The correct answer is not only a service name. It is the reason that service fits the requirement better than the distractors.

Exam Facts

Detail	Value
Exam	GCP Professional Data Engineer
Exam code	google-gcp-professional-data-engineer
Vendor	Google
Questions	50
Time limit	90 minutes
Passing score	70 percent
Retirement date	None published
Replacement exam	None published

Domain Breakdown

Domain	Weight	What to focus on
Ingesting and processing the data	25.0 percent	Pub/Sub, Dataflow, event time, scaling, and streaming design
Designing data processing systems	22.0 percent	Architecture fit, migration, managed services, and tradeoffs
Storing the data	20.0 percent	BigQuery, Bigtable, Cloud Storage, Firestore, and Cloud SQL choices
Maintaining and automating data workloads	18.0 percent	Scheduling, retries, reliability, monitoring, and orchestration
Preparing and using data for analysis	15.0 percent	Warehouse design, access controls, reporting, and consumption patterns

How to use these questions

Read each scenario once without stopping. Then identify the workload shape, the service boundary, and the operational constraint. The correct choice usually becomes clear when the question is translated into plain language.

If a prompt mentions a continuous event stream, look first at Pub/Sub and Dataflow. If a prompt mentions SQL analytics, look first at BigQuery. If a prompt mentions very low latency reads at huge scale, look at Bigtable or Firestore instead of the warehouse. If a prompt mentions scheduled pipelines with dependencies, look at Cloud Composer or another orchestration pattern.

Question 1

A retail platform needs to ingest purchase events from many applications, buffer traffic spikes, and keep the ingestion layer separate from the processing layer. Which service is the best first choice?

Answer: Pub/Sub.

Pub/Sub is the correct ingestion layer because it decouples producers from consumers and can absorb bursts without forcing each producer to wait for downstream processing. Dataflow may process the events later, but the question is specifically about reliable ingestion and buffering. Cloud Storage is not a message bus, and Cloud SQL is not built for this pattern.

Question 2

A fraud detection pipeline must read streaming events, enrich them with reference data, handle out of order arrivals, and write curated results for analytics. Which service best performs the processing step?

Answer: Dataflow.

Dataflow is the managed Apache Beam service for batch and streaming processing. It supports event time logic, windows, triggers, allowed lateness, and parallel scaling. Pub/Sub only ingests. BigQuery only stores or analyzes. Dataproc could run Spark, but the managed streaming pattern expected on the exam is usually Dataflow.

Question 3

A company stores three years of transaction data in BigQuery. Most queries filter by transaction date, and the current design scans too much data. What change should be made first?

Answer: Partition the table by date.

Partitioning reduces scanned bytes when queries filter on the partition column. Clustering can also help, but partitioning is the primary answer when a time filter is mentioned. Rewriting the query or adding a view does not solve the underlying scan problem as directly as partitioning does.

Question 4

A marketing team runs the same aggregation query many times per day on the same BigQuery data. The query is expensive and rarely changes in structure. What should be used?

Answer: Materialized views.

Materialized views are designed for repeated, predictable aggregations. They store precomputed results and can reduce query cost and latency. Partitioning helps with scanned data, but the question describes repeated aggregation logic, which is the key signal for materialized views.

Question 5

A dashboard in Looker queries BigQuery and takes several seconds per tile. The business needs much faster interactive response time on repeated queries. What should be enabled?

Answer: BI Engine.

BI Engine is the in memory acceleration layer for BigQuery. It is designed for fast dashboard and reporting performance. A materialized view might help, but BI Engine is the most direct match when the prompt focuses on sub second or very fast interactive dashboards.

Question 6

An application needs a document database for user profiles, nested attributes, and flexible schema changes. Which Google Cloud service fits best?

Answer: Firestore.

Firestore is the better choice for document oriented application data. Cloud SQL is relational and is not the best answer when the prompt emphasizes flexible documents. BigQuery is analytical rather than transactional for application profiles. Bigtable is a wide column database, not a document database.

Question 7

A global analytics platform needs a high throughput key value store for very low latency lookups, such as serving personalized recommendations. Which service best fits?

Answer: Bigtable.

Bigtable is the best match when the scenario calls for low latency, massive scale, and wide column access patterns. Firestore is more document oriented. Cloud SQL is not designed for the same scale of read serving. BigQuery is for analysis, not low latency operational serving.

Question 8

A data team wants a daily workflow that extracts data, runs validation, transforms it, and publishes a curated table. The workflow must support retries and clear task dependencies. What orchestration option is best?

Answer: Cloud Composer.

Cloud Composer, which is managed Airflow, is the standard orchestration choice for multi step workflows with dependencies, retries, and monitoring. The exam favors explicit DAG based orchestration when the problem is scheduling and control, not just transformation.

Question 9

A team needs to track prediction drift and operationalize a machine learning model built from warehouse data. Which service should be used for the model lifecycle?

Answer: Vertex AI.

Vertex AI handles model training, deployment, prediction, and monitoring. The prompt is not just about building a model. It is about operationalizing it. BigQuery ML can be useful when the model stays inside SQL, but Vertex AI is the broader answer when model lifecycle management is central.

Question 10

A data engineer needs to protect sensitive columns in a warehouse while allowing analysts to query the rest of the table. The access policy should be centralized and maintainable. What is the best approach?

Answer: Use policy tags or an authorized view depending on the access pattern.

Policy tags are the best answer when column level control is the priority. Authorized views are useful when a controlled subset of the data should be exposed through a separate logical layer. Row level security is correct only when the requirement is filtering rows by user or group. The exam often tests which of these access models fits the scenario.

What these questions teach

The best exam answers usually follow a repeatable logic.

First, identify whether the question is about ingestion, transformation, storage, analysis, governance, or automation.

Second, choose the managed Google Cloud service that naturally fits that layer.

Third, add the reliability or access control feature that the scenario requires.

This order keeps the answer grounded and prevents over engineering. A question that needs ingestion does not need a warehouse answer. A question that needs a warehouse does not need a VM answer. A question that needs orchestration does not need a manual script.

Common wrong answer patterns

The exam frequently uses distractors that are technically related but operationally wrong. Cloud SQL appears in places where BigQuery or Bigtable is a better fit. Dataproc appears where Dataflow is the better managed service. Manual scripts appear where orchestration or managed transfer services are needed. BigQuery appears in places where Firestore or Bigtable is a better low latency serving choice.

Another common trap is ignoring the phrase that changes the answer. Event time, late data, repeated aggregations, and low latency serving all matter. The prompt often includes the clue that separates the best answer from the merely plausible one.

Study recommendation

Practice by rewriting each scenario in plain language. Ask what the system must ingest, what it must transform, what it must store, and what the consumer expects. Then match the answer to the service boundary. That habit is often more effective than reading long service descriptions in isolation.

For a structured study path and practice coverage, start at the GCP Professional Data Engineer landing page.

Extended official revision notes

Google Cloud Professional Data Engineer Exam Course

1. Exam Overview

What the exam is testing

The Google Cloud Professional Data Engineer exam validates whether you can design, build, operationalize, secure, monitor, optimize, and troubleshoot data processing systems on Google Cloud. The exam is not mainly a memorization test. It tests whether you can read a business scenario, identify the real constraint, eliminate tempting but wrong services, and choose the most managed, secure, reliable, scalable, and cost-effective Google Cloud architecture.

The current official standard exam guide organizes the exam into five domains:

Designing data processing systems
Ingesting and processing the data
Storing the data
Preparing and using data for analysis
Maintaining and automating data workloads

The standard exam is 2 hours, contains 40-50 multiple-choice and multiple-select questions, and is available in English and Japanese. The certification is valid for 2 years. Google also offers a shorter renewal exam for active certificate holders.

How to think like the exam

In most questions, the correct answer is the option that best balances the following priorities:

Meet the business requirement first. Do not optimize cost if the scenario says the system is mission critical and needs low latency or high availability.
Use managed services when possible. Prefer BigQuery, Dataflow, Pub/Sub, Dataplex, Cloud Composer, Cloud Data Fusion, or managed databases over self-managed infrastructure unless the scenario explicitly requires custom frameworks or legacy compatibility.
Use least privilege and governance by design. Correct answers often include IAM, service accounts, policy tags, authorized views, row-level security, Cloud KMS, VPC Service Controls, Dataplex, and audit logging.
Select the service by access pattern. Storage questions are usually about reads, writes, latency, consistency, scale, relational requirements, and analytics requirements.
Avoid operational burden. If two options work, the exam usually favors the one with less manual administration and fewer custom scripts.
Watch the words. Terms such as streaming, near real time, ACID, global consistency, time-series, low latency, petabyte analytics, serverless, batch, orchestration, CDC, data residency, and PII usually point to specific services.

How to use this course

Use this file as a compressed revision guide. Start with the exam domains, then study the service-selection tables, then practice the architecture patterns and traps. The question bank behind this course repeatedly emphasizes BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, and monitoring/optimization patterns. Those are the highest-yield services and decisions for this exam.

2. Exam Domains

Domain	Official Weight	Priority	What matters most
Designing data processing systems	~22%	Very high	Security, compliance, governance, reliability, portability, migrations, project/dataset/table architecture
Ingesting and processing the data	~25%	Highest	Dataflow, Pub/Sub, Beam, Dataproc, Data Fusion, pipeline design, batch vs streaming, orchestration, CI/CD
Storing the data	~20%	Very high	BigQuery, Cloud Storage, BigLake, Bigtable, Spanner, Cloud SQL, Firestore, Memorystore, data lake/warehouse design
Preparing and using data for analysis	~15%	Medium	BI patterns, BigQuery performance, BigQuery ML, data sharing, Analytics Hub, masking, DLP, reports
Maintaining and automating data workloads	~18%	High	Cost, reservations, Composer DAGs, monitoring, troubleshooting, fault tolerance, quotas, restarts, replication

Priority notes

The highest-value preparation sequence is:

Ingesting and processing because it is the largest domain and appears in many scenario questions.
Designing systems because it drives architecture, security, compliance, and migration decisions.
Storing data because service selection is one of the most common exam traps.
Maintaining and automating because the exam frequently asks how to reduce cost, automate pipelines, monitor failures, and recover safely.
Preparing and using data for analysis because it often appears as BigQuery performance, BI, sharing, masking, and ML-readiness scenarios.

3. Start-to-Finish Study Path

Phase 1: Foundation

Learn the core Google Cloud data platform map:

Cloud Storage: durable object storage, landing zones, data lakes, lifecycle policies, raw files.
BigQuery: serverless data warehouse for analytics, SQL, partitioning, clustering, BI, sharing, ML.
Pub/Sub: asynchronous messaging and streaming ingestion backbone.
Dataflow: Apache Beam-based managed batch and streaming processing.
Dataproc: managed Spark/Hadoop for existing Spark, Hadoop, Hive, or custom ecosystem workloads.
Cloud Composer: managed Apache Airflow for DAG orchestration.
Dataform: SQL-based transformations and dependency management in BigQuery.
Cloud Data Fusion: visual/low-code ETL and integration.
Dataplex: data lake governance, cataloging, discovery, zones, policy-based governance.
Cloud SQL / AlloyDB / Spanner / Bigtable / Firestore / Memorystore: operational databases selected by workload pattern.

Foundation goal: when you see a scenario, quickly map it to the correct managed service.

Phase 2: Intermediate

Study decision patterns:

Batch vs streaming.
Data warehouse vs data lake vs lakehouse.
Relational vs NoSQL vs object storage.
OLTP vs OLAP.
Low latency serving vs analytical scanning.
Event ingestion vs workflow orchestration.
Transformation vs orchestration.
Governance enforcement vs naming conventions.
Regional data residency vs global access.
Persistent clusters vs ephemeral job clusters.

Intermediate goal: eliminate wrong answers based on access pattern, operational burden, security, and cost.

Phase 3: Advanced

Focus on architecture and tradeoffs:

End-to-end ingestion: source → Pub/Sub/Storage/Datastream → Dataflow/Data Fusion/Dataproc → BigQuery/BigLake/Cloud Storage.
Change data capture: Datastream into Cloud Storage or BigQuery-oriented pipelines.
Real-time analytics: Pub/Sub + Dataflow + BigQuery.
Historical migration: Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance.
Data governance: Dataplex + IAM + policy tags + row-level security + DLP + audit logs.
Cost and performance: partitioning, clustering, materialized views, BI Engine, reservations, query optimization.
Reliability: idempotent pipelines, dead-letter topics, retries, checkpoints, validation, monitoring, multi-region strategy.

Advanced goal: choose the best architecture under multiple constraints.

Phase 4: Final Review

During the last week:

Memorize service-selection rules.
Review BigQuery partitioning, clustering, authorized views, policy tags, BI Engine, reservations, and materialized views.
Review Dataflow streaming concepts: windows, triggers, late data, watermarks, exactly-once-style processing behavior, dead-letter handling, and idempotent sinks.
Review storage selection: BigQuery vs Bigtable vs Spanner vs Cloud SQL vs Firestore vs Cloud Storage.
Review security patterns: least privilege, service accounts, CMEK, DLP, row/column security, audit logging, VPC Service Controls.
Review operational patterns: Cloud Monitoring, Cloud Logging, Composer DAG retries, quotas, BigQuery job history, cost controls.

4. Core Concepts by Domain

Domain 1: Designing data processing systems

Concepts

This domain tests whether you can design secure, compliant, reliable, flexible, portable, and migration-ready data systems.

Key ideas:

Map business requirements to architecture before choosing tools.
Separate environments such as development, test, and production using projects, datasets, service accounts, and IAM boundaries.
Enforce least privilege at the narrowest practical level.
Govern sensitive data using policy tags, row-level security, authorized views, Cloud DLP, and Cloud KMS.
Design for data residency by choosing correct BigQuery dataset locations, Cloud Storage bucket locations, and regional services.
Build validation and reconciliation into migration plans.
Prefer managed, repeatable, auditable patterns over manual scripts.

Services

Requirement	Service or feature to think about
Fine-grained access in BigQuery	IAM, authorized views, row-level security, column-level security, policy tags
PII discovery or masking	Cloud DLP / Sensitive Data Protection, BigQuery masking policies
Encryption key control	Cloud KMS, CMEK
Governance and catalog	Dataplex, Dataplex Catalog
Data residency	Region-specific datasets and buckets
Bulk data transfer	Storage Transfer Service, Transfer Appliance
Database migration	Database Migration Service
CDC from databases	Datastream
Warehouse migration	BigQuery Data Transfer Service, staged loads, validation queries

Patterns

Secure analytics pattern

Store raw data in restricted datasets.
Apply least privilege IAM.
Use policy tags for sensitive columns.
Use row-level security for business unit or geography restrictions.
Expose curated datasets or authorized views to analysts.
Audit access with Cloud Audit Logs and BigQuery job history.

Data residency pattern

Keep raw data in the required region.
Avoid copying sensitive data across regions unless explicitly allowed.
Share only aggregated, anonymized, or policy-approved data when global reporting is needed.
Validate that BigQuery dataset location, Cloud Storage bucket location, and processing region align.

Migration pattern

Analyze current state and stakeholder requirements.
Choose migration tool based on source and volume.
Perform staged loads.
Reconcile row counts, checksums, and business aggregates.
Run parallel validation before cutover.
Switch consumers only after validation passes.

Traps

Choosing naming conventions instead of IAM or policy enforcement.
Granting BigQuery Admin to analysts for convenience.
Copying regulated data to another region just to simplify reporting.
Migrating with a one-time copy and no validation.
Using Pub/Sub for historical bulk migration when Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, or Transfer Appliance fits better.
Building custom governance scripts instead of using Dataplex, IAM, policy tags, DLP, and audit logs.

Domain 2: Ingesting and processing the data

Concepts

This is the largest exam domain. It tests whether you can plan, build, deploy, and operationalize batch and streaming pipelines.

Key ideas:

Pub/Sub ingests events. It is not a transformation engine or scheduler.
Dataflow transforms batch and streaming data. It is the default managed service for Apache Beam pipelines.
Dataproc runs Spark/Hadoop workloads. Use it for existing Spark/Hadoop/Hive ecosystems or custom cluster-level dependencies.
Cloud Data Fusion provides visual ETL. Use it when low-code integration and connectors matter.
Cloud Composer orchestrates workflows. It coordinates jobs; it should not perform heavy transformations inside DAG code.
Dataform manages SQL transformations in BigQuery. Use it for SQL modeling, dependencies, testing, and repeatable BigQuery transformations.

Services

Scenario	Best fit	Why
Streaming events from apps or devices	Pub/Sub	Decoupled, durable event ingestion
Real-time transformation and enrichment	Dataflow	Managed Beam streaming with windows, triggers, state, late data handling
Existing Spark jobs with custom libraries	Dataproc	Managed Spark/Hadoop compatibility
Low-code ETL with connectors	Cloud Data Fusion	Visual pipeline design and integration
SQL transformations in BigQuery	Dataform	Versioned SQL workflows and dependency management
Scheduling complex multi-step workflows	Cloud Composer	Airflow DAG orchestration
Lightweight service orchestration	Workflows	Serverless orchestration of APIs and services
CDC from operational databases	Datastream	Change streams for replication and analytics ingestion
Kafka integration	Pub/Sub Kafka connector or managed integration pattern	Avoid self-managing unless required

Patterns

Streaming analytics pattern

Producers publish events to Pub/Sub.
Dataflow reads Pub/Sub messages.
Dataflow validates, enriches, windows, and handles late data.
Invalid records go to a dead-letter topic or error table.
Clean results are written to BigQuery or Bigtable depending on access pattern.
Monitoring alerts on backlog, errors, latency, and failed workers.

Batch file ingestion pattern

Files land in Cloud Storage.
Cloud Composer or Eventarc triggers processing.
Dataflow, Dataproc, Data Fusion, or BigQuery loads transform the data.
BigQuery stores curated analytics tables.
Dataform manages SQL models and tests.

CDC analytics pattern

Datastream captures database changes.
Changes land in Cloud Storage or feed downstream pipelines.
Dataflow or BigQuery transformations merge updates into curated tables.
Validate latency, ordering, duplicates, and schema evolution.

Traps

Using Cloud Composer as the data processing engine instead of orchestrator.
Using Pub/Sub as a database or long-term store.
Choosing Dataproc for new simple serverless pipelines when Dataflow is more managed.
Choosing Dataflow for a legacy Spark job that must preserve Spark APIs and dependencies; Dataproc is usually better.
Ignoring late-arriving data in streaming questions.
Writing custom retry scripts instead of using managed retries, dead-letter topics, idempotent processing, and monitoring.
Loading malformed records directly into production tables instead of quarantine/error tables.

Domain 3: Storing the data

Concepts

This domain tests service selection and data platform design. Most storage questions are solved by identifying the access pattern.

Ask these questions:

Is it analytics or transactions?
Is it structured, semi-structured, unstructured, or file/object data?
Is the workload read-heavy, write-heavy, or mixed?
Is strong global consistency required?
Is millisecond key-value access required?
Is SQL relational modeling required?
Is horizontal scale more important than joins?
Is the primary access pattern large scans or point lookups?

Services

Service	Use when	Avoid when
BigQuery	Petabyte-scale SQL analytics, BI, ELT, warehouse, federated analytics	OLTP transactions, low-latency point updates, application serving database
Cloud Storage	Raw files, landing zone, data lake objects, backups, archives	Relational queries, high-frequency row updates, transactional workloads
BigLake	Governed lakehouse access over data lakes and BigQuery	Simple object storage without governance needs
Bigtable	Massive scale, low-latency key-value/wide-column, time-series, IoT, high write throughput	Ad hoc SQL analytics, joins, transactions across rows
Spanner	Globally scalable relational database with strong consistency and high availability	Simple single-region relational apps where Cloud SQL is enough
Cloud SQL	Managed MySQL/PostgreSQL/SQL Server for traditional relational apps	Global horizontal relational scale or massive analytics
AlloyDB	High-performance PostgreSQL-compatible operational workloads	Non-PostgreSQL workloads or analytical warehouse use cases
Firestore	Serverless document database for mobile/web apps with flexible documents	Analytical scans, relational joins, warehouse workloads
Memorystore	Managed Redis/Memcached caching, session state, low-latency cache	Durable source of truth or analytics
Dataplex	Governed data lake/platform management, cataloging, zones	Replacing storage or processing engines

Patterns

Warehouse pattern

Use BigQuery for curated analytical tables.
Partition by time or ingestion date for large time-based data.
Cluster by frequently filtered/joined columns.
Use materialized views or summary tables for repeated expensive aggregations.
Use authorized views, row-level security, and policy tags for controlled access.

Lakehouse pattern

Land raw data in Cloud Storage.
Govern discovery and access with Dataplex and BigLake.
Process raw to curated zones using Dataflow, Dataproc, Data Fusion, or BigQuery.
Serve analytics in BigQuery.

Operational serving pattern

Use Cloud SQL or AlloyDB for traditional relational application databases.
Use Spanner for global relational scale with strong consistency.
Use Bigtable for extremely high-throughput key-value/time-series workloads.
Use Firestore for serverless document-based mobile/web apps.
Use Memorystore for caching, not durable storage.

Traps

Choosing BigQuery for user-facing low-latency transactional workloads.
Choosing Cloud Storage when users need SQL analytics and BI without defining BigQuery or BigLake access.
Choosing Cloud SQL for global scale and multi-region strong consistency when Spanner is the better fit.
Choosing Bigtable when the question requires joins, SQL, or multi-row transactions.
Choosing Firestore for analytical reporting.
Treating Memorystore as durable storage.
Ignoring lifecycle policies and storage class cost optimization for Cloud Storage.

Domain 4: Preparing and using data for analysis

Concepts

This domain tests whether you can prepare data for BI, ML, sharing, visualization, and secure analysis.

Key ideas:

BigQuery is central for analytical preparation.
Performance tuning often involves partitioning, clustering, pruning, query rewrite, materialized views, BI Engine, and avoiding SELECT *.
Security for analysis often uses row-level security, column-level security, policy tags, authorized views, masking, IAM, and DLP.
BigQuery ML is useful when the model can be trained and used directly in BigQuery with SQL.
Vertex AI is more appropriate for advanced custom ML workflows, feature stores, training pipelines, deployment, and MLOps.
Analytics Hub is used for controlled data sharing and publishing datasets.

Services

Requirement	Best fit
Fast BI dashboard over BigQuery	BI Engine, materialized views, aggregated tables, partitioning/clustering
Repeated expensive aggregations	Materialized views or scheduled summary tables
Controlled dataset sharing	Analytics Hub, authorized views, dataset access controls
Mask or classify PII	Cloud DLP / Sensitive Data Protection, policy tags, masking
SQL-based ML directly on warehouse data	BigQuery ML
Advanced custom ML lifecycle	Vertex AI
Prepare unstructured text for RAG	Embeddings, vector search patterns, preprocessing pipelines, governed storage

Patterns

BI performance pattern

Partition large fact tables by date.
Cluster by high-cardinality filter or join columns where useful.
Precompute common aggregates.
Use materialized views for repeated deterministic aggregations.
Enable BI Engine for interactive dashboards.
Avoid scanning unnecessary columns and partitions.

Secure sharing pattern

Publish curated datasets rather than raw sensitive data.
Use Analytics Hub for managed sharing.
Use authorized views to expose limited views.
Use policy tags and masking for sensitive columns.
Use row-level security for tenant, region, or department filtering.

ML preparation pattern

Use BigQuery to clean, join, and prepare structured features.
Use BigQuery ML for SQL-native models and simple forecasting/classification/regression.
Use Vertex AI for custom training, feature management, pipelines, endpoints, and MLOps.
For unstructured data/RAG, prepare chunking, metadata, embeddings, access controls, and retrieval quality validation.

Traps

Solving every slow dashboard with more slots before optimizing table design and query patterns.
Exposing raw datasets when curated views or shared listings are safer.
Using BigQuery ML for complex custom ML lifecycle requirements better served by Vertex AI.
Forgetting PII masking and column-level controls in analytics environments.
Using CSV exports as the primary sharing mechanism when Analytics Hub or BigQuery sharing is better.

Domain 5: Maintaining and automating data workloads

Concepts

This domain tests operational excellence: automation, cost, monitoring, troubleshooting, capacity, fault tolerance, and repeatability.

Key ideas:

Use Cloud Composer for DAG-based orchestration.
Use Dataform for repeatable SQL transformations in BigQuery.
Use Cloud Monitoring and Cloud Logging for observability.
Use BigQuery admin tools, job history, INFORMATION_SCHEMA, audit logs, reservations, and slot metrics for BigQuery troubleshooting.
Use reservations and Editions for predictable BigQuery capacity management.
Use partitioning, clustering, query optimization, and lifecycle policies before blindly scaling resources.
Use retries, idempotency, checkpoints, dead-letter topics, and alerting for failure management.

Services

Operational need	Service or feature
DAG scheduling and dependencies	Cloud Composer
SQL transformation dependencies	Dataform
API/service orchestration	Workflows
Pipeline metrics and alerts	Cloud Monitoring
Logs and error analysis	Cloud Logging
BigQuery troubleshooting	BigQuery admin panel, job history, INFORMATION_SCHEMA, audit logs
Capacity management	BigQuery Editions, reservations, slots
Cost controls	Budgets, labels, partitioning, clustering, lifecycle policies, reservations
Fault tolerance	Retries, checkpoints, dead-letter queues, idempotent writes, regional design

Patterns

Reliable DAG pattern

Keep DAG tasks small and idempotent.
Use retries with backoff.
Store secrets in Secret Manager, not in code.
Use service accounts with least privilege.
Monitor SLA misses, retries, and failures.
Do not run large transformations inside the scheduler process.

BigQuery cost pattern

Partition large tables by date or ingestion time.
Cluster when filters repeatedly use specific columns.
Avoid SELECT *.
Use dry runs and query estimates.
Use materialized views or summary tables for repeated calculations.
Use reservations for predictable capacity or workload isolation.
Use labels for chargeback and monitoring.

Failure recovery pattern

Detect with Cloud Monitoring and Logging.
Quarantine bad data.
Use idempotent reprocessing.
Use checkpoints or replayable sources.
Validate output completeness and quality.
Alert owners and maintain runbooks.

Traps

Using cron on a VM instead of Cloud Composer or managed scheduling for critical pipelines.
Scaling BigQuery slots without checking partition pruning, clustering, and query design.
Using persistent Dataproc clusters for infrequent jobs when ephemeral clusters reduce cost.
Ignoring quotas and billing alerts.
Not designing pipelines to restart safely.
Monitoring only infrastructure metrics while ignoring data quality and business-level pipeline metrics.

5. Service Selection Guide

BigQuery vs Cloud Storage vs BigLake

Need	Choose	Why
SQL analytics over structured data	BigQuery	Serverless warehouse optimized for analytical SQL
Raw files and low-cost durable object storage	Cloud Storage	Best landing zone and data lake object store
Governed analytics over lake data	BigLake	Provides lakehouse-style governance and BigQuery integration
BI dashboards with interactive SQL	BigQuery + BI Engine	Optimized for analytics and BI acceleration
Archive historical files cheaply	Cloud Storage lifecycle classes	Cost-effective long-term object storage

Dataflow vs Dataproc vs Cloud Data Fusion vs Dataform

Need	Choose	Do not choose
New batch/stream processing with managed Beam	Dataflow	Dataproc, unless Spark/Hadoop compatibility is required
Existing Spark/Hadoop/Hive jobs	Dataproc	Dataflow, if rewriting would add risk
Visual ETL with connectors and low-code development	Cloud Data Fusion	Hand-coded Dataflow, unless custom code is required
SQL transformations in BigQuery	Dataform	Composer DAG code for SQL dependency modeling
Heavy transformation logic inside scheduler	Dataflow/Dataproc/BigQuery/Dataform	Cloud Composer alone

Pub/Sub vs Cloud Composer vs Workflows

Need	Choose	Why
Event ingestion and decoupling	Pub/Sub	Messaging backbone for asynchronous events
Scheduled DAGs with dependencies	Cloud Composer	Managed Apache Airflow
Lightweight API orchestration	Workflows	Serverless orchestration without Airflow overhead
Streaming transformation	Dataflow reading Pub/Sub	Pub/Sub alone does not transform

Cloud SQL vs AlloyDB vs Spanner

Need	Choose	Why
Traditional managed relational database	Cloud SQL	Simple managed MySQL/PostgreSQL/SQL Server
High-performance PostgreSQL-compatible workload	AlloyDB	Better performance and availability for PostgreSQL-compatible apps
Globally distributed relational database with strong consistency	Spanner	Horizontal scale, global availability, strong consistency
Analytical warehouse	BigQuery	Operational databases are not optimized for petabyte analytics

Bigtable vs Firestore vs Memorystore

Need	Choose	Why
Massive key-value/wide-column reads/writes, time-series, IoT	Bigtable	Low-latency high-throughput NoSQL at scale
Mobile/web document database	Firestore	Serverless document model and app synchronization patterns
Cache/session store	Memorystore	Managed Redis/Memcached low-latency cache
SQL analytics	BigQuery	NoSQL/caches are wrong for warehouse analytics

BigQuery security features

Requirement	Feature
Hide sensitive columns	Policy tags, column-level security, masking
Restrict rows by user/tenant/region	Row-level security
Share only a curated subset	Authorized views
Detect/classify sensitive data	Cloud DLP / Sensitive Data Protection
Control encryption keys	Cloud KMS / CMEK
Audit usage	Cloud Audit Logs, BigQuery job history

BigQuery performance and cost features

Requirement	Feature
Reduce scanned data by date	Partitioning
Improve repeated filters/joins	Clustering
Accelerate repeated aggregations	Materialized views or summary tables
Improve dashboard performance	BI Engine
Predictable capacity and isolation	Reservations, slots, BigQuery Editions
Avoid accidental large scans	Dry runs, maximum bytes billed, query review

6. Architecture Patterns

Pattern 1: Real-time analytics pipeline

Scenario: App events must appear in dashboards within seconds or minutes.

Recommended solution: Pub/Sub → Dataflow → BigQuery → Looker/BI tool.

Why: Pub/Sub handles ingestion and decoupling. Dataflow handles streaming transformation, windows, late data, and enrichment. BigQuery serves analytics.