GCP Professional Data Engineer Certification Course

Everything you need to pass, in one free course.

12 expert modules derived from 1100+ real exam questions. Covers every domain, exam trap, and scenario, organized by blueprint weight so you study what matters most.

About This Course

GCP Professional Data Engineer · 12 modules

This course covers every domain tested on the GCP Professional Data Engineer exam. It is based on our 1100+ real practice questions and was prepared by certification experts.

What you'll learn:

  • Every exam domain with detailed explanations
  • Common exam traps that catch unprepared candidates
  • Key concepts, syntax, and configurations
  • Real-world scenarios from actual exam questions
  • Quick-reference cheat sheets for last-minute review

1. Exam Overview

What the exam is testing

The Google Cloud Professional Data Engineer exam validates whether you can design, build, operationalize, secure, monitor, optimize, and troubleshoot data processing systems on Google Cloud. The exam is not mainly a memorization test. It tests whether you can read a business scenario, identify the real constraint, eliminate tempting but wrong services, and choose the most managed, secure, reliable, scalable, and cost-effective Google Cloud architecture.

The current official standard exam guide organizes the exam into five domains:

  1. Designing data processing systems
  2. Ingesting and processing the data
  3. Storing the data
  4. Preparing and using data for analysis
  5. Maintaining and automating data workloads

The standard exam is 2 hours, contains 40-50 multiple-choice and multiple-select questions, and is available in English and Japanese. The certification is valid for 2 years. Google also offers a shorter renewal exam for active certificate holders.

How to think like the exam

In most questions, the correct answer is the option that best balances the following priorities:

  1. Meet the business requirement first. Do not optimize cost if the scenario says the system is mission critical and needs low latency or high availability.
  2. Use managed services when possible. Prefer BigQuery, Dataflow, Pub/Sub, Dataplex, Cloud Composer, Cloud Data Fusion, or managed databases over self-managed infrastructure unless the scenario explicitly requires custom frameworks or legacy compatibility.
  3. Use least privilege and governance by design. Correct answers often include IAM, service accounts, policy tags, authorized views, row-level security, Cloud KMS, VPC Service Controls, Dataplex, and audit logging.
  4. Select the service by access pattern. Storage questions are usually about reads, writes, latency, consistency, scale, relational requirements, and analytics requirements.
  5. Avoid operational burden. If two options work, the exam usually favors the one with less manual administration and fewer custom scripts.
  6. Watch the words. Terms such as streaming, near real time, ACID, global consistency, time-series, low latency, petabyte analytics, serverless, batch, orchestration, CDC, data residency, and PII usually point to specific services.

How to use this course

Use this file as a compressed revision guide. Start with the exam domains, then study the service-selection tables, then practice the architecture patterns and traps. The question bank behind this course repeatedly emphasizes BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, and monitoring/optimization patterns. Those are the highest-yield services and decisions for this exam.


2. Exam Domains

| Domain | Official weight | Priority | What matters most |
| --- | --- | --- | --- |
| Designing data processing systems | ~22% | Very high | Security, compliance, governance, reliability, portability, migrations, project/dataset/table architecture |
| Ingesting and processing the data | ~25% | Highest | Dataflow, Pub/Sub, Beam, Dataproc, Data Fusion, pipeline design, batch vs streaming, orchestration, CI/CD |
| Storing the data | ~20% | Very high | BigQuery, Cloud Storage, BigLake, Bigtable, Spanner, Cloud SQL, Firestore, Memorystore, data lake/warehouse design |
| Preparing and using data for analysis | ~15% | Medium | BI patterns, BigQuery performance, BigQuery ML, data sharing, Analytics Hub, masking, DLP, reports |
| Maintaining and automating data workloads | ~18% | High | Cost, reservations, Composer DAGs, monitoring, troubleshooting, fault tolerance, quotas, restarts, replication |

Priority notes

The highest-value preparation sequence is:

  1. Ingesting and processing because it is the largest domain and appears in many scenario questions.
  2. Designing systems because it drives architecture, security, compliance, and migration decisions.
  3. Storing data because service selection is one of the most common exam traps.
  4. Maintaining and automating because the exam frequently asks how to reduce cost, automate pipelines, monitor failures, and recover safely.
  5. Preparing and using data for analysis because it often appears as BigQuery performance, BI, sharing, masking, and ML-readiness scenarios.
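
One simple way to turn the domain weights into a study plan is to split your available hours in proportion to them. The sketch below uses the approximate weights from the table above; the 60-hour budget and the function name are hypothetical, and you may want to shift hours toward weaker areas rather than follow the numbers strictly.

```python
# Approximate domain weights from the table above (percent of the exam).
DOMAIN_WEIGHTS = {
    "Designing data processing systems": 22,
    "Ingesting and processing the data": 25,
    "Storing the data": 20,
    "Preparing and using data for analysis": 15,
    "Maintaining and automating data workloads": 18,
}


def allocate_hours(total_hours, weights=DOMAIN_WEIGHTS):
    """Split a study budget across domains in proportion to exam weight."""
    total_weight = sum(weights.values())
    return {d: round(total_hours * w / total_weight, 1) for d, w in weights.items()}


plan = allocate_hours(60)  # 60 hours is a hypothetical budget
for domain, hours in sorted(plan.items(), key=lambda kv: -kv[1]):
    print(f"{hours:>5} h  {domain}")
```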

3. Start-to-Finish Study Path

Phase 1: Foundation

Learn the core Google Cloud data platform map:

  • Cloud Storage: durable object storage, landing zones, data lakes, lifecycle policies, raw files.
  • BigQuery: serverless data warehouse for analytics, SQL, partitioning, clustering, BI, sharing, ML.
  • Pub/Sub: asynchronous messaging and streaming ingestion backbone.
  • Dataflow: Apache Beam-based managed batch and streaming processing.
  • Dataproc: managed Spark/Hadoop for existing Spark, Hadoop, Hive, or custom ecosystem workloads.
  • Cloud Composer: managed Apache Airflow for DAG orchestration.
  • Dataform: SQL-based transformations and dependency management in BigQuery.
  • Cloud Data Fusion: visual/low-code ETL and integration.
  • Dataplex: data lake governance, cataloging, discovery, zones, policy-based governance.
  • Cloud SQL / AlloyDB / Spanner / Bigtable / Firestore / Memorystore: operational databases selected by workload pattern.

Foundation goal: when you see a scenario, quickly map it to the correct managed service.

Phase 2: Intermediate

Study decision patterns:

  • Batch vs streaming.
  • Data warehouse vs data lake vs lakehouse.
  • Relational vs NoSQL vs object storage.
  • OLTP vs OLAP.
  • Low latency serving vs analytical scanning.
  • Event ingestion vs workflow orchestration.
  • Transformation vs orchestration.
  • Governance enforcement vs naming conventions.
  • Regional data residency vs global access.
  • Persistent clusters vs ephemeral job clusters.

Intermediate goal: eliminate wrong answers based on access pattern, operational burden, security, and cost.

Phase 3: Advanced

Focus on architecture and tradeoffs:

  • End-to-end ingestion: source → Pub/Sub/Storage/Datastream → Dataflow/Data Fusion/Dataproc → BigQuery/BigLake/Cloud Storage.
  • Change data capture: Datastream into Cloud Storage or BigQuery, feeding downstream pipelines.
  • Real-time analytics: Pub/Sub + Dataflow + BigQuery.
  • Historical migration: Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance.
  • Data governance: Dataplex + IAM + policy tags + row-level security + DLP + audit logs.
  • Cost and performance: partitioning, clustering, materialized views, BI Engine, reservations, query optimization.
  • Reliability: idempotent pipelines, dead-letter topics, retries, checkpoints, validation, monitoring, multi-region strategy.

Advanced goal: choose the best architecture under multiple constraints.

Phase 4: Final Review

During the last week:

  • Memorize service-selection rules.
  • Review BigQuery partitioning, clustering, authorized views, policy tags, BI Engine, reservations, and materialized views.
  • Review Dataflow streaming concepts: windows, triggers, late data, watermarks, exactly-once-style processing behavior, dead-letter handling, and idempotent sinks.
  • Review storage selection: BigQuery vs Bigtable vs Spanner vs Cloud SQL vs Firestore vs Cloud Storage.
  • Review security patterns: least privilege, service accounts, CMEK, DLP, row/column security, audit logging, VPC Service Controls.
  • Review operational patterns: Cloud Monitoring, Cloud Logging, Composer DAG retries, quotas, BigQuery job history, cost controls.

4. Core Concepts by Domain

Domain 1: Designing data processing systems

Concepts

This domain tests whether you can design secure, compliant, reliable, flexible, portable, and migration-ready data systems.

Key ideas:

  • Map business requirements to architecture before choosing tools.
  • Separate environments such as development, test, and production using projects, datasets, service accounts, and IAM boundaries.
  • Enforce least privilege at the narrowest practical level.
  • Govern sensitive data using policy tags, row-level security, authorized views, Cloud DLP, and Cloud KMS.
  • Design for data residency by choosing correct BigQuery dataset locations, Cloud Storage bucket locations, and regional services.
  • Build validation and reconciliation into migration plans.
  • Prefer managed, repeatable, auditable patterns over manual scripts.

Services

| Requirement | Service or feature to think about |
| --- | --- |
| Fine-grained access in BigQuery | IAM, authorized views, row-level security, column-level security, policy tags |
| PII discovery or masking | Cloud DLP / Sensitive Data Protection, BigQuery masking policies |
| Encryption key control | Cloud KMS, CMEK |
| Governance and catalog | Dataplex, Dataplex Catalog |
| Data residency | Region-specific datasets and buckets |
| Bulk data transfer | Storage Transfer Service, Transfer Appliance |
| Database migration | Database Migration Service |
| CDC from databases | Datastream |
| Warehouse migration | BigQuery Data Transfer Service, staged loads, validation queries |

Patterns

Secure analytics pattern

  • Store raw data in restricted datasets.
  • Apply least privilege IAM.
  • Use policy tags for sensitive columns.
  • Use row-level security for business unit or geography restrictions.
  • Expose curated datasets or authorized views to analysts.
  • Audit access with Cloud Audit Logs and BigQuery job history.

Data residency pattern

  • Keep raw data in the required region.
  • Avoid copying sensitive data across regions unless explicitly allowed.
  • Share only aggregated, anonymized, or policy-approved data when global reporting is needed.
  • Validate that BigQuery dataset location, Cloud Storage bucket location, and processing region align.
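
The last check in the data residency pattern can be automated as a simple inventory comparison. This is a minimal sketch, assuming you have already collected each resource's location into a dictionary; the resource names and the `residency_violations` helper are hypothetical, and multi-region locations would need an explicit policy decision that this code does not model.

```python
def residency_violations(required_region, resources):
    """Return the names of resources whose location differs from the required region.

    `resources` maps a resource name to its configured location string.
    The comparison is case-insensitive because location strings are not
    always written with consistent casing.
    """
    want = required_region.lower()
    return sorted(name for name, loc in resources.items() if loc.lower() != want)


# Hypothetical inventory for a pipeline that must stay in europe-west1.
inventory = {
    "bq_dataset:raw_events": "europe-west1",
    "gcs_bucket:landing-zone": "EUROPE-WEST1",
    "dataflow_job:enrich": "us-central1",
}
violations = residency_violations("europe-west1", inventory)
print(violations)
```

Any non-empty result means at least one storage or processing location is out of alignment and should be fixed before sensitive data flows through it.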

Migration pattern

  • Analyze current state and stakeholder requirements.
  • Choose migration tool based on source and volume.
  • Perform staged loads.
  • Reconcile row counts, checksums, and business aggregates.
  • Run parallel validation before cutover.
  • Switch consumers only after validation passes.
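
The reconciliation step in the migration pattern can be sketched as a comparison of three fingerprints: row count, an order-independent checksum, and a business aggregate. The code below is an illustration under assumptions: rows are small in-memory dictionaries, amounts are integers (for example cents), and the field names are hypothetical. On real tables the same idea would usually be expressed as validation queries on both sides.

```python
import hashlib


def table_fingerprint(rows, amount_field):
    """Summarize rows as (row count, order-independent checksum, total of one field)."""
    digests = sorted(
        hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()
        for row in rows
    )
    checksum = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    total = sum(row[amount_field] for row in rows)
    return len(rows), checksum, total


def reconcile(source_rows, target_rows, amount_field):
    """Compare source and target fingerprints; return check name -> passed."""
    s = table_fingerprint(source_rows, amount_field)
    t = table_fingerprint(target_rows, amount_field)
    return {
        "row_count": s[0] == t[0],
        "checksum": s[1] == t[1],
        "business_total": s[2] == t[2],
    }


source = [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]
good_target = [{"id": 2, "amount": 250}, {"id": 1, "amount": 100}]  # same rows, different order
bad_target = [{"id": 1, "amount": 100}, {"id": 2, "amount": 205}]   # one corrupted value

good_report = reconcile(source, good_target, "amount")
bad_report = reconcile(source, bad_target, "amount")
print(good_report)
print(bad_report)
```

Note that the corrupted target still passes the row-count check, which is why the pattern asks for checksums and business aggregates as well.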

Traps

  • Choosing naming conventions instead of IAM or policy enforcement.
  • Granting BigQuery Admin to analysts for convenience.
  • Copying regulated data to another region just to simplify reporting.
  • Migrating with a one-time copy and no validation.
  • Using Pub/Sub for historical bulk migration when Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, or Transfer Appliance fits better.
  • Building custom governance scripts instead of using Dataplex, IAM, policy tags, DLP, and audit logs.

Domain 2: Ingesting and processing the data

Concepts

This is the largest exam domain. It tests whether you can plan, build, deploy, and operationalize batch and streaming pipelines.

Key ideas:

  • Pub/Sub ingests events. It is not a transformation engine or scheduler.
  • Dataflow transforms batch and streaming data. It is the default managed service for Apache Beam pipelines.
  • Dataproc runs Spark/Hadoop workloads. Use it for existing Spark/Hadoop/Hive ecosystems or custom cluster-level dependencies.
  • Cloud Data Fusion provides visual ETL. Use it when low-code integration and connectors matter.
  • Cloud Composer orchestrates workflows. It coordinates jobs; it should not perform heavy transformations inside DAG code.
  • Dataform manages SQL transformations in BigQuery. Use it for SQL modeling, dependencies, testing, and repeatable BigQuery transformations.

Services

| Scenario | Best fit | Why |
| --- | --- | --- |
| Streaming events from apps or devices | Pub/Sub | Decoupled, durable event ingestion |
| Real-time transformation and enrichment | Dataflow | Managed Beam streaming with windows, triggers, state, late data handling |
| Existing Spark jobs with custom libraries | Dataproc | Managed Spark/Hadoop compatibility |
| Low-code ETL with connectors | Cloud Data Fusion | Visual pipeline design and integration |
| SQL transformations in BigQuery | Dataform | Versioned SQL workflows and dependency management |
| Scheduling complex multi-step workflows | Cloud Composer | Airflow DAG orchestration |
| Lightweight service orchestration | Workflows | Serverless orchestration of APIs and services |
| CDC from operational databases | Datastream | Change streams for replication and analytics ingestion |
| Kafka integration | Pub/Sub Kafka connector or managed integration pattern | Avoid self-managing unless required |

Patterns

Streaming analytics pattern

  1. Producers publish events to Pub/Sub.
  2. Dataflow reads Pub/Sub messages.
  3. Dataflow validates, enriches, windows, and handles late data.
  4. Invalid records go to a dead-letter topic or error table.
  5. Clean results are written to BigQuery or Bigtable depending on access pattern.
  6. Monitoring alerts on backlog, errors, latency, and failed workers.
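
To make steps 3 and 4 concrete, here is a small pure-Python simulation of fixed event-time windows, allowed lateness, and dead-letter routing. It is not Apache Beam code and not how Dataflow is implemented: the watermark is simplified to the largest event time seen so far, and the event fields, window size, and lateness values are hypothetical.

```python
from collections import defaultdict


def process_stream(events, window_size=60, allowed_lateness=30):
    """Count events per fixed event-time window, in arrival order.

    Returns (counts, dead_letter, too_late). The watermark here is simply the
    largest event time seen so far, a simplification of what a real runner tracks.
    """
    counts = defaultdict(int)
    dead_letter, too_late = [], []
    watermark = 0
    for event in events:
        ts = event.get("ts")
        if not isinstance(ts, int) or ts < 0:
            dead_letter.append(event)  # invalid record: quarantine it, do not crash
            continue
        window_start = ts - ts % window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            too_late.append(event)  # window is closed even after the lateness grace period
            continue
        counts[window_start] += 1
        watermark = max(watermark, ts)
    return dict(counts), dead_letter, too_late


events = [
    {"id": "a", "ts": 5},
    {"id": "b", "ts": 62},
    {"id": "c", "ts": 58},   # late, but still within allowed lateness
    {"id": "d"},             # malformed: no timestamp
    {"id": "e", "ts": 130},
    {"id": "f", "ts": 10},   # too late: watermark 130 is past window end 60 + lateness 30
]
counts, dead_letter, too_late = process_stream(events)
print(counts, dead_letter, too_late)
```

Event `c` arrives out of order but is still counted in its original window, while `d` and `f` are set aside instead of failing the pipeline or silently corrupting results.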

Batch file ingestion pattern

  1. Files land in Cloud Storage.
  2. Cloud Composer or Eventarc triggers processing.
  3. Dataflow, Dataproc, Data Fusion, or BigQuery loads transform the data.
  4. BigQuery stores curated analytics tables.
  5. Dataform manages SQL models and tests.

CDC analytics pattern

  1. Datastream captures database changes.
  2. Changes land in Cloud Storage or feed downstream pipelines.
  3. Dataflow or BigQuery transformations merge updates into curated tables.
  4. Validate latency, ordering, duplicates, and schema evolution.
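
Step 3 of the CDC pattern, merging updates into curated tables, has to tolerate duplicates and out-of-order delivery. The sketch below illustrates one way to do that with a per-key sequence number. The event shape (`key`, `op`, `seq`, `row`) is a hypothetical format chosen for the example, not the format Datastream emits, and a real pipeline would typically express this as a merge in BigQuery or Dataflow.

```python
def apply_changes(table, changes):
    """Merge CDC events into `table` (dict keyed by primary key), in place.

    Events with a seq less than or equal to the last applied seq for a key
    are skipped, so duplicates and stale deliveries do not corrupt the result.
    """
    for change in sorted(changes, key=lambda c: c["seq"]):
        key = change["key"]
        current = table.get(key)
        if current is not None and change["seq"] <= current["seq"]:
            continue  # duplicate or stale event
        if change["op"] == "delete":
            # Keep a tombstone so a stale upsert cannot resurrect the row.
            table[key] = {"seq": change["seq"], "deleted": True, "row": None}
        else:
            table[key] = {"seq": change["seq"], "deleted": False, "row": change["row"]}
    return table


def live_rows(table):
    """Return only rows that have not been deleted."""
    return {k: v["row"] for k, v in table.items() if not v["deleted"]}


changes = [
    {"key": 1, "op": "upsert", "seq": 1, "row": {"name": "Ada"}},
    {"key": 2, "op": "upsert", "seq": 2, "row": {"name": "Bob"}},
    {"key": 1, "op": "upsert", "seq": 3, "row": {"name": "Ada L."}},
    {"key": 2, "op": "delete", "seq": 4, "row": None},
]
# Deliver out of order and with one duplicate.
shuffled = [changes[2], changes[0], changes[3], changes[1], changes[0]]

table = apply_changes({}, shuffled)
apply_changes(table, shuffled)  # replaying the same batch changes nothing
result = live_rows(table)
print(result)
```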

Traps

  • Using Cloud Composer as the data processing engine instead of orchestrator.
  • Using Pub/Sub as a database or long-term store.
  • Choosing Dataproc for new simple serverless pipelines when Dataflow is more managed.
  • Choosing Dataflow for a legacy Spark job that must preserve Spark APIs and dependencies; Dataproc is usually better.
  • Ignoring late-arriving data in streaming questions.
  • Writing custom retry scripts instead of using managed retries, dead-letter topics, idempotent processing, and monitoring.
  • Loading malformed records directly into production tables instead of quarantine/error tables.

Domain 3: Storing the data

Concepts

This domain tests service selection and data platform design. Most storage questions are solved by identifying the access pattern.

Ask these questions:

  1. Is it analytics or transactions?
  2. Is it structured, semi-structured, unstructured, or file/object data?
  3. Is the workload read-heavy, write-heavy, or mixed?
  4. Is strong global consistency required?
  5. Is millisecond key-value access required?
  6. Is SQL relational modeling required?
  7. Is horizontal scale more important than joins?
  8. Is the primary access pattern large scans or point lookups?

Services

| Service | Use when | Avoid when |
| --- | --- | --- |
| BigQuery | Petabyte-scale SQL analytics, BI, ELT, warehouse, federated analytics | OLTP transactions, low-latency point updates, application serving database |
| Cloud Storage | Raw files, landing zone, data lake objects, backups, archives | Relational queries, high-frequency row updates, transactional workloads |
| BigLake | Governed lakehouse access over data lakes and BigQuery | Simple object storage without governance needs |
| Bigtable | Massive scale, low-latency key-value/wide-column, time-series, IoT, high write throughput | Ad hoc SQL analytics, joins, transactions across rows |
| Spanner | Globally scalable relational database with strong consistency and high availability | Simple single-region relational apps where Cloud SQL is enough |
| Cloud SQL | Managed MySQL/PostgreSQL/SQL Server for traditional relational apps | Global horizontal relational scale or massive analytics |
| AlloyDB | High-performance PostgreSQL-compatible operational workloads | Non-PostgreSQL workloads or analytical warehouse use cases |
| Firestore | Serverless document database for mobile/web apps with flexible documents | Analytical scans, relational joins, warehouse workloads |
| Memorystore | Managed Redis/Memcached caching, session state, low-latency cache | Durable source of truth or analytics |
| Dataplex | Governed data lake/platform management, cataloging, zones | Replacing storage or processing engines |
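
As a memory aid, the table above can be compressed into a small decision function. This is only a study sketch: the trait names are hypothetical, the rule order is one reasonable reading of the table, and real exam scenarios often combine constraints that a handful of booleans cannot capture.

```python
def pick_storage(workload):
    """Suggest a storage service from a dict of boolean workload traits.

    Rules are checked from most to least specific and only cover the
    storage rows of the table above.
    """
    def has(trait):
        return bool(workload.get(trait, False))

    if has("cache_only"):
        return "Memorystore"
    if has("files_or_objects"):
        return "Cloud Storage"
    if has("analytics_scans"):
        return "BigQuery"
    if has("relational"):
        if has("global_strong_consistency"):
            return "Spanner"
        return "AlloyDB" if has("postgres_high_performance") else "Cloud SQL"
    if has("key_value_low_latency") or has("time_series"):
        return "Bigtable"
    if has("document_mobile_web"):
        return "Firestore"
    return "unclear: re-read the scenario for the access pattern"


print(pick_storage({"relational": True, "global_strong_consistency": True}))
print(pick_storage({"time_series": True}))
print(pick_storage({"analytics_scans": True}))
```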

Patterns

Warehouse pattern

  • Use BigQuery for curated analytical tables.
  • Partition by time or ingestion date for large time-based data.
  • Cluster by frequently filtered/joined columns.
  • Use materialized views or summary tables for repeated expensive aggregations.
  • Use authorized views, row-level security, and policy tags for controlled access.
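
The partitioning and clustering bullets translate into table DDL. The helper below only builds a BigQuery-style `CREATE TABLE` string so the shape of the statement is easy to see; the table name, columns, and the `build_table_ddl` function are hypothetical, and the partition column is assumed to be a TIMESTAMP partitioned by day.

```python
def build_table_ddl(table, columns, partition_column, cluster_columns):
    """Return a CREATE TABLE statement with daily partitioning and clustering.

    `columns` is a list of (name, type) pairs. The partition column is assumed
    to be a TIMESTAMP, so it is wrapped in DATE() for daily partitions.
    """
    if not 1 <= len(cluster_columns) <= 4:
        raise ValueError("expected between 1 and 4 clustering columns")
    column_lines = ",\n".join(f"  {name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = build_table_ddl(
    "analytics.events",  # hypothetical dataset.table
    [
        ("event_ts", "TIMESTAMP"),
        ("customer_id", "STRING"),
        ("region", "STRING"),
        ("amount", "NUMERIC"),
    ],
    partition_column="event_ts",
    cluster_columns=["customer_id", "region"],
)
print(ddl)
```

Queries that filter on the partition column and on the leading clustering columns are the ones that benefit most from this layout.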

Lakehouse pattern

  • Land raw data in Cloud Storage.
  • Govern discovery and access with Dataplex and BigLake.
  • Process raw to curated zones using Dataflow, Dataproc, Data Fusion, or BigQuery.
  • Serve analytics in BigQuery.

Operational serving pattern

  • Use Cloud SQL or AlloyDB for traditional relational application databases.
  • Use Spanner for global relational scale with strong consistency.
  • Use Bigtable for extremely high-throughput key-value/time-series workloads.
  • Use Firestore for serverless document-based mobile/web apps.
  • Use Memorystore for caching, not durable storage.

Traps

  • Choosing BigQuery for user-facing low-latency transactional workloads.
  • Choosing Cloud Storage when users need SQL analytics and BI without defining BigQuery or BigLake access.
  • Choosing Cloud SQL for global scale and multi-region strong consistency when Spanner is the better fit.
  • Choosing Bigtable when the question requires joins, SQL, or multi-row transactions.
  • Choosing Firestore for analytical reporting.
  • Treating Memorystore as durable storage.
  • Ignoring lifecycle policies and storage class cost optimization for Cloud Storage.

Domain 4: Preparing and using data for analysis

Concepts

This domain tests whether you can prepare data for BI, ML, sharing, visualization, and secure analysis.

Key ideas:

  • BigQuery is central for analytical preparation.
  • Performance tuning often involves partitioning, clustering, pruning, query rewrite, materialized views, BI Engine, and avoiding SELECT *.
  • Security for analysis often uses row-level security, column-level security, policy tags, authorized views, masking, IAM, and DLP.
  • BigQuery ML is useful when the model can be trained and used directly in BigQuery with SQL.
  • Vertex AI is more appropriate for advanced custom ML workflows, feature stores, training pipelines, deployment, and MLOps.
  • Analytics Hub is used for controlled data sharing and publishing datasets.

Services

| Requirement | Best fit |
| --- | --- |
| Fast BI dashboard over BigQuery | BI Engine, materialized views, aggregated tables, partitioning/clustering |
| Repeated expensive aggregations | Materialized views or scheduled summary tables |
| Controlled dataset sharing | Analytics Hub, authorized views, dataset access controls |
| Mask or classify PII | Cloud DLP / Sensitive Data Protection, policy tags, masking |
| SQL-based ML directly on warehouse data | BigQuery ML |
| Advanced custom ML lifecycle | Vertex AI |
| Prepare unstructured text for RAG | Embeddings, vector search patterns, preprocessing pipelines, governed storage |

Patterns

BI performance pattern

  • Partition large fact tables by date.
  • Cluster by high-cardinality filter or join columns where useful.
  • Precompute common aggregates.
  • Use materialized views for repeated deterministic aggregations.
  • Enable BI Engine for interactive dashboards.
  • Avoid scanning unnecessary columns and partitions.
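
The effect of the last bullet can be illustrated with a toy model of a columnar scan: bytes read are roughly rows in the scanned partitions times the width of the selected columns. The table shape, column widths, and the `bytes_scanned` helper are hypothetical, and the model ignores compression, clustering, and billing minimums, so treat the numbers as an order-of-magnitude illustration only.

```python
def bytes_scanned(partition_rows, column_bytes, selected_columns, partition_filter=None):
    """Estimate bytes read by a columnar scan.

    partition_rows: dict of partition date -> row count
    column_bytes: dict of column name -> average bytes per value
    partition_filter: optional set of partition dates the query restricts to
    """
    dates = [
        d for d in partition_rows
        if partition_filter is None or d in partition_filter
    ]
    rows = sum(partition_rows[d] for d in dates)
    return rows * sum(column_bytes[c] for c in selected_columns)


# Hypothetical fact table: 30 daily partitions of one million rows each.
partitions = {f"2024-01-{day:02d}": 1_000_000 for day in range(1, 31)}
widths = {"event_ts": 8, "customer_id": 16, "payload": 500, "amount": 8}

select_star = bytes_scanned(partitions, widths, list(widths))
pruned = bytes_scanned(
    partitions, widths, ["customer_id", "amount"], partition_filter={"2024-01-15"}
)
print(select_star, pruned, select_star // pruned)
```

In this toy table, filtering to one partition and two narrow columns reads several hundred times less data than `SELECT *` over the whole table.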

Secure sharing pattern

  • Publish curated datasets rather than raw sensitive data.
  • Use Analytics Hub for managed sharing.
  • Use authorized views to expose limited views.
  • Use policy tags and masking for sensitive columns.
  • Use row-level security for tenant, region, or department filtering.
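
The combined effect of row-level security and column masking can be pictured with a small in-memory simulation. This is not BigQuery's enforcement mechanism, only a sketch of what each kind of user ends up seeing; the `user` fields, the sample rows, and the choice to mask by nulling values are all assumptions made for the example.

```python
def secure_view(rows, user, row_filter_column, masked_columns):
    """Return the rows a user may see, with sensitive columns masked.

    `user` is a dict with hypothetical keys: "regions" (set of allowed values
    for the filter column) and "can_see_pii" (bool).
    """
    visible = []
    for row in rows:
        if row[row_filter_column] not in user["regions"]:
            continue  # row-level filter: the user never sees this row
        out = dict(row)  # copy so the underlying data is not modified
        if not user["can_see_pii"]:
            for col in masked_columns:
                out[col] = None  # column-level masking by nulling the value
        visible.append(out)
    return visible


orders = [
    {"order_id": 1, "region": "EU", "email": "a@example.com", "amount": 40},
    {"order_id": 2, "region": "US", "email": "b@example.com", "amount": 75},
]
eu_analyst = {"regions": {"EU"}, "can_see_pii": False}
global_auditor = {"regions": {"EU", "US"}, "can_see_pii": True}

analyst_rows = secure_view(orders, eu_analyst, "region", ["email"])
auditor_rows = secure_view(orders, global_auditor, "region", ["email"])
print(analyst_rows)
print(auditor_rows)
```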

ML preparation pattern

  • Use BigQuery to clean, join, and prepare structured features.
  • Use BigQuery ML for SQL-native models and simple forecasting/classification/regression.
  • Use Vertex AI for custom training, feature management, pipelines, endpoints, and MLOps.
  • For unstructured data/RAG, prepare chunking, metadata, embeddings, access controls, and retrieval quality validation.

Traps

  • Solving every slow dashboard with more slots before optimizing table design and query patterns.
  • Exposing raw datasets when curated views or shared listings are safer.
  • Using BigQuery ML for complex custom ML lifecycle requirements better served by Vertex AI.
  • Forgetting PII masking and column-level controls in analytics environments.
  • Using CSV exports as the primary sharing mechanism when Analytics Hub or BigQuery sharing is better.

Domain 5: Maintaining and automating data workloads

Concepts

This domain tests operational excellence: automation, cost, monitoring, troubleshooting, capacity, fault tolerance, and repeatability.

Key ideas:

  • Use Cloud Composer for DAG-based orchestration.
  • Use Dataform for repeatable SQL transformations in BigQuery.
  • Use Cloud Monitoring and Cloud Logging for observability.
  • Use BigQuery admin tools, job history, INFORMATION_SCHEMA, audit logs, reservations, and slot metrics for BigQuery troubleshooting.
  • Use reservations and Editions for predictable BigQuery capacity management.
  • Use partitioning, clustering, query optimization, and lifecycle policies before blindly scaling resources.
  • Use retries, idempotency, checkpoints, dead-letter topics, and alerting for failure management.

Services

| Operational need | Service or feature |
| --- | --- |
| DAG scheduling and dependencies | Cloud Composer |
| SQL transformation dependencies | Dataform |
| API/service orchestration | Workflows |
| Pipeline metrics and alerts | Cloud Monitoring |
| Logs and error analysis | Cloud Logging |
| BigQuery troubleshooting | BigQuery admin panel, job history, INFORMATION_SCHEMA, audit logs |
| Capacity management | BigQuery Editions, reservations, slots |
| Cost controls | Budgets, labels, partitioning, clustering, lifecycle policies, reservations |
| Fault tolerance | Retries, checkpoints, dead-letter queues, idempotent writes, regional design |

Patterns

Reliable DAG pattern

  • Keep DAG tasks small and idempotent.
  • Use retries with backoff.
  • Store secrets in Secret Manager, not in code.
  • Use service accounts with least privilege.
  • Monitor SLA misses, retries, and failures.
  • Do not run large transformations inside the scheduler process.
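
Retries with backoff only work safely when the task itself is idempotent. The sketch below shows both ideas in plain Python rather than Airflow code: `run_with_retries` and `load_partition` are hypothetical names, and the `sleep` parameter is injectable so the example runs instantly. In Airflow the equivalent behavior is normally configured through task arguments such as `retries` and `retry_delay` instead of a hand-written loop.

```python
import time


def run_with_retries(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `task()` until it succeeds, waiting base_delay * 2**n between attempts."""
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure so it can be alerted on
            sleep(base_delay * 2 ** attempt)


output = {}
calls = {"n": 0}


def load_partition(run_date="2024-01-01"):
    """Hypothetical task that fails twice, then writes one keyed result."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    output[run_date] = "loaded"  # overwrite by key, so reruns do not duplicate data
    return run_date


delays = []
result = run_with_retries(load_partition, sleep=delays.append)  # record delays instead of sleeping
print(result, delays, output)
```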

BigQuery cost pattern

  • Partition large tables by date or ingestion time.
  • Cluster when filters repeatedly use specific columns.
  • Avoid SELECT *.
  • Use dry runs and query estimates.
  • Use materialized views or summary tables for repeated calculations.
  • Use reservations for predictable capacity or workload isolation.
  • Use labels for chargeback and monitoring.
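
A dry run reports how many bytes a query would process, which converts directly into an approximate on-demand cost. The arithmetic is sketched below; the price per TiB is an assumed figure that you should replace with current pricing for your region, and the byte counts are made-up inputs standing in for dry-run results (in the Python client library, a dry run is requested with `QueryJobConfig(dry_run=True)` and the estimate is read from `total_bytes_processed`).

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_processed, price_per_tib):
    """Convert a dry-run byte estimate into an approximate on-demand query cost."""
    return bytes_processed / TIB * price_per_tib


PRICE = 6.25  # assumed USD per TiB; check current pricing before relying on it

full_scan = on_demand_cost(5 * TIB, PRICE)   # hypothetical unfiltered query
pruned = on_demand_cost(TIB // 8, PRICE)     # hypothetical query after pruning
print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
```

Comparing the two estimates before running a query is a cheap way to confirm that a partition filter or narrower column list is actually being applied.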

Failure recovery pattern

  • Detect with Cloud Monitoring and Logging.
  • Quarantine bad data.
  • Use idempotent reprocessing.
  • Use checkpoints or replayable sources.
  • Validate output completeness and quality.
  • Alert owners and maintain runbooks.
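
The quarantine, idempotent reprocessing, and checkpoint steps fit together as shown in the sketch below. It is a simplified model: a Python list stands in for a replayable source, a dictionary keyed by record id stands in for an idempotent sink, and all names are hypothetical. The final completeness check mirrors the "validate output" step.

```python
def process_from_checkpoint(source, state, sink, quarantine):
    """Process records after the last committed offset; safe to call repeatedly.

    source: list of records, where the index is the offset
    state: dict holding "offset", the next offset to read
    sink: dict keyed by record id, so rewriting the same record is harmless
    """
    for offset in range(state.get("offset", 0), len(source)):
        record = source[offset]
        try:
            sink[record["id"]] = int(record["value"])
        except (KeyError, TypeError, ValueError):
            quarantine.append((offset, record))  # bad data is set aside, not lost
        state["offset"] = offset + 1  # commit only after the record is handled
    return state


source = [
    {"id": "r1", "value": "10"},
    {"id": "r2", "value": "oops"},  # malformed value
    {"id": "r3", "value": "30"},
]
state, sink, quarantine = {}, {}, []

process_from_checkpoint(source[:2], state, sink, quarantine)  # first run sees two records
process_from_checkpoint(source, state, sink, quarantine)      # restart resumes at offset 2
process_from_checkpoint(source, state, sink, quarantine)      # rerun is a no-op

complete = len(sink) + len(quarantine) == len(source)
print(sink, quarantine, state, complete)
```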

Traps

  • Using cron on a VM instead of Cloud Composer or managed scheduling for critical pipelines.
  • Scaling BigQuery slots without checking partition pruning, clustering, and query design.
  • Using persistent Dataproc clusters for infrequent jobs when ephemeral clusters reduce cost.
  • Ignoring quotas and billing alerts.
  • Not designing pipelines to restart safely.
  • Monitoring only infrastructure metrics while ignoring data quality and business-level pipeline metrics.
