Google Cloud Professional Data Engineer Exam Course
1. Exam Overview
What the exam is testing
The Google Cloud Professional Data Engineer exam validates whether you can design, build, operationalize, secure, monitor, optimize, and troubleshoot data processing systems on Google Cloud. The exam is not mainly a memorization test. It tests whether you can read a business scenario, identify the real constraint, eliminate tempting but wrong services, and choose the most managed, secure, reliable, scalable, and cost-effective Google Cloud architecture.
The current official standard exam guide organizes the exam into five domains:
Designing data processing systems
Ingesting and processing the data
Storing the data
Preparing and using data for analysis
Maintaining and automating data workloads
The standard exam is 2 hours, contains 40-50 multiple-choice and multiple-select questions, and is available in English and Japanese. The certification is valid for 2 years. Google also offers a shorter renewal exam for active certificate holders.
How to think like the exam
In most questions, the correct answer is the option that best balances the following priorities:
Meet the business requirement first. Do not optimize for cost when the scenario says the system is mission-critical and needs low latency or high availability.
Use managed services when possible. Prefer BigQuery, Dataflow, Pub/Sub, Dataplex, Cloud Composer, Cloud Data Fusion, or managed databases over self-managed infrastructure unless the scenario explicitly requires custom frameworks or legacy compatibility.
Use least privilege and governance by design. Correct answers often include IAM, service accounts, policy tags, authorized views, row-level security, Cloud KMS, VPC Service Controls, Dataplex, and audit logging.
Select the service by access pattern. Storage questions are usually about reads, writes, latency, consistency, scale, relational requirements, and analytics requirements.
Avoid operational burden. If two options work, the exam usually favors the one with less manual administration and fewer custom scripts.
Watch the words. Terms such as streaming, near real time, ACID, global consistency, time-series, low latency, petabyte analytics, serverless, batch, orchestration, CDC, data residency, and PII usually point to specific services.
How to use this course
Use this file as a compressed revision guide. Start with the exam domains, then study the service-selection tables, then practice the architecture patterns and traps. The question bank behind this course repeatedly emphasizes BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, and monitoring/optimization patterns. Those are the highest-yield services and decisions for this exam.
Cloud DLP / Sensitive Data Protection, BigQuery masking policies
Encryption key control | Cloud KMS, CMEK
Governance and catalog | Dataplex, Dataplex Catalog
Data residency | Region-specific datasets and buckets
Bulk data transfer | Storage Transfer Service, Transfer Appliance
Database migration | Database Migration Service
CDC from databases | Datastream
Warehouse migration | BigQuery Data Transfer Service, staged loads, validation queries
Patterns
Secure analytics pattern
Store raw data in restricted datasets.
Apply least privilege IAM.
Use policy tags for sensitive columns.
Use row-level security for business unit or geography restrictions.
Expose curated datasets or authorized views to analysts.
Audit access with Cloud Audit Logs and BigQuery job history.
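The effect of combining row-level security with column masking can be illustrated with a toy model. This is only a sketch in plain Python: in BigQuery the service itself enforces these controls through row access policies and policy tags, not application code. The names `SENSITIVE_COLUMNS`, `analyst_view`, and the sample rows are hypothetical.

```python
# Hypothetical: columns that would carry a policy tag in the real table.
SENSITIVE_COLUMNS = {"email"}


def analyst_view(rows, analyst_region, can_read_sensitive=False):
    """Return only the rows and column values this analyst may see."""
    visible = []
    for row in rows:
        # Row-level restriction: analysts see only their own region.
        if row["region"] != analyst_region:
            continue
        masked = {}
        for column, value in row.items():
            # Column-level restriction: mask tagged columns unless allowed.
            hide = column in SENSITIVE_COLUMNS and not can_read_sensitive
            masked[column] = "***" if hide else value
        visible.append(masked)
    return visible


curated = [
    {"customer_id": 1, "region": "EU", "email": "a@example.com", "spend": 120},
    {"customer_id": 2, "region": "US", "email": "b@example.com", "spend": 80},
]
eu_rows = analyst_view(curated, "EU")
```

The point for the exam is that both restrictions are enforced by policy rather than by trusting analysts to write the right filter.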
Data residency pattern
Keep raw data in the required region.
Avoid copying sensitive data across regions unless explicitly allowed.
Share only aggregated, anonymized, or policy-approved data when global reporting is needed.
Validate that BigQuery dataset location, Cloud Storage bucket location, and processing region align.
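The alignment check in the last step can be sketched as a simple inventory comparison. This is a minimal illustration, assuming you have already collected resource locations into a list; the `Resource` record and `residency_violations` function are hypothetical and do not call any Google Cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A hypothetical inventory record, not a Google Cloud API object."""
    name: str
    kind: str       # e.g. "bigquery_dataset", "gcs_bucket", "dataflow_job"
    location: str   # e.g. "europe-west3"


def residency_violations(resources, allowed_locations):
    """Return the resources whose location is outside the allowed set.

    The comparison is case-insensitive because location strings are often
    written in different cases by different tools.
    """
    allowed = {loc.lower() for loc in allowed_locations}
    return [r for r in resources if r.location.lower() not in allowed]


inventory = [
    Resource("raw_events", "bigquery_dataset", "europe-west3"),
    Resource("landing-bucket", "gcs_bucket", "EUROPE-WEST3"),
    Resource("enrich-job", "dataflow_job", "us-central1"),
]
violations = residency_violations(inventory, {"europe-west3"})
```

Here only the processing job falls outside the required region, which is the kind of mismatch residency questions tend to hide in the scenario text.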
Migration pattern
Analyze current state and stakeholder requirements.
Choose migration tool based on source and volume.
Perform staged loads.
Reconcile row counts, checksums, and business aggregates.
Run parallel validation before cutover.
Switch consumers only after validation passes.
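The reconciliation step can be sketched as follows. This is a minimal in-memory illustration, assuming both sides fit in Python lists of dicts; in practice the same three checks would usually be expressed as validation queries on the source and target systems. The function names and the `amount_cents` field are hypothetical.

```python
import hashlib


def table_fingerprint(rows, amount_field):
    """Summarize a table as row count, order-independent checksum, and a total."""
    checksum = 0
    total = 0
    for row in rows:
        # Sort field names so column order does not change the row hash.
        canonical = "|".join(f"{name}={row[name]}" for name in sorted(row))
        row_hash = int(hashlib.sha256(canonical.encode("utf-8")).hexdigest(), 16)
        # Modular addition keeps the checksum independent of row order.
        checksum = (checksum + row_hash) % (2 ** 256)
        total += row[amount_field]
    return {"row_count": len(rows), "checksum": checksum, "total": total}


def reconcile(source_rows, target_rows, amount_field):
    """Return the names of the checks that differ between source and target."""
    src = table_fingerprint(source_rows, amount_field)
    tgt = table_fingerprint(target_rows, amount_field)
    return [name for name in src if src[name] != tgt[name]]


source = [{"id": 1, "amount_cents": 1000}, {"id": 2, "amount_cents": 2550}]
target_ok = [{"amount_cents": 2550, "id": 2}, {"amount_cents": 1000, "id": 1}]
target_bad = [{"id": 1, "amount_cents": 1000}, {"id": 2, "amount_cents": 2500}]
```

An empty result from `reconcile` means all three checks agree; a matching row count with a differing checksum points to corrupted or transformed values rather than missing rows.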
Traps
Choosing naming conventions instead of IAM or policy enforcement.
Granting BigQuery Admin to analysts for convenience.
Copying regulated data to another region just to simplify reporting.
Migrating with a one-time copy and no validation.
Using Pub/Sub for historical bulk migration when Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, or Transfer Appliance fits better.
Building custom governance scripts instead of using Dataplex, IAM, policy tags, DLP, and audit logs.
Domain 2: Ingesting and processing the data
Concepts
This is the largest exam domain. It tests whether you can plan, build, deploy, and operationalize batch and streaming pipelines.
Key ideas:
Pub/Sub ingests events. It is not a transformation engine or scheduler.
Dataflow transforms batch and streaming data. It is the default managed service for Apache Beam pipelines.
Dataproc runs Spark/Hadoop workloads. Use it for existing Spark/Hadoop/Hive ecosystems or custom cluster-level dependencies.
Cloud Data Fusion provides visual ETL. Use it when low-code integration and connectors matter.
Cloud Composer orchestrates workflows. It coordinates jobs; it should not perform heavy transformations inside DAG code.
Dataform manages SQL transformations in BigQuery. Use it for SQL modeling, dependencies, testing, and repeatable BigQuery transformations.
Services
Scenario | Best fit | Why
Streaming events from apps or devices | Pub/Sub | Decoupled, durable event ingestion
Real-time transformation and enrichment | Dataflow | Managed Beam streaming with windows, triggers, state, late data handling
Existing Spark jobs with custom libraries | Dataproc | Managed Spark/Hadoop compatibility
Low-code ETL with connectors | Cloud Data Fusion | Visual pipeline design and integration
SQL transformations in BigQuery | Dataform | Versioned SQL workflows and dependency management
Scheduling complex multi-step workflows | Cloud Composer | Airflow DAG orchestration
Lightweight service orchestration | Workflows | Serverless orchestration of APIs and services
CDC from operational databases | Datastream | Change streams for replication and analytics ingestion
Kafka integration | Pub/Sub Kafka connector or managed integration pattern | Avoid self-managing unless required
Patterns
Streaming analytics pattern
Producers publish events to Pub/Sub.
Dataflow reads Pub/Sub messages.
Dataflow validates, enriches, windows, and handles late data.
Invalid records go to a dead-letter topic or error table.
Clean results are written to BigQuery or Bigtable depending on access pattern.
Monitoring alerts on backlog, errors, latency, and failed workers.
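The validate, window, and late-data steps above can be sketched with a small in-memory simulation. This is not Apache Beam or Dataflow code: the watermark here is simply the highest event time seen so far, which is a much cruder estimate than a real runner uses, and the window size and allowed lateness are arbitrary assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 60             # assumed fixed window size
ALLOWED_LATENESS_SECONDS = 30   # assumed grace period after a window closes


def process_stream(events):
    """Simulate validate -> fixed window -> late-data handling.

    Valid events carry an integer "event_time" (seconds) and a "value".
    Events are consumed in arrival order, which may differ from event time.
    """
    windows = defaultdict(int)  # window start -> summed value
    dead_letter = []            # malformed records kept for inspection
    dropped_late = []           # records that arrived after the grace period
    watermark = 0               # crude watermark: max event time seen so far

    for event in events:
        if not isinstance(event.get("event_time"), int) or "value" not in event:
            dead_letter.append(event)
            continue
        window_start = event["event_time"] - event["event_time"] % WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        if watermark > window_end + ALLOWED_LATENESS_SECONDS:
            dropped_late.append(event)
            continue
        windows[window_start] += event["value"]
        watermark = max(watermark, event["event_time"])
    return dict(windows), dead_letter, dropped_late


arrivals = [
    {"event_time": 5, "value": 1},
    {"event_time": 65, "value": 2},
    {"event_time": 50, "value": 3},    # late, but within allowed lateness
    {"value": 9},                      # malformed: no event time
    {"event_time": 200, "value": 4},
    {"event_time": 10, "value": 5},    # too late: its window closed long ago
]
windows, dead_letter, dropped_late = process_stream(arrivals)
```

The event at time 50 still lands in the first window because it arrives within the grace period, while the event at time 10 is set aside once the watermark has moved far past its window.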
Batch file ingestion pattern
Files land in Cloud Storage.
Cloud Composer or Eventarc triggers processing.
Dataflow, Dataproc, or Cloud Data Fusion transforms the data; BigQuery load jobs can ingest files directly when transformation happens later in SQL.
BigQuery stores curated analytics tables.
Dataform manages SQL models and tests.
CDC analytics pattern
Datastream captures database changes.
Changes land in BigQuery or Cloud Storage, or feed downstream pipelines.
Dataflow or BigQuery transformations merge updates into curated tables.
Validate latency, ordering, duplicates, and schema evolution.
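The merge step, including the ordering and duplicate concerns, can be sketched as an idempotent upsert. This is a minimal illustration in plain Python, assuming each change event carries a hypothetical `seq` field that increases with source commit order; it is not the Datastream event format.

```python
def apply_changes(table, changes):
    """Merge CDC events into `table` (a dict keyed by primary key), idempotently.

    Each change has "op" ("upsert" or "delete"), "key", "seq", and for
    upserts a "row". Replaying the same events leaves the table unchanged.
    """
    for change in sorted(changes, key=lambda c: c["seq"]):
        current = table.get(change["key"])
        # Skip duplicates and stale replays: only a newer position wins.
        if current is not None and change["seq"] <= current["seq"]:
            continue
        if change["op"] == "delete":
            # Keep a tombstone so an older replayed upsert cannot revive the row.
            table[change["key"]] = {"seq": change["seq"], "row": None}
        else:
            table[change["key"]] = {"seq": change["seq"], "row": change["row"]}
    return table


def live_rows(table):
    """Return the current rows, hiding tombstones."""
    return {key: entry["row"] for key, entry in table.items() if entry["row"] is not None}


changes = [
    {"op": "upsert", "key": 1, "seq": 1, "row": {"name": "a"}},
    {"op": "upsert", "key": 1, "seq": 3, "row": {"name": "c"}},
    {"op": "upsert", "key": 1, "seq": 2, "row": {"name": "b"}},  # out of order
    {"op": "upsert", "key": 2, "seq": 4, "row": {"name": "x"}},
    {"op": "delete", "key": 2, "seq": 5},
    {"op": "upsert", "key": 2, "seq": 4, "row": {"name": "x"}},  # duplicate
]
state = apply_changes({}, changes)
```

Applying the same batch a second time changes nothing, which is the property that makes retries and replays safe.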
Traps
Using Cloud Composer as the data processing engine instead of orchestrator.
Using Pub/Sub as a database or long-term store.
Choosing Dataproc for new simple serverless pipelines when Dataflow is more managed.
Choosing Dataflow for a legacy Spark job that must preserve Spark APIs and dependencies; Dataproc is usually better.
Ignoring late-arriving data in streaming questions.
Writing custom retry scripts instead of using managed retries, dead-letter topics, idempotent processing, and monitoring.
Loading malformed records directly into production tables instead of quarantine/error tables.
Domain 3: Storing the data
Concepts
This domain tests service selection and data platform design. Most storage questions are solved by identifying the access pattern.
Ask these questions:
Is it analytics or transactions?
Is it structured, semi-structured, unstructured, or file/object data?
Is the workload read-heavy, write-heavy, or mixed?
Is strong global consistency required?
Is millisecond key-value access required?
Is SQL relational modeling required?
Is horizontal scale more important than joins?
Is the primary access pattern large scans or point lookups?
Cloud SQL will not scale to massive write throughput as well.
Memorystore is a cache, not durable storage.
Pattern 8: Dashboard acceleration
Scenario: Dashboards are slow and scan large BigQuery tables repeatedly.
Recommended solution: Review query design, partitioning, clustering, materialized views/summary tables, and BI Engine before increasing capacity.
Why: Most BI performance questions expect data modeling and query optimization first.
Why alternatives are wrong:
Buying more slots may help but can hide inefficient table design.
Exporting data to spreadsheets or CSV is not a scalable BI architecture.
Pattern 9: Repeatable SQL transformation
Scenario: Analytics engineers need versioned SQL models with dependencies and tests in BigQuery.
Recommended solution: Dataform.
Why: Dataform is designed for SQL workflow management in BigQuery.
Why alternatives are wrong:
Composer can schedule the workflow but should not replace SQL model dependency management.
Ad hoc queries are not repeatable or testable.
Pattern 10: Operational troubleshooting
Scenario: A pipeline failed overnight and downstream reports are incomplete.
Recommended solution: Use Cloud Logging, Cloud Monitoring, job history, Composer task logs, Dataflow job metrics, BigQuery INFORMATION_SCHEMA, and data validation checks. Reprocess idempotently from the last safe checkpoint.
Why: Troubleshooting must identify both infrastructure failure and data-quality impact.
Why alternatives are wrong:
Restarting everything without understanding duplicates or partial writes can corrupt outputs.
Monitoring only VM CPU misses managed service failures and data quality issues.
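Reprocessing idempotently from the last safe checkpoint can be sketched as follows. This is a simplified model, assuming batches have sortable ids and the sink is keyed by record id so that a repeated write overwrites rather than appends; the `reprocess` function and data shapes are hypothetical.

```python
def reprocess(batches, sink, checkpoint):
    """Replay batches after `checkpoint` into `sink` without creating duplicates.

    `batches` maps a sortable batch id to a list of records with an "id".
    `sink` is a dict keyed by record id, so rewriting a record is harmless.
    Returns the new checkpoint.
    """
    for batch_id in sorted(batches):
        if batch_id <= checkpoint:
            continue  # committed before the failure, nothing to redo
        for record in batches[batch_id]:
            sink[record["id"]] = record
        checkpoint = batch_id  # advance only after the whole batch is written
    return checkpoint


batches = {
    1: [{"id": "a", "v": 1}],
    2: [{"id": "b", "v": 2}, {"id": "c", "v": 3}],
    3: [{"id": "d", "v": 4}],
}
# State after the failure: batch 2 was only partially written.
sink = {"a": {"id": "a", "v": 1}, "b": {"id": "b", "v": 2}}
new_checkpoint = reprocess(batches, sink, checkpoint=1)
```

Because record "b" is overwritten instead of appended, the partial write from the failed run does not produce a duplicate.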
7. Exam Traps
Misleading wording
Wording in question | What it usually means
Near real-time, streaming, late events | Pub/Sub + Dataflow
Existing Spark/Hadoop | Dataproc
Visual ETL, low-code, connectors | Cloud Data Fusion
SQL transformations in BigQuery | Dataform
Scheduled DAG, dependencies | Cloud Composer
API orchestration, lightweight workflow | Workflows
Petabyte SQL analytics | BigQuery
Raw landing zone, files, archive | Cloud Storage
Governed data lake | Dataplex / BigLake
Global relational consistency | Spanner
Traditional MySQL/PostgreSQL/SQL Server | Cloud SQL
High-performance PostgreSQL-compatible | AlloyDB
Wide-column/time-series massive writes | Bigtable
Mobile/web document database | Firestore
Cache/session | Memorystore
Share curated datasets | Analytics Hub / authorized views
PII detection or de-identification | Cloud DLP / Sensitive Data Protection
Sensitive columns in BigQuery | Policy tags / masking
Regional compliance | Regional datasets/buckets and avoid cross-region copies
Wrong-but-plausible answers
BigQuery Admin for analysts: works technically but violates least privilege.
Naming conventions for security: not enforceable.
Cloud Composer for transformations: Composer orchestrates; Dataflow, Dataproc, BigQuery, Dataform, or Data Fusion transform.
Pub/Sub for storage: Pub/Sub is messaging, not a long-term database.
Cloud Storage for interactive analytics: good for files, but analytics need BigQuery/BigLake patterns.
BigQuery for OLTP: wrong for transactional application databases.
Cloud SQL for global scale: Spanner is usually the better answer.
Memorystore as source of truth: it is a cache.
Manual scripts: usually wrong when managed services provide orchestration, monitoring, retries, and governance.
More resources before optimization: often wrong for BigQuery performance and cost questions.
Common distractors
Overly broad IAM roles.
Exporting sensitive data to CSV.
Copying data across regions without compliance approval.
Persistent clusters for infrequent jobs.
Ignoring partitioning/clustering.
Ignoring dead-letter handling in streaming pipelines.
Ignoring schema evolution and data validation.
Using a custom VM service when a managed service exists.
Optimizing cost while ignoring stated reliability or latency requirements.
Elimination strategy
When stuck, eliminate answers in this order:
Security violations: broad admin roles, public buckets, no encryption/key control, raw PII exposure.
Requirement mismatch: batch service for streaming need, analytics database for OLTP, cache as durable store.
Cost/performance mismatch: no partitioning, no lifecycle management, persistent idle clusters.
Incomplete solution: answers that solve ingestion but not transformation, or storage but not access control.
8. Quick Memory Rules
Rules of thumb
If it says streaming events → Pub/Sub + Dataflow.
If it says existing Spark/Hadoop → Dataproc.
If it says SQL analytics at scale → BigQuery.
If it says raw files or landing zone → Cloud Storage.
If it says governed lake → Dataplex + BigLake/BigQuery.
If it says DAG orchestration → Cloud Composer.
If it says SQL transformation dependencies → Dataform.
If it says global relational strong consistency → Spanner.
If it says traditional relational database → Cloud SQL.
If it says high-performance PostgreSQL-compatible operational database → AlloyDB.
If it says massive time-series/key-value → Bigtable.
If it says mobile/web document database → Firestore.
If it says cache/session → Memorystore.
If it says PII detection → Cloud DLP / Sensitive Data Protection.
If it says sensitive BigQuery columns → policy tags and masking.
If it says curated sharing → Analytics Hub or authorized views.
If it says slow BI → partition, cluster, materialized views, BI Engine, then capacity.
If it says cost in BigQuery → reduce scanned bytes before adding slots.
If it says infrequent Spark jobs → ephemeral Dataproc clusters.
If it says mission critical pipeline → retries, idempotency, monitoring, alerts, dead-letter handling.
Fast service mapping
If you see... | Think...
Watermarks, windows, late data | Dataflow
Pub/sub messages, decoupled producers | Pub/Sub
Airflow DAG | Cloud Composer
SQL DAG/modeling | Dataform
Kafka/Spark/Hadoop legacy | Dataproc or integration path
Visual pipeline designer | Cloud Data Fusion
Data catalog/discovery/governance zones | Dataplex
Authorized subset of BigQuery data | Authorized views
Column classification | Policy tags
Data marketplace/sharing | Analytics Hub
Dashboard acceleration | BI Engine
Repeated aggregation | Materialized view or summary table
Capacity isolation | BigQuery reservations
CDC | Datastream
Database migration | Database Migration Service
Large appliance transfer | Transfer Appliance
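For self-drilling, part of the mapping above can be turned into a small lookup. This is only a study aid sketched in plain Python: the keyword list is a hand-picked subset of the table, and naive substring matching is an assumption that will miss paraphrased wording.

```python
# A hand-picked subset of the mapping above; extend it as you revise.
KEYWORD_TO_SERVICE = {
    "watermarks": "Dataflow",
    "late data": "Dataflow",
    "airflow dag": "Cloud Composer",
    "visual pipeline": "Cloud Data Fusion",
    "dashboard acceleration": "BI Engine",
    "column classification": "Policy tags",
    "capacity isolation": "BigQuery reservations",
    "large appliance transfer": "Transfer Appliance",
}


def suggest_services(question):
    """Return the services hinted at by keywords in the question, without repeats."""
    text = question.lower()
    found = []
    for keyword, service in KEYWORD_TO_SERVICE.items():
        if keyword in text and service not in found:
            found.append(service)
    return found


hints = suggest_services(
    "The pipeline must handle late data using watermarks, scheduled by an Airflow DAG."
)
```

Treat the result as a prompt for recall, not as an answer key; real questions still require checking the stated constraints.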
Mini decision frameworks
Processing framework
Need real-time? → Pub/Sub + Dataflow.
Need existing Spark/Hadoop? → Dataproc.
Need low-code ETL? → Cloud Data Fusion.
Need BigQuery SQL modeling? → Dataform.
Need orchestration? → Composer or Workflows.
Storage framework
Analytics SQL? → BigQuery.
Files/raw lake? → Cloud Storage.
Governed lakehouse? → BigLake/Dataplex.
Traditional relational? → Cloud SQL/AlloyDB.
Global relational scale? → Spanner.
Massive key-value/time-series? → Bigtable.
Document app data? → Firestore.
Cache? → Memorystore.
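The storage framework above can be written as an ordered rule list. This is a rough sketch: the flag names are hypothetical, and the ordering (most specific requirement first, so that global scale beats plain relational) is an assumption of this example rather than something the framework states.

```python
# Ordered from most specific to most general; the first matching flag wins.
STORAGE_RULES = [
    ("cache", "Memorystore"),
    ("document_app_data", "Firestore"),
    ("massive_key_value_or_time_series", "Bigtable"),
    ("global_relational_scale", "Spanner"),
    ("traditional_relational", "Cloud SQL/AlloyDB"),
    ("governed_lakehouse", "BigLake/Dataplex"),
    ("files_or_raw_lake", "Cloud Storage"),
    ("analytics_sql", "BigQuery"),
]


def choose_storage(requirements):
    """Return the first service whose flag appears in `requirements`, else None."""
    for flag, service in STORAGE_RULES:
        if flag in requirements:
            return service
    return None


choice = choose_storage({"traditional_relational", "global_relational_scale"})
```

Here the global-scale requirement takes precedence over the relational one, mirroring the "Cloud SQL for global scale" trap listed earlier.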
Security framework
Identify data sensitivity.
Apply least privilege IAM.
Separate raw and curated zones.
Use row/column controls.
Mask or de-identify PII.
Encrypt with Google-managed keys by default, or with CMEK when the scenario requires customer control of keys.
Audit access.
Keep data in compliant regions.
BigQuery optimization framework
Reduce scanned bytes.
Partition and cluster.
Avoid SELECT *.
Precompute repeated results.
Use materialized views/BI Engine for dashboards.
Use reservations/slots for predictable capacity.
Monitor jobs and cost.
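Why partitioning and avoiding SELECT * reduce scanned bytes can be shown with a toy model of a date-partitioned, columnar table. This is a simplified sketch: the byte sizes are invented, and real BigQuery accounting has details (such as clustering and minimum billed amounts) that are not modeled here.

```python
def scanned_bytes(partitions, columns=None, date_from=None, date_to=None):
    """Estimate bytes read from a date-partitioned, columnar table.

    `partitions` maps an ISO date string to {column_name: bytes_in_partition}.
    `columns=None` models SELECT *; a date range models partition pruning.
    """
    total = 0
    for day, column_sizes in partitions.items():
        # ISO dates compare correctly as strings, so pruning is a string test.
        if date_from is not None and day < date_from:
            continue
        if date_to is not None and day > date_to:
            continue
        for column, size in column_sizes.items():
            if columns is None or column in columns:
                total += size
    return total


# Invented sizes: three daily partitions with one small and one large column.
table = {
    "2024-01-01": {"user_id": 100, "payload": 900},
    "2024-01-02": {"user_id": 100, "payload": 900},
    "2024-01-03": {"user_id": 100, "payload": 900},
}
full_scan = scanned_bytes(table)
pruned = scanned_bytes(
    table, columns={"user_id"}, date_from="2024-01-03", date_to="2024-01-03"
)
```

In this invented example the filtered, single-column query reads a small fraction of the full scan, which is the reasoning behind optimizing design before adding slots.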
9. Final Revision Notes
Highest-yield review points
BigQuery is the center of analytics. Know partitioning, clustering, authorized views, row-level security, policy tags, masking, BI Engine, materialized views, reservations, and job troubleshooting.
Dataflow is the center of managed streaming and Beam-based batch processing. Know windows, triggers, late data, dead-letter handling, and idempotent writes.
Pub/Sub is ingestion and decoupling, not transformation or storage.
Cloud Composer orchestrates workflows, not heavy transformations.
Dataform is for BigQuery SQL transformation workflows.
Dataproc is mainly for Spark/Hadoop compatibility.
Dataplex is governance and discovery across distributed data assets.
Storage selection depends on access pattern, not brand familiarity.
Security answers should enforce controls, not rely on process or naming.
Migration answers should include planning, staged execution, validation, and cutover.
Cost answers should usually optimize design before increasing capacity.
Reliability answers should include retries, idempotency, monitoring, alerting, and recovery strategy.
Last-day revision list
Review these until they are automatic:
Pub/Sub + Dataflow + BigQuery for real-time analytics.
Cloud Storage + BigQuery for batch analytics.
Dataproc for existing Spark/Hadoop.
Composer for DAGs; Dataform for SQL transformations.
Dataplex for governed data platform and lake management.
BigQuery vs Bigtable vs Spanner vs Cloud SQL vs Firestore.
Policy tags vs row-level security vs authorized views.
DLP vs KMS vs IAM.
Partitioning vs clustering vs materialized views vs BI Engine vs reservations.
CDC with Datastream.
Migration tools: Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance.
Design for late data, bad records, retries, and idempotency.
Storing the data
BigQuery for analytics.
Cloud Storage for raw object storage.
BigLake/Dataplex for governed lakehouse.
Bigtable for wide-column massive low-latency workloads.
Spanner for global relational consistency.
Cloud SQL/AlloyDB for managed relational operational workloads.
Firestore for document app data.
Memorystore for cache.
Preparing and using data for analysis
Optimize BI with partitioning, clustering, materialized views, BI Engine.
Share with Analytics Hub and authorized views.
Protect PII with DLP, masking, and policy tags.
Use BigQuery ML for SQL-native ML; Vertex AI for advanced ML workflows.
Prepare RAG data with chunking, embeddings, metadata, and access controls.
Maintaining and automating data workloads
Automate with Composer, Dataform, Workflows, CI/CD.
Monitor with Cloud Monitoring and Logging.
Troubleshoot with job histories, logs, metrics, audit logs, and quotas.
Optimize cost before scaling resources.
Use retries, idempotency, checkpoints, and dead-letter patterns.
Build runbooks and alerts for failures.
Source Alignment Notes
This course was synthesized from the generated 1,100-question practice CSV and aligned to the current Google Cloud Professional Data Engineer standard exam guide. The repeated patterns in the source bank emphasized BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, Cloud Monitoring, Cloud Logging, cost optimization, governance, and reliability patterns.