Google Cloud Professional Data Engineer Exam Course

1. Exam Overview

What the exam is testing

The Google Cloud Professional Data Engineer exam validates whether you can design, build, operationalize, secure, monitor, optimize, and troubleshoot data processing systems on Google Cloud. The exam is not mainly a memorization test. It tests whether you can read a business scenario, identify the real constraint, eliminate tempting but wrong services, and choose the most managed, secure, reliable, scalable, and cost-effective Google Cloud architecture.

The current official standard exam guide organizes the exam into five domains:

Designing data processing systems
Ingesting and processing the data
Storing the data
Preparing and using data for analysis
Maintaining and automating data workloads

The standard exam is 2 hours, contains 40-50 multiple-choice and multiple-select questions, and is available in English and Japanese. The certification is valid for 2 years. Google also offers a shorter renewal exam for active certificate holders.

How to think like the exam

In most questions, the correct answer is the option that best balances the following priorities:

Meet the business requirement first. Do not optimize cost if the scenario says the system is mission critical and needs low latency or high availability.
Use managed services when possible. Prefer BigQuery, Dataflow, Pub/Sub, Dataplex, Cloud Composer, Cloud Data Fusion, or managed databases over self-managed infrastructure unless the scenario explicitly requires custom frameworks or legacy compatibility.
Use least privilege and governance by design. Correct answers often include IAM, service accounts, policy tags, authorized views, row-level security, Cloud KMS, VPC Service Controls, Dataplex, and audit logging.
Select the service by access pattern. Storage questions are usually about reads, writes, latency, consistency, scale, relational requirements, and analytics requirements.
Avoid operational burden. If two options work, the exam usually favors the one with less manual administration and fewer custom scripts.
Watch the words. Terms such as streaming, near real time, ACID, global consistency, time-series, low latency, petabyte analytics, serverless, batch, orchestration, CDC, data residency, and PII usually point to specific services.
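
As a study aid, the keyword cues above can be kept as a small lookup table. The mapping below is a hypothetical sketch assembled from the service descriptions in this course, not an official list. Real questions combine several cues, so treat each match as a hint to investigate, not as the answer.

```python
# Hypothetical keyword-to-service hints, drawn from this course's own tables.
KEYWORD_HINTS = {
    "streaming": "Pub/Sub + Dataflow",
    "near real time": "Pub/Sub + Dataflow",
    "acid": "Cloud SQL, AlloyDB, or Spanner",
    "global consistency": "Spanner",
    "time-series": "Bigtable",
    "petabyte analytics": "BigQuery",
    "orchestration": "Cloud Composer or Workflows",
    "cdc": "Datastream",
    "pii": "Sensitive Data Protection (Cloud DLP), policy tags, masking",
    "data residency": "Region-specific datasets and buckets",
}


def service_hints(scenario: str) -> list[str]:
    """Return a hint for every keyword that appears in the scenario text."""
    text = scenario.lower()  # naive substring match; good enough for flashcards
    return [hint for keyword, hint in KEYWORD_HINTS.items() if keyword in text]


print(service_hints("Store time-series sensor data and capture CDC from MySQL"))
```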

How to use this course

Use this file as a compressed revision guide. Start with the exam domains, then study the service-selection tables, then practice the architecture patterns and traps. The question bank behind this course repeatedly emphasizes BigQuery, Dataflow, Pub/Sub, Cloud Storage, Cloud SQL, Bigtable, Spanner, Dataplex, Cloud Composer, Dataform, IAM, policy tags, authorized views, Analytics Hub, BI Engine, and monitoring/optimization patterns. Those are the highest-yield services and decisions for this exam.

2. Exam Domains

Domain	Official Weight	Priority	What matters most
Designing data processing systems	~22%	Very high	Security, compliance, governance, reliability, portability, migrations, project/dataset/table architecture
Ingesting and processing the data	~25%	Highest	Dataflow, Pub/Sub, Beam, Dataproc, Data Fusion, pipeline design, batch vs streaming, orchestration, CI/CD
Storing the data	~20%	Very high	BigQuery, Cloud Storage, BigLake, Bigtable, Spanner, Cloud SQL, Firestore, Memorystore, data lake/warehouse design
Preparing and using data for analysis	~15%	Medium	BI patterns, BigQuery performance, BigQuery ML, data sharing, Analytics Hub, masking, DLP, reports
Maintaining and automating data workloads	~18%	High	Cost, reservations, Composer DAGs, monitoring, troubleshooting, fault tolerance, quotas, restarts, replication
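
To turn the weights into a rough question budget, scale them by the size of the exam form. This is a back-of-the-envelope sketch: the weights are the approximate values from the table, and the actual distribution on any given form may differ.

```python
# Approximate domain weights (percent) from the table above.
DOMAIN_WEIGHTS = {
    "Designing data processing systems": 22,
    "Ingesting and processing the data": 25,
    "Storing the data": 20,
    "Preparing and using data for analysis": 15,
    "Maintaining and automating data workloads": 18,
}


def expected_questions(total_questions: int) -> dict[str, float]:
    """Scale each domain weight to an expected question count."""
    return {d: total_questions * w / 100 for d, w in DOMAIN_WEIGHTS.items()}


for domain, count in expected_questions(50).items():
    print(f"{domain}: about {count:.1f} questions")
```

On a 50-question form this gives about 12.5 questions for ingesting and processing and about 7.5 for preparing and using data for analysis.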

Priority notes

The highest-value preparation sequence is:

Ingesting and processing, because it is the largest domain and appears in many scenario questions.
Designing systems, because it drives architecture, security, compliance, and migration decisions.
Storing data, because service selection is one of the most common exam traps.
Maintaining and automating, because the exam frequently asks how to reduce cost, automate pipelines, monitor failures, and recover safely.
Preparing and using data for analysis, because it often appears as BigQuery performance, BI, sharing, masking, and ML-readiness scenarios.

3. Start-to-Finish Study Path

Phase 1: Foundation

Learn the core Google Cloud data platform map:

Cloud Storage: durable object storage, landing zones, data lakes, lifecycle policies, raw files.
BigQuery: serverless data warehouse for analytics, SQL, partitioning, clustering, BI, sharing, ML.
Pub/Sub: asynchronous messaging and streaming ingestion backbone.
Dataflow: Apache Beam-based managed batch and streaming processing.
Dataproc: managed Spark/Hadoop for existing Spark, Hadoop, Hive, or custom ecosystem workloads.
Cloud Composer: managed Apache Airflow for DAG orchestration.
Dataform: SQL-based transformations and dependency management in BigQuery.
Cloud Data Fusion: visual/low-code ETL and integration.
Dataplex: data lake governance, cataloging, discovery, zones, policy-based governance.
Cloud SQL / AlloyDB / Spanner / Bigtable / Firestore / Memorystore: operational databases selected by workload pattern.

Foundation goal: when you see a scenario, quickly map it to the correct managed service.

Phase 2: Intermediate

Study decision patterns:

Batch vs streaming.
Data warehouse vs data lake vs lakehouse.
Relational vs NoSQL vs object storage.
OLTP vs OLAP.
Low latency serving vs analytical scanning.
Event ingestion vs workflow orchestration.
Transformation vs orchestration.
Governance enforcement vs naming conventions.
Regional data residency vs global access.
Persistent clusters vs ephemeral job clusters.

Intermediate goal: eliminate wrong answers based on access pattern, operational burden, security, and cost.

Phase 3: Advanced

Focus on architecture and tradeoffs:

End-to-end ingestion: source → Pub/Sub/Storage/Datastream → Dataflow/Data Fusion/Dataproc → BigQuery/BigLake/Cloud Storage.
Change data capture: Datastream writing to Cloud Storage or BigQuery, with downstream pipelines that merge the changes into curated tables.
Real-time analytics: Pub/Sub + Dataflow + BigQuery.
Historical migration: Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance.
Data governance: Dataplex + IAM + policy tags + row-level security + DLP + audit logs.
Cost and performance: partitioning, clustering, materialized views, BI Engine, reservations, query optimization.
Reliability: idempotent pipelines, dead-letter topics, retries, checkpoints, validation, monitoring, multi-region strategy.

Advanced goal: choose the best architecture under multiple constraints.

Phase 4: Final Review

During the last week:

Memorize service-selection rules.
Review BigQuery partitioning, clustering, authorized views, policy tags, BI Engine, reservations, and materialized views.
Review Dataflow streaming concepts: windows, triggers, late data, watermarks, exactly-once-style processing behavior, dead-letter handling, and idempotent sinks.
Review storage selection: BigQuery vs Bigtable vs Spanner vs Cloud SQL vs Firestore vs Cloud Storage.
Review security patterns: least privilege, service accounts, CMEK, DLP, row/column security, audit logging, VPC Service Controls.
Review operational patterns: Cloud Monitoring, Cloud Logging, Composer DAG retries, quotas, BigQuery job history, cost controls.

4. Core Concepts by Domain

Domain 1: Designing data processing systems

Concepts

This domain tests whether you can design secure, compliant, reliable, flexible, portable, and migration-ready data systems.

Key ideas:

Map business requirements to architecture before choosing tools.
Separate environments such as development, test, and production using projects, datasets, service accounts, and IAM boundaries.
Enforce least privilege at the narrowest practical level.
Govern sensitive data using policy tags, row-level security, authorized views, Cloud DLP, and Cloud KMS.
Design for data residency by choosing correct BigQuery dataset locations, Cloud Storage bucket locations, and regional services.
Build validation and reconciliation into migration plans.
Prefer managed, repeatable, auditable patterns over manual scripts.

Services

Requirement	Service or feature to think about
Fine-grained access in BigQuery	IAM, authorized views, row-level security, column-level security, policy tags
PII discovery or masking	Cloud DLP / Sensitive Data Protection, BigQuery masking policies
Encryption key control	Cloud KMS, CMEK
Governance and catalog	Dataplex, Dataplex Catalog
Data residency	Region-specific datasets and buckets
Bulk data transfer	Storage Transfer Service, Transfer Appliance
Database migration	Database Migration Service
CDC from databases	Datastream
Warehouse migration	BigQuery Data Transfer Service, staged loads, validation queries

Patterns

Secure analytics pattern

Store raw data in restricted datasets.
Apply least privilege IAM.
Use policy tags for sensitive columns.
Use row-level security for business unit or geography restrictions.
Expose curated datasets or authorized views to analysts.
Audit access with Cloud Audit Logs and BigQuery job history.
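
The interaction between row-level security and policy tags can be pictured with a toy model in plain Python. This is not the BigQuery API: `Analyst`, `COLUMN_TAGS`, and `curated_view` are hypothetical names, and replacing a protected value with `None` is only one possible masking behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Analyst:
    """Hypothetical principal: allowed regions plus policy tags they may read."""
    name: str
    regions: set[str]
    readable_tags: set[str] = field(default_factory=set)


# Hypothetical column-to-policy-tag assignment; untagged columns are open.
COLUMN_TAGS = {"email": "pii", "card_number": "pci"}


def curated_view(rows: list[dict], analyst: Analyst) -> list[dict]:
    """Filter rows by region, then null out columns behind unreadable tags."""
    visible = []
    for row in rows:
        if row["region"] not in analyst.regions:
            continue  # row-level security: drop rows outside the analyst's scope
        visible.append({
            col: (value if COLUMN_TAGS.get(col) is None
                  or COLUMN_TAGS[col] in analyst.readable_tags else None)
            for col, value in row.items()
        })
    return visible
```

An analyst scoped to one region and holding no tags sees only that region's rows with `email` nulled. A data steward with the `pii` tag and both regions sees everything.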

Data residency pattern

Keep raw data in the required region.
Avoid copying sensitive data across regions unless explicitly allowed.
Share only aggregated, anonymized, or policy-approved data when global reporting is needed.
Validate that BigQuery dataset location, Cloud Storage bucket location, and processing region align.
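
The final validation step can be sketched as a simple location comparison. The resource names and the exact-match rule are assumptions for illustration. A real check would read locations from the service APIs and would need an explicit policy for multi-regions.

```python
def residency_violations(required_region: str, resources: dict[str, str]) -> list[str]:
    """Return names of resources whose location differs from the required region."""
    return sorted(
        name for name, location in resources.items()
        if location.lower() != required_region.lower()
    )


# Hypothetical inventory: resource name -> configured location.
inventory = {
    "bq_dataset": "europe-west3",
    "gcs_bucket": "EUROPE-WEST3",
    "dataflow_job": "us-central1",
}
print(residency_violations("europe-west3", inventory))
```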

Migration pattern

Analyze current state and stakeholder requirements.
Choose migration tool based on source and volume.
Perform staged loads.
Reconcile row counts, checksums, and business aggregates.
Run parallel validation before cutover.
Switch consumers only after validation passes.
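
The reconciliation step can be sketched in plain Python as three checks: row count, an order-independent checksum, and one business aggregate. The helper names and the assumption that the amount is the last field of each row are illustrative only. In practice the same checks are usually expressed as validation queries on both systems.

```python
import hashlib


def table_fingerprint(rows: list[tuple]) -> dict:
    """Row count, order-independent checksum, and a sample business aggregate."""
    digest = 0
    for row in rows:
        row_hash = hashlib.sha256(repr(row).encode("utf-8")).digest()
        # Adding modulo 2**64 makes the checksum independent of row order.
        digest = (digest + int.from_bytes(row_hash[:8], "big")) % 2**64
    return {
        "row_count": len(rows),
        "checksum": digest,
        "total_amount": sum(row[-1] for row in rows),  # assumes amount is last
    }


def reconcile(source_rows: list[tuple], target_rows: list[tuple]) -> list[str]:
    """Return the names of checks that differ; an empty list means a match."""
    src, tgt = table_fingerprint(source_rows), table_fingerprint(target_rows)
    return [check for check in src if src[check] != tgt[check]]
```

An empty result from `reconcile` is the signal to proceed to cutover. Any named check in the result points at what to investigate first.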

Traps

Choosing naming conventions instead of IAM or policy enforcement.
Granting BigQuery Admin to analysts for convenience.
Copying regulated data to another region just to simplify reporting.
Migrating with a one-time copy and no validation.
Using Pub/Sub for historical bulk migration when Storage Transfer Service, BigQuery Data Transfer Service, Database Migration Service, or Transfer Appliance fits better.
Building custom governance scripts instead of using Dataplex, IAM, policy tags, DLP, and audit logs.

Domain 2: Ingesting and processing the data

Concepts

This is the largest exam domain. It tests whether you can plan, build, deploy, and operationalize batch and streaming pipelines.

Key ideas:

Pub/Sub ingests events. It is not a transformation engine or scheduler.
Dataflow transforms batch and streaming data. It is the default managed service for Apache Beam pipelines.
Dataproc runs Spark/Hadoop workloads. Use it for existing Spark/Hadoop/Hive ecosystems or custom cluster-level dependencies.
Cloud Data Fusion provides visual ETL. Use it when low-code integration and connectors matter.
Cloud Composer orchestrates workflows. It coordinates jobs; it should not perform heavy transformations inside DAG code.
Dataform manages SQL transformations in BigQuery. Use it for SQL modeling, dependencies, testing, and repeatable BigQuery transformations.

Services

Scenario	Best fit	Why
Streaming events from apps or devices	Pub/Sub	Decoupled, durable event ingestion
Real-time transformation and enrichment	Dataflow	Managed Beam streaming with windows, triggers, state, late data handling
Existing Spark jobs with custom libraries	Dataproc	Managed Spark/Hadoop compatibility
Low-code ETL with connectors	Cloud Data Fusion	Visual pipeline design and integration
SQL transformations in BigQuery	Dataform	Versioned SQL workflows and dependency management
Scheduling complex multi-step workflows	Cloud Composer	Airflow DAG orchestration
Lightweight service orchestration	Workflows	Serverless orchestration of APIs and services
CDC from operational databases	Datastream	Change streams for replication and analytics ingestion
Kafka integration	Pub/Sub Kafka connector or another managed integration pattern	Avoids self-managing Kafka clusters unless the scenario requires them

Patterns

Streaming analytics pattern

Producers publish events to Pub/Sub.
Dataflow reads Pub/Sub messages.
Dataflow validates, enriches, windows, and handles late data.
Invalid records go to a dead-letter topic or error table.
Clean results are written to BigQuery or Bigtable depending on access pattern.
Monitoring alerts on backlog, errors, latency, and failed workers.
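
The windowing, late-data, and dead-letter steps can be illustrated without any cloud dependency. The sketch below is a simplified simulation, not Apache Beam code: the watermark is just the highest event time seen so far, whereas real runners estimate it from the source, and `window_sums` is a hypothetical helper.

```python
from collections import defaultdict


def window_sums(events, window_size=60, allowed_lateness=30):
    """Sum values per fixed event-time window, processing in arrival order.

    events: iterable of dicts with "ts" (event time, seconds) and "value".
    Returns (sums keyed by window start, dead-letter records, dropped late records).
    """
    sums = defaultdict(int)
    dead_letter, dropped_late = [], []
    watermark = float("-inf")  # simplistic watermark: highest event time seen

    for event in events:
        if not isinstance(event.get("ts"), (int, float)) or "value" not in event:
            dead_letter.append(event)  # malformed: quarantine, do not fail the job
            continue
        window_start = event["ts"] // window_size * window_size
        window_end = window_start + window_size
        if window_end + allowed_lateness <= watermark:
            dropped_late.append(event)  # arrived after the lateness budget
            continue
        sums[window_start] += event["value"]
        watermark = max(watermark, event["ts"])
    return dict(sums), dead_letter, dropped_late
```

An event that arrives out of order but inside the lateness budget still updates its window. One that arrives after the budget is set aside instead of silently corrupting an already-reported result.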

Batch file ingestion pattern

Files land in Cloud Storage.
Cloud Composer or Eventarc triggers processing.
Dataflow, Dataproc, Data Fusion, or BigQuery loads transform the data.
BigQuery stores curated analytics tables.
Dataform manages SQL models and tests.

CDC analytics pattern

Datastream captures database changes.
Changes land in Cloud Storage or feed downstream pipelines.
Dataflow or BigQuery transformations merge updates into curated tables.
Validate latency, ordering, duplicates, and schema evolution.
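
The merge step is where ordering and duplicates matter most. The sketch below shows one way to make it idempotent in plain Python. The event fields (`key`, `op`, `seq`, `row`) are hypothetical and do not reflect the Datastream output schema. In BigQuery the same idea is typically written as a `MERGE` statement.

```python
def apply_changes(table: dict, changes: list[dict]) -> dict:
    """Merge CDC events into a table keyed by primary key.

    Each change has hypothetical fields: "key", "op" ("upsert" or "delete"),
    "seq" (a monotonically increasing source position), and "row".
    Events at or below the stored sequence number are skipped, which makes
    replays and duplicates harmless.
    """
    for change in sorted(changes, key=lambda c: c["seq"]):
        current = table.get(change["key"])
        if current is not None and change["seq"] <= current["seq"]:
            continue  # duplicate or out-of-date event
        if change["op"] == "delete":
            # Keep a tombstone so a stale upsert cannot resurrect the row.
            table[change["key"]] = {"seq": change["seq"], "row": None}
        else:
            table[change["key"]] = {"seq": change["seq"], "row": change["row"]}
    return table
```

Applying the same batch twice leaves the table unchanged, which is the property that makes safe reprocessing possible.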

Traps

Using Cloud Composer as the data processing engine instead of orchestrator.
Using Pub/Sub as a database or long-term store.
Choosing Dataproc for new simple serverless pipelines when Dataflow is more managed.
Choosing Dataflow for a legacy Spark job that must preserve Spark APIs and dependencies; Dataproc is usually better.
Ignoring late-arriving data in streaming questions.
Writing custom retry scripts instead of using managed retries, dead-letter topics, idempotent processing, and monitoring.
Loading malformed records directly into production tables instead of quarantine/error tables.

Domain 3: Storing the data

Concepts

This domain tests service selection and data platform design. Most storage questions are solved by identifying the access pattern.

Ask these questions:

Is it analytics or transactions?
Is it structured, semi-structured, unstructured, or file/object data?
Is the workload read-heavy, write-heavy, or mixed?
Is strong global consistency required?
Is millisecond key-value access required?
Is SQL relational modeling required?
Is horizontal scale more important than joins?
Is the primary access pattern large scans or point lookups?
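
These questions can be read as a decision procedure. The function below is a deliberately coarse sketch of that procedure, assuming one dominant access pattern per scenario. The parameter names are hypothetical, and real exam questions often hinge on a second constraint such as cost or operational burden.

```python
def pick_storage(*, analytics: bool = False, files: bool = False,
                 relational: bool = False, global_consistency: bool = False,
                 key_value_low_latency: bool = False,
                 documents: bool = False, cache: bool = False) -> str:
    """First-pass storage suggestion from access-pattern answers."""
    if cache:
        return "Memorystore"
    if files:
        # Governed SQL over lake files points to BigLake on top of the bucket.
        return "Cloud Storage with BigLake" if analytics else "Cloud Storage"
    if analytics:
        return "BigQuery"
    if relational:
        return "Spanner" if global_consistency else "Cloud SQL or AlloyDB"
    if key_value_low_latency:
        return "Bigtable"
    if documents:
        return "Firestore"
    return "Unclear: restate the access pattern"


print(pick_storage(relational=True, global_consistency=True))
```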

Services

Service	Use when	Avoid when
BigQuery	Petabyte-scale SQL analytics, BI, ELT, warehouse, federated analytics	OLTP transactions, low-latency point updates, application serving database
Cloud Storage	Raw files, landing zone, data lake objects, backups, archives	Relational queries, high-frequency row updates, transactional workloads
BigLake	Governed lakehouse access over data lakes and BigQuery	Simple object storage without governance needs
Bigtable	Massive scale, low-latency key-value/wide-column, time-series, IoT, high write throughput	Ad hoc SQL analytics, joins, transactions across rows
Spanner	Globally scalable relational database with strong consistency and high availability	Simple single-region relational apps where Cloud SQL is enough
Cloud SQL	Managed MySQL/PostgreSQL/SQL Server for traditional relational apps	Global horizontal relational scale or massive analytics
AlloyDB	High-performance PostgreSQL-compatible operational workloads	Non-PostgreSQL workloads or analytical warehouse use cases
Firestore	Serverless document database for mobile/web apps with flexible documents	Analytical scans, relational joins, warehouse workloads
Memorystore	Managed Redis/Memcached caching, session state, low-latency cache	Durable source of truth or analytics
Dataplex	Governed data lake/platform management, cataloging, zones	Replacing storage or processing engines

Patterns

Warehouse pattern

Use BigQuery for curated analytical tables.
Partition by time or ingestion date for large time-based data.
Cluster by frequently filtered/joined columns.
Use materialized views or summary tables for repeated expensive aggregations.
Use authorized views, row-level security, and policy tags for controlled access.

Lakehouse pattern

Land raw data in Cloud Storage.
Govern discovery and access with Dataplex and BigLake.
Process raw to curated zones using Dataflow, Dataproc, Data Fusion, or BigQuery.
Serve analytics in BigQuery.

Operational serving pattern

Use Cloud SQL or AlloyDB for traditional relational application databases.
Use Spanner for global relational scale with strong consistency.
Use Bigtable for extremely high-throughput key-value/time-series workloads.
Use Firestore for serverless document-based mobile/web apps.
Use Memorystore for caching, not durable storage.

Traps

Choosing BigQuery for user-facing low-latency transactional workloads.
Choosing Cloud Storage alone when users need SQL analytics and BI, without adding BigQuery or BigLake access on top of it.
Choosing Cloud SQL for global scale and multi-region strong consistency when Spanner is the better fit.
Choosing Bigtable when the question requires joins, SQL, or multi-row transactions.
Choosing Firestore for analytical reporting.
Treating Memorystore as durable storage.
Ignoring lifecycle policies and storage class cost optimization for Cloud Storage.

Domain 4: Preparing and using data for analysis

Concepts

This domain tests whether you can prepare data for BI, ML, sharing, visualization, and secure analysis.

Key ideas:

BigQuery is central for analytical preparation.
Performance tuning often involves partitioning, clustering, pruning, query rewrite, materialized views, BI Engine, and avoiding SELECT *.
Security for analysis often uses row-level security, column-level security, policy tags, authorized views, masking, IAM, and DLP.
BigQuery ML is useful when the model can be trained and used directly in BigQuery with SQL.
Vertex AI is more appropriate for advanced custom ML workflows, feature stores, training pipelines, deployment, and MLOps.
Analytics Hub is used for controlled data sharing and publishing datasets.

Services

Requirement	Best fit
Fast BI dashboard over BigQuery	BI Engine, materialized views, aggregated tables, partitioning/clustering
Repeated expensive aggregations	Materialized views or scheduled summary tables
Controlled dataset sharing	Analytics Hub, authorized views, dataset access controls
Mask or classify PII	Cloud DLP / Sensitive Data Protection, policy tags, masking
SQL-based ML directly on warehouse data	BigQuery ML
Advanced custom ML lifecycle	Vertex AI
Prepare unstructured text for RAG	Embeddings, vector search patterns, preprocessing pipelines, governed storage

Patterns

BI performance pattern

Partition large fact tables by date.
Cluster by high-cardinality filter or join columns where useful.
Precompute common aggregates.
Use materialized views for repeated deterministic aggregations.
Enable BI Engine for interactive dashboards.
Avoid scanning unnecessary columns and partitions.

Secure sharing pattern

Publish curated datasets rather than raw sensitive data.
Use Analytics Hub for managed sharing.
Use authorized views to expose limited views.
Use policy tags and masking for sensitive columns.
Use row-level security for tenant, region, or department filtering.

ML preparation pattern

Use BigQuery to clean, join, and prepare structured features.
Use BigQuery ML for SQL-native models and simple forecasting/classification/regression.
Use Vertex AI for custom training, feature management, pipelines, endpoints, and MLOps.
For unstructured data/RAG, prepare chunking, metadata, embeddings, access controls, and retrieval quality validation.

Traps

Solving every slow dashboard with more slots before optimizing table design and query patterns.
Exposing raw datasets when curated views or shared listings are safer.
Using BigQuery ML for complex custom ML lifecycle requirements better served by Vertex AI.
Forgetting PII masking and column-level controls in analytics environments.
Using CSV exports as the primary sharing mechanism when Analytics Hub or BigQuery sharing is better.

Domain 5: Maintaining and automating data workloads

Concepts

This domain tests operational excellence: automation, cost, monitoring, troubleshooting, capacity, fault tolerance, and repeatability.

Key ideas:

Use Cloud Composer for DAG-based orchestration.
Use Dataform for repeatable SQL transformations in BigQuery.
Use Cloud Monitoring and Cloud Logging for observability.
Use BigQuery admin tools, job history, INFORMATION_SCHEMA, audit logs, reservations, and slot metrics for BigQuery troubleshooting.
Use reservations and Editions for predictable BigQuery capacity management.
Use partitioning, clustering, query optimization, and lifecycle policies before blindly scaling resources.
Use retries, idempotency, checkpoints, dead-letter topics, and alerting for failure management.

Services

Operational need	Service or feature
DAG scheduling and dependencies	Cloud Composer
SQL transformation dependencies	Dataform
API/service orchestration	Workflows
Pipeline metrics and alerts	Cloud Monitoring
Logs and error analysis	Cloud Logging
BigQuery troubleshooting	BigQuery admin panel, job history, INFORMATION_SCHEMA, audit logs
Capacity management	BigQuery Editions, reservations, slots
Cost controls	Budgets, labels, partitioning, clustering, lifecycle policies, reservations
Fault tolerance	Retries, checkpoints, dead-letter queues, idempotent writes, regional design

Patterns

Reliable DAG pattern

Keep DAG tasks small and idempotent.
Use retries with backoff.
Store secrets in Secret Manager, not in code.
Use service accounts with least privilege.
Monitor SLA misses, retries, and failures.
Do not run large transformations inside the scheduler process.
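
Retry with backoff is easy to get subtly wrong, so a minimal sketch may help. In Airflow this behavior is normally configured through task arguments such as `retries` and `retry_delay` instead of being hand-written. The helper below is a hypothetical stand-alone illustration of the schedule itself.

```python
import time


def run_with_retries(task, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run task(); on failure wait base_delay * 2**attempt, then retry.

    `sleep` is injectable so the schedule can be tested without waiting.
    Re-raises the last error once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Retries are only safe when the task is idempotent: a task that appends rows on every attempt will duplicate data each time it is retried.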

BigQuery cost pattern

Partition large tables by date or ingestion time.
Cluster when filters repeatedly use specific columns.
Avoid SELECT *.
Use dry runs and query estimates.
Use materialized views or summary tables for repeated calculations.
Use reservations for predictable capacity or workload isolation.
Use labels for chargeback and monitoring.
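
The effect of partition pruning and of avoiding SELECT * can be shown with simple arithmetic. The byte counts below are invented for illustration and the model ignores clustering and compression. It only captures the idea that a columnar, partitioned table lets a query skip whole partitions and columns.

```python
# Hypothetical table layout: bytes stored per (partition date, column).
TABLE_BYTES = {
    "2024-01-01": {"user_id": 100, "event": 300, "payload": 4000},
    "2024-01-02": {"user_id": 120, "event": 310, "payload": 4200},
    "2024-01-03": {"user_id": 90, "event": 280, "payload": 3900},
}


def bytes_scanned(columns=None, dates=None) -> int:
    """Estimate scanned bytes; None means no pruning (SELECT * or no date filter)."""
    total = 0
    for date, column_bytes in TABLE_BYTES.items():
        if dates is not None and date not in dates:
            continue  # partition pruned by the date filter
        for column, size in column_bytes.items():
            if columns is None or column in columns:
                total += size
    return total


print(bytes_scanned())  # full scan
print(bytes_scanned(columns={"user_id", "event"}, dates={"2024-01-02"}))
```

With these made-up numbers the full scan reads 13300 bytes, while selecting two columns from one partition reads 430.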

Failure recovery pattern

Detect with Cloud Monitoring and Logging.
Quarantine bad data.
Use idempotent reprocessing.
Use checkpoints or replayable sources.
Validate output completeness and quality.
Alert owners and maintain runbooks.
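
Quarantine, idempotent reprocessing, and checkpoints fit together as shown in the sketch below. It assumes a replayable in-memory source and a keyed sink, both hypothetical stand-ins for something like a Pub/Sub subscription with replay and a table with a primary key.

```python
def process_from_checkpoint(source, checkpoint, sink, handler):
    """Resume from checkpoint["offset"], writing idempotently keyed output.

    source: replayable list of records; sink: dict keyed by record id, so
    reprocessing a record overwrites instead of duplicating.
    Bad records are quarantined and skipped; they do not halt the run.
    """
    quarantine = []
    for offset in range(checkpoint.get("offset", 0), len(source)):
        record = source[offset]
        try:
            sink[record["id"]] = handler(record)
        except (KeyError, ValueError) as error:
            quarantine.append((offset, record, repr(error)))
        checkpoint["offset"] = offset + 1  # advance only after the record is handled
    return quarantine
```

A restart with the saved checkpoint does no extra work, and a full replay from offset 0 produces the same sink contents, so either recovery path is safe.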

Traps

Using cron on a VM instead of Cloud Composer or managed scheduling for critical pipelines.
Scaling BigQuery slots without checking partition pruning, clustering, and query design.
Using persistent Dataproc clusters for infrequent jobs when ephemeral clusters reduce cost.
Ignoring quotas and billing alerts.
Not designing pipelines to restart safely.
Monitoring only infrastructure metrics while ignoring data quality and business-level pipeline metrics.

5. Service Selection Guide

BigQuery vs Cloud Storage vs BigLake

Need	Choose	Why
SQL analytics over structured data	BigQuery	Serverless warehouse optimized for analytical SQL
Raw files and low-cost durable object storage	Cloud Storage	Best landing zone and data lake object store
Governed analytics over lake data	BigLake	Provides lakehouse-style governance and BigQuery integration
BI dashboards with interactive SQL	BigQuery + BI Engine	Optimized for analytics and BI acceleration
Archive historical files cheaply	Cloud Storage lifecycle classes	Cost-effective long-term object storage

Dataflow vs Dataproc vs Cloud Data Fusion vs Dataform

Need	Choose	Do not choose
New batch/stream processing with managed Beam	Dataflow	Dataproc, unless Spark/Hadoop compatibility is required
Existing Spark/Hadoop/Hive jobs	Dataproc	Dataflow, if rewriting would add risk
Visual ETL with connectors and low-code development	Cloud Data Fusion	Hand-coded Dataflow, unless custom code is required
SQL transformations in BigQuery	Dataform	Composer DAG code for SQL dependency modeling
Heavy transformation logic in a scheduled workflow	Dataflow/Dataproc/BigQuery/Dataform, triggered by the scheduler	Cloud Composer alone

Pub/Sub vs Cloud Composer vs Workflows

Need	Choose	Why
Event ingestion and decoupling	Pub/Sub	Messaging backbone for asynchronous events
Scheduled DAGs with dependencies	Cloud Composer	Managed Apache Airflow
Lightweight API orchestration	Workflows	Serverless orchestration without Airflow overhead
Streaming transformation	Dataflow reading Pub/Sub	Pub/Sub alone does not transform

Cloud SQL vs AlloyDB vs Spanner

Need	Choose	Why
Traditional managed relational database	Cloud SQL	Simple managed MySQL/PostgreSQL/SQL Server
High-performance PostgreSQL-compatible workload	AlloyDB	Better performance and availability for PostgreSQL-compatible apps
Globally distributed relational database with strong consistency	Spanner	Horizontal scale, global availability, strong consistency
Analytical warehouse	BigQuery	Operational databases are not optimized for petabyte analytics

Bigtable vs Firestore vs Memorystore

Need	Choose	Why
Massive key-value/wide-column reads/writes, time-series, IoT	Bigtable	Low-latency high-throughput NoSQL at scale
Mobile/web document database	Firestore	Serverless document model and app synchronization patterns
Cache/session store	Memorystore	Managed Redis/Memcached low-latency cache
SQL analytics	BigQuery	NoSQL/caches are wrong for warehouse analytics

BigQuery security features

Requirement	Feature
Hide sensitive columns	Policy tags, column-level security, masking
Restrict rows by user/tenant/region	Row-level security
Share only a curated subset	Authorized views
Detect/classify sensitive data	Cloud DLP / Sensitive Data Protection
Control encryption keys	Cloud KMS / CMEK
Audit usage	Cloud Audit Logs, BigQuery job history

BigQuery performance and cost features

Requirement	Feature
Reduce scanned data by date	Partitioning
Improve repeated filters/joins	Clustering
Accelerate repeated aggregations	Materialized views or summary tables
Improve dashboard performance	BI Engine
Predictable capacity and isolation	Reservations, slots, BigQuery Editions
Avoid accidental large scans	Dry runs, maximum bytes billed, query review

6. Architecture Patterns

Pattern 1: Real-time analytics pipeline

Scenario: App events must appear in dashboards within seconds or minutes.

Recommended solution: Pub/Sub → Dataflow → BigQuery → Looker/BI tool.

Why: Pub/Sub handles ingestion and decoupling. Dataflow handles streaming transformation, windows, late data, and enrichment. BigQuery serves analytics.