Cert-Pass
Log in Sign up
calendar_todayMay 29, 2026 schedule3 min read

GCP Data Engineer Streaming and Dataflow Guide 2026: Pub/Sub,

Master GCP Data Engineer streaming concepts. Pub/Sub, Dataflow windows, watermarks, late data, dead-letter topics, and real-time pipeline design.

gcp data engineer streaming gcp dataflow gcp pub/sub dataflow windows gcp streaming analytics
Google

GCP Professional Data Engineer

Practice Now
GCP Data Engineer Streaming and Dataflow Guide 2026: Pub/Sub,

Streaming data is a core part of the GCP Professional Data Engineer exam. You need to know Pub/Sub for ingestion, Dataflow for processing, and the windowing/watermark concepts that make streaming work reliably. This guide covers every streaming pattern that shows up on the exam.

The Streaming Stack The default GCP streaming analytics pipeline: Producers (apps, IoT, logs) โ†“ Pub/Sub (event ingestion + decoupling) โ†“ Dataflow (transformation, windowing, enrichment) โ†“ BigQuery (analytics) or Bigtable (serving) โ†“ Dead-letter topic (failed records) Pub/Sub ingests and decouples. Dataflow transforms and windows. BigQuery stores for analytics. This three-service pattern is the exam's default answer for any "real-time analytics" scenario.

Pub/Sub Deep Dive Pub/Sub concepts tested: | Concept | What It Is | Exam Relevance | |

|

|

| | Topic | Named resource that receives messages | Publishers send to topics | | Subscription | Named resource that receives messages from a topic | Pull or push delivery | | Message | Data + attributes + message ID + publish time | Max 10 MB per message | | Acknowledgment | Subscriber confirms processing | Unacked messages are redelivered | | Exactly-once | Deduplication of messages | Newer feature, tested on exam | | Ordering | Messages delivered in order per ordering key | Must be explicitly enabled | | Dead-letter topic | Topic for messages that fail processing | After max delivery attempts | | Retention | Messages retained for up to 31 days (default 7) | Late subscribers can catch up | Pull vs Push subscriptions:: Pull: subscriber requests messages (more control, common with Dataflow): Push: Pub/Sub sends messages to an HTTPS endpoint (simpler, good for webhooks) Exam pattern: "A Dataflow pipeline reads from Pub/Sub and processes events. What subscription type should be used?" Pull. Dataflow uses pull subscriptions. Dead-letter topics are tested. After N failed delivery attempts (default 5), messages go to the dead-letter topic instead of being retried forever. This is the answer for "what happens to messages that repeatedly fail processing?"

Dataflow and Apache Beam Concepts Dataflow is Google's managed Apache Beam service. The exam tests Beam concepts applied through Dataflow. Key Beam concepts: | Concept | What It Does | |

|

| | PCollection | Distributed dataset (the data being processed) | | PTransform | Operation that transforms a PCollection | | Pipeline | The full data processing graph | | Runner | Execution engine (Dataflow runner for GCP) | Batch vs Streaming in Dataflow:: Batch: process bounded data (files, database dumps): Streaming: process unbounded data (Pub/Sub, Kafka): The same pipeline code can run in both modes (Beam's unified model) Exam pattern: "A company needs to process a continuous stream of clickstream events from Pub/Sub." Dataflow in streaming mode with a Pub/Sub source.

Windowing Windowing divides a continuous stream into finite chunks for aggregation. The exam tests three window types: | Window Type | How It Works | Example | |

|

|

| | Fixed (tumbling) | Non-overlapping time intervals | Count events per 1-minute window | | Sliding | Overlapping time intervals | Average over last 5 minutes, updated every minute | | Session | Activity-based gaps | Group events with <30 min gap into one session | Window assignment: Each event is assigned to one or more windows based on its event time (not processing time). Exam pattern: "Count the number of events per minute." Fixed (tumbling) window of 1 minute. Not "session window" (that's for activity-based grouping). Exam pattern: "Calculate a rolling average over the last 5 minutes, updated every minute." Sliding window of 5-minute period, 1-minute slide.

Watermarks and Late Data Watermark is Dataflow's estimate of how complete the data is. It represents "we expect all data up to this time has arrived." As time progresses, the watermark advances. Late data arrives after the watermark has passed its event time. By default, late data is dropped. Handling late data (tested on exam): 1. Allowed lateness: configure how long to wait for late data (e.g., 10 minutes) 2. Triggers: determine when to emit results (on watermark, on count, etc.) 3. Accumulation mode: discard (first result only) or accumulate (update with late data) Exam pattern: "A streaming pipeline misses some events that arrive 5 minutes late. What should be configured?" Increase the allowed lateness window. Not "add more workers" or "use a larger Pub/Sub subscription." The exam trap: "Use processing-time windows instead of event-time windows to handle late data." Wrong. Processing-time windows don't solve late data: they ignore it. Use event-time windows with allowed lateness and watermarks.

Triggers Triggers determine when to emit windowed results: | Trigger | When It Fires | |

|

| | AfterWatermark | When the watermark passes the window end (default) | | AfterProcessingTime | After some processing time has elapsed | | AfterCount | After N elements have been received | | Repeatedly | Fire on every trigger condition | Exam pattern: "Results should be emitted as early as possible after 100 events arrive, even if the watermark hasn't passed the window end." AfterCount(100) trigger.

Exactly-Once Processing Dataflow provides exactly-once processing semantics when:: Using Pub/Sub as source: Using BigQuery as sink: Pipeline is configured correctly This means each event is processed exactly once, even if workers fail and restart. Exam pattern: "A company needs each event to be processed exactly once, even during worker failures." Dataflow with Pub/Sub source and exactly-once enabled. Not "write custom deduplication logic."

Error Handling in Streaming Pipelines Dead-letter pattern: 1. Events that fail processing go to a dead-letter topic (Pub/Sub) or dead-letter table (BigQuery) 2. Failed events don't block the main pipeline 3. Failed events can be analyzed and reprocessed Idempotent sinks are tested. Since Pub/Sub provides at-least-once delivery, the sink must handle duplicates. BigQuery with WRITE_APPEND and merge logic, or using unique event IDs. Exam pattern: "A streaming pipeline writes to BigQuery. Some events are duplicated due to Pub/Sub's at-least-once delivery. What should the sink configuration handle?" Idempotent writes using merge logic or deduplication.

Real-Time Pipeline Design Pattern The full pattern the exam expects: 1. Event producers publish to Pub/Sub topic 2. Dataflow pipeline reads from Pub/Sub (pull subscription) 3. Dataflow applies:: Parsing and validation: Enrichment (dimension lookups): Windowing (fixed, sliding, or session): Watermark tracking and allowed lateness: Aggregation 4. Valid results written to BigQuery 5. Failed records to dead-letter Pub/Sub topic 6. Cloud Monitoring alerts on backlog, errors, and latency Exam pattern: "Design a real-time dashboard showing clicks per minute from a website." This full pattern above. Not "batch load files every hour": that's not real-time.

Frequently Asked Questions ### What is the gcp data engineer streaming and dataflow? The gcp data engineer streaming and dataflow is a professional certification that validates your skills. It is recognized by employers globally and can significantly boost your career prospects. ### How much does the gcp data engineer streaming and dataflow exam cost? Exam costs vary by provider. AWS exams typically cost 100 to 300 USD. Microsoft exams cost 165 USD. Google Cloud exams cost 200 USD. Check the official provider page for current pricing. ### How long should I study for the gcp data engineer streaming and dataflow? Most people need 4 to 8 weeks of consistent study. If you have hands-on experience, 4 weeks may be enough. If you are new to the platform, plan for 8 to 12 weeks.

Related Articles - AWS Cloud Practitioner study guide - AWS SAA-C03 exam guide - AWS practice questions - Azure AI-102 study guide - Azure AI practice questions - DP-700 study guide - GCP Cloud Architect guide - GCP Data Engineer patterns - GCP practice questions

school

Cert-Pass Editorial Team

Cloud certification experts helping IT professionals pass their exams with confidence.

link Related Exam Resources

Expert-Crafted Study Guide

Everything You Need to Pass GCP Professional Data Engineer: Visualized

GCP Professional Data Engineer certification preparation infographic

Put your knowledge to the test

Practice with real exam questions, track your progress, and pass with confidence.

quiz Start Practicing Free