Streaming data is a core part of the GCP Professional Data Engineer exam. You need to know Pub/Sub for ingestion, Dataflow for processing, and the windowing/watermark concepts that make streaming work reliably. This guide covers every streaming pattern that shows up on the exam.
## The Streaming Stack

The default GCP streaming analytics pipeline:

Producers (apps, IoT, logs)
→ Pub/Sub (event ingestion + decoupling)
→ Dataflow (transformation, windowing, enrichment)
→ BigQuery (analytics) or Bigtable (serving)
→ Dead-letter topic (failed records)

Pub/Sub ingests and decouples. Dataflow transforms and windows. BigQuery stores for analytics. This three-service pattern is the exam's default answer for any "real-time analytics" scenario.
## Pub/Sub Deep Dive

Pub/Sub concepts tested:

| Concept | What It Is | Exam Relevance |
|---|---|---|
| Topic | Named resource that receives messages | Publishers send to topics |
| Subscription | Named resource that receives messages from a topic | Pull or push delivery |
| Message | Data + attributes + message ID + publish time | Max 10 MB per message |
| Acknowledgment | Subscriber confirms processing | Unacked messages are redelivered |
| Exactly-once | Deduplication of messages | Newer feature, tested on exam |
| Ordering | Messages delivered in order per ordering key | Must be explicitly enabled |
| Dead-letter topic | Topic for messages that fail processing | After max delivery attempts |
| Retention | Messages retained for up to 31 days (default 7) | Late subscribers can catch up |

Pull vs Push subscriptions:

- Pull: subscriber requests messages (more control, common with Dataflow)
- Push: Pub/Sub sends messages to an HTTPS endpoint (simpler, good for webhooks)

Exam pattern: "A Dataflow pipeline reads from Pub/Sub and processes events. What subscription type should be used?" Pull. Dataflow uses pull subscriptions.

Dead-letter topics are tested. After N failed delivery attempts (default 5), messages go to the dead-letter topic instead of being retried forever. This is the answer for "what happens to messages that repeatedly fail processing?"
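The dead-letter behavior can be sketched with a toy model. This is plain Python, not the Pub/Sub client library: `ToySubscription`, `deliver`, and `parse_event` are hypothetical names invented for illustration, and the only detail taken from the text above is the default of 5 delivery attempts.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubscription:
    """Toy model of a subscription with a dead-letter policy (not the real API)."""
    max_delivery_attempts: int = 5
    acked: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)

    def deliver(self, message, handler):
        """Redeliver until the handler succeeds or the attempt limit is hit.

        Returns the number of delivery attempts that were made.
        """
        for attempt in range(1, self.max_delivery_attempts + 1):
            try:
                handler(message)
            except Exception:
                continue  # no ack, so the message is redelivered
            self.acked.append(message)
            return attempt
        # Every allowed attempt failed: forward to the dead-letter topic.
        self.dead_letter.append(message)
        return self.max_delivery_attempts


def parse_event(message):
    """Hypothetical handler: succeeds on integers, raises on anything else."""
    int(message)


sub = ToySubscription()
for msg in ["1", "2", "not-a-number", "4"]:
    sub.deliver(msg, parse_event)

print(sub.acked)        # ['1', '2', '4']
print(sub.dead_letter)  # ['not-a-number']
```

The poison message stops blocking the subscription after five attempts, and it is still available on the dead-letter side for inspection or replay.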
## Dataflow and Apache Beam Concepts

Dataflow is Google's managed Apache Beam service. The exam tests Beam concepts applied through Dataflow.

Key Beam concepts:

| Concept | What It Does |
|---|---|
| PCollection | Distributed dataset (the data being processed) |
| PTransform | Operation that transforms a PCollection |
| Pipeline | The full data processing graph |
| Runner | Execution engine (Dataflow runner for GCP) |

Batch vs Streaming in Dataflow:

- Batch: process bounded data (files, database dumps)
- Streaming: process unbounded data (Pub/Sub, Kafka)
- The same pipeline code can run in both modes (Beam's unified model)

Exam pattern: "A company needs to process a continuous stream of clickstream events from Pub/Sub." Dataflow in streaming mode with a Pub/Sub source.
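As a loose analogy for the unified model, the sketch below applies one transform to a bounded input and to an unbounded one. It uses plain Python generators rather than the Beam SDK, and `enrich` and `unbounded_events` are hypothetical names made up for this example.

```python
import itertools


def enrich(events):
    """One transform, written once: keep clicks and tag them with a source."""
    for event in events:
        if event["type"] == "click":
            yield {**event, "source": "web"}


# Bounded input (batch): a finite list, e.g. rows read from a file.
bounded = [{"id": 1, "type": "click"}, {"id": 2, "type": "view"}]
batch_result = list(enrich(bounded))


def unbounded_events():
    """Hypothetical endless source standing in for a Pub/Sub subscription."""
    for i in itertools.count(1):
        yield {"id": i, "type": "click" if i % 2 else "view"}


# Unbounded input (streaming): results are consumed incrementally,
# so only the first three outputs are taken here.
stream_result = list(itertools.islice(enrich(unbounded_events()), 3))

print(batch_result)
print([e["id"] for e in stream_result])
```

The transform never needs to know whether its input ends, which is the property the exam is pointing at when it says the same pipeline code runs in both modes.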
## Windowing

Windowing divides a continuous stream into finite chunks for aggregation. The exam tests three window types:

| Window Type | How It Works | Example |
|---|---|---|
| Fixed (tumbling) | Non-overlapping time intervals | Count events per 1-minute window |
| Sliding | Overlapping time intervals | Average over last 5 minutes, updated every minute |
| Session | Activity-based gaps | Group events with <30 min gap into one session |

Window assignment: Each event is assigned to one or more windows based on its event time (not processing time).

Exam pattern: "Count the number of events per minute." Fixed (tumbling) window of 1 minute. Not "session window" (that's for activity-based grouping).

Exam pattern: "Calculate a rolling average over the last 5 minutes, updated every minute." Sliding window with a 5-minute window size and a 1-minute slide (Beam calls the slide interval the period).
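The three window types can be sketched as plain arithmetic on event timestamps. This is a minimal illustration, not Beam code: the function names are hypothetical, times are integer seconds, and windows are half-open `[start, end)` intervals by assumption.

```python
def fixed_window(event_time, size):
    """Return the single [start, end) fixed window containing event_time."""
    start = event_time - (event_time % size)
    return (start, start + size)


def sliding_windows(event_time, size, period):
    """Return every [start, end) sliding window containing event_time.

    A new window starts every `period` seconds and lasts `size` seconds.
    """
    windows = []
    start = event_time - (event_time % period)  # latest window start <= event
    while start > event_time - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def session_windows(event_times, gap):
    """Group event times into sessions split by `gap` seconds of inactivity."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)  # still within the gap: same session
        else:
            sessions.append([t])    # gap exceeded: start a new session
    return sessions


# An event at t=725s with 1-minute fixed windows lands in exactly one window.
print(fixed_window(725, 60))  # (720, 780)

# The same event with 5-minute windows sliding every minute lands in five.
print(len(sliding_windows(725, 300, 60)))  # 5

# A 30-minute gap splits these events into two sessions.
print(session_windows([0, 600, 1500, 4000, 4100], 1800))
# [[0, 600, 1500], [4000, 4100]]
```

With sliding windows each event belongs to size / period windows, which is why a sliding aggregation costs more than a fixed one over the same stream.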
## Watermarks and Late Data

A watermark is Dataflow's estimate of how complete the data is. It represents "we expect that all data up to this time has arrived." As time progresses, the watermark advances.

Late data arrives after the watermark has passed its event time. By default, allowed lateness is zero, so late data is dropped.

Handling late data (tested on exam):

1. Allowed lateness: configure how long to wait for late data (e.g., 10 minutes)
2. Triggers: determine when to emit results (on watermark, on count, etc.)
3. Accumulation mode: discarding (each firing emits only the elements that arrived since the previous firing) or accumulating (each firing emits the full result so far, updated with late data)

Exam pattern: "A streaming pipeline misses some events that arrive 5 minutes late. What should be configured?" Increase the allowed lateness. Not "add more workers" or "use a larger Pub/Sub subscription."

The exam trap: "Use processing-time windows instead of event-time windows to handle late data." Wrong. Processing-time windows don't solve late data; they ignore event time altogether, so an event lands in whichever window is open when it arrives. Use event-time windows with allowed lateness and watermarks.
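The interaction between the watermark and allowed lateness can be sketched as a small classifier. This is a toy model rather than Dataflow's implementation: `classify` is a hypothetical helper, it assumes fixed windows, and it measures lateness against the end of the event's window.

```python
def classify(event_time, window_size, watermark, allowed_lateness=0):
    """Classify an arriving event as 'on_time', 'late', or 'dropped'.

    All values are event-time seconds. The event belongs to the fixed
    window that ends at window_end. It is late once the watermark has
    passed that end, and dropped once the watermark is more than
    allowed_lateness past it.
    """
    window_end = event_time - (event_time % window_size) + window_size
    if watermark < window_end:
        return "on_time"
    if watermark < window_end + allowed_lateness:
        return "late"      # still accepted; the window result is updated
    return "dropped"


# Event stamped t=30s in a 1-minute window [0, 60).
print(classify(30, 60, watermark=45))   # on_time

# Same event arriving 5 minutes late, when the watermark is at 330s.
print(classify(30, 60, watermark=330))  # dropped (no allowed lateness)
print(classify(30, 60, watermark=330, allowed_lateness=600))  # late
```

This mirrors the exam pattern above: the 5-minute-late event is lost with the default configuration and kept once allowed lateness covers the delay.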
## Triggers

Triggers determine when to emit windowed results:

| Trigger | When It Fires |
|---|---|
| AfterWatermark | When the watermark passes the window end (default) |
| AfterProcessingTime | After some processing time has elapsed |
| AfterCount | After N elements have been received |
| Repeatedly | Re-fires the wrapped trigger every time its condition is met |

Exam pattern: "Results should be emitted as early as possible after 100 events arrive, even if the watermark hasn't passed the window end." AfterCount(100) trigger.
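To see how a count-based trigger interacts with accumulation mode, here is a toy simulation of a repeated element-count trigger on a single window. It is plain Python, not the Beam trigger API, and `count_trigger_panes` is a hypothetical name used only for this sketch.

```python
def count_trigger_panes(elements, count, accumulating):
    """Emit a pane each time `count` new elements arrive in one window.

    In accumulating mode each pane holds everything seen so far.
    In discarding mode each pane holds only the elements that arrived
    since the previous firing.
    """
    panes = []
    buffer = []  # elements since the last firing
    seen = []    # all elements seen in the window
    for element in elements:
        buffer.append(element)
        seen.append(element)
        if len(buffer) == count:
            panes.append(list(seen) if accumulating else list(buffer))
            buffer.clear()
    return panes


events = [1, 2, 3, 4, 5, 6, 7]

print(count_trigger_panes(events, 3, accumulating=False))
# [[1, 2, 3], [4, 5, 6]]

print(count_trigger_panes(events, 3, accumulating=True))
# [[1, 2, 3], [1, 2, 3, 4, 5, 6]]
```

Element 7 never reaches the count threshold, so in this model it stays buffered. In a real pipeline a count trigger is usually combined with a watermark firing so that the remainder is emitted when the window closes.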