Cert-Pass
Log in Sign up
Data Engineer Associate
Databricks Databricks Certified auto_stories Compressed Course

Data Engineer Associate Compressed Course

bolt What you should know to pass the exam

6 expert modules from 322+ real exam questions. Covers every domain, exam trap, and scenario.

6
Modules
322+
Questions
93%
Pass Rate
Introduction 1 / 7
Databricks

Welcome to the Data Engineer Associate Compressed Course

What you should know to pass the exam

This comprehensive guide is derived from our 322+ real exam practice questions and prepared by our team of Databricks certification experts. It covers every topic tested on the Data Engineer Associate exam.

info What this guide covers:

  • Every exam domain with detailed explanations
  • Common exam traps that catch unprepared candidates
  • Key concepts and syntax for each module
  • Real-world scenarios from actual exam questions
  • Quick-reference cheat sheets for last-minute review

Use the Next and Back buttons (or arrow keys) to navigate. Each module covers one exam domain.

account_tree Data Engineer Associate
1
The Databricks Intelligence Platform (10%)
10% 33Q
1.1 The Two-Plane Architecture
1.2 FinOps and Cluster Types
1.3 Git Integration and Databricks Repos
2
Development and Data Ingestion (30%)
30% 50Q
2.1 Auto Loader (`cloudFiles`)
2.2 The `COPY INTO` Command
2.3 Structured Streaming Triggers
3
Data Processing and Transformations (31%)
31% 150Q
3.1 Delta Lake Internals
3.2 Time Travel
3.3 VACUUM
3.4 OPTIMIZE and Z-ORDER
3.5 Spark SQL Higher-Order Functions
3.6 Common Table Expressions and Window Functions
lock Premium Content
radio_button_unchecked Productionizing Pipelines (18%)
radio_button_unchecked Data Governance and Quality (11%)
radio_button_unchecked Quick-Reference Cheat Sheet
schedule 3 months access
Upgrade to Unlock
1 Module 1 of 6
quiz 33 real exam questions

Module 1: The Databricks Intelligence Platform (10%)

The foundation of Databricks is its separation of compute from storage — understanding this architecture is essential before anything else makes sense.

1.1 The Two-Plane Architecture

Databricks does not store your data. It processes it. The platform is split into two distinct planes:

Control Plane — Hosted in Databricks' own cloud account. Manages:

  • The web UI and notebook editor
  • Job scheduler and cluster configuration
  • Code you write (notebooks are stored here)

Data Plane — Hosted in your cloud account (AWS, Azure, or GCP). Contains:

  • The virtual machines (clusters) that execute your code
  • Your actual data storage (S3, ADLS, GCS)

Serverless Compute — A newer model where Databricks moves the compute infrastructure entirely into their own environment. You no longer wait 5–10 minutes for a cluster to boot, and you never manage instance types. It scales instantly and is billed per second of use.


1.2 FinOps and Cluster Types

The exam heavily tests your ability to choose the most cost-efficient cluster for a given scenario.

Cluster Type Use Case Billing Rate Key Trait
All-Purpose Interactive notebooks, ad-hoc queries Premium (expensive) Stays alive between runs
Job Cluster Automated pipelines Discounted (cheap) Spins up, runs, terminates
Cluster Pool Reducing cold-start time Idle VM cost only Drops start time from ~5 min to ~10 sec

Exam Rule: For any question involving automated or scheduled tasks, always select Job Cluster. All-Purpose clusters are for humans interacting with notebooks in real time.

Cluster Pools keep a set of virtual machines powered on but idle. When a job needs to start, it pulls from the pool instead of provisioning new VMs from scratch — dramatically reducing cold-start latency.


1.3 Git Integration and Databricks Repos

Databricks Repos allow you to sync your workspace directly with GitHub, GitLab, or Bitbucket.

Supported operations inside the Databricks UI:

  • Clone a repository
  • Pull latest changes
  • Push commits
  • Create and checkout branches
  • Commit changes

Not supported (exam trap): You cannot resolve merge conflicts or open Pull Requests inside the Databricks UI. These operations must be performed in your Git provider (GitHub, GitLab, etc.).


2 Module 2 of 6
quiz 50 real exam questions

Module 2: Development and Data Ingestion (30%)

This is the highest-weighted domain alongside Module 3. The focus is on the syntax and mechanics of getting raw files into Delta format reliably and incrementally.

2.1 Auto Loader (cloudFiles)

Auto Loader is Databricks' flagship tool for ingesting files from cloud storage incrementally. It is designed to handle millions of files (JSON, CSV, Parquet, Avro, etc.) without reprocessing files it has already seen.

Core mechanism: Auto Loader uses cloud-native file notification services (e.g., AWS SQS + SNS) to detect new files as they land, rather than listing the entire directory on every run.

Syntax you must know:

(spark.readStream .format("cloudFiles") # Auto Loader always uses this format .option("cloudFiles.format", "json") # The actual file format on disk .option("cloudFiles.schemaLocation", "/path/to/schema") # Required for schema inference .load("/data/raw_json/") .writeStream .option("checkpointLocation", "/path/to/checkpoint") .table("bronze_table") )

Schema Evolution: If a new column appears in tomorrow's JSON files, Auto Loader will fail by default. To safely absorb new columns without breaking the pipeline:

.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

This tells Auto Loader to automatically update the schema of the target table when new columns are detected, rather than throwing an error.


2.2 The COPY INTO Command

COPY INTO is the SQL-native alternative to Auto Loader for incremental file loading.

Key property — Idempotency: COPY INTO maintains an internal record of every file it has already processed. Running the same command multiple times will never duplicate data.

COPY INTO target_delta_table FROM '/path/to/raw/data/' FILEFORMAT = CSV FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

If you schedule this command to run every hour, it will only load files that arrived since the last run. Files already loaded are permanently tracked and skipped.

Auto Loader vs. COPY INTO — When to use which:

Scenario Recommended Tool
Millions of files, high-volume ingestion Auto Loader (cloudFiles)
Thousands of files, simpler SQL-based setup COPY INTO
Need schema evolution handling Auto Loader
Idempotent batch loads from a known path COPY INTO

2.3 Structured Streaming Triggers

A streaming job running 24/7 is expensive. Databricks provides trigger modes to control when and how a stream processes data.

trigger(availableNow=True) — The most tested trigger on the exam.

(df.writeStream .trigger(availableNow=True) .option("checkpointLocation", "/path/to/checkpoint") .table("silver_table") )

Behavior: When the job runs, it processes all data that has accumulated since the last checkpoint in multiple micro-batches, then automatically shuts the stream down. It combines the efficiency of batch processing with the incremental awareness of streaming.

trigger(processingTime="5 minutes") — Processes a micro-batch on a fixed interval. The stream stays alive between batches.

trigger(once=True) — Deprecated predecessor to availableNow. Processed all available data in a single micro-batch, then shut down. availableNow is faster because it uses multiple micro-batches.


3 Module 3 of 6
quiz 150 real exam questions

Module 3: Data Processing and Transformations (31%)

The largest module by weight. Covers Delta Lake internals, file management commands, and Spark SQL syntax for complex transformations.

3.1 Delta Lake Internals

Delta Lake is not a separate database engine. It is a storage layer built on top of Parquet files, enhanced by a transaction log.

Physical structure of a Delta table:

/my_delta_table/ _delta_log/ ← JSON transaction log (the "brain") 00000.json 00001.json 00002.json part-00001.parquet part-00002.parquet

Every write operation (INSERT, UPDATE, DELETE, MERGE) appends a new JSON entry to the _delta_log. This log is what enables ACID transactions, time travel, and schema enforcement.


3.2 Time Travel

Because every version of the table is recorded in the transaction log, you can query any historical state.

-- Query by timestamp SELECT * FROM sales TIMESTAMP AS OF '2026-05-10 12:00:00'; -- Query by version number SELECT * FROM sales VERSION AS OF 5; -- Restore the entire table to a previous state RESTORE TABLE sales TO VERSION AS OF 5; -- View the full version history DESCRIBE HISTORY sales;

3.3 VACUUM

Time travel requires keeping old data files on disk. VACUUM permanently deletes data files that are no longer referenced by the transaction log.

VACUUM sales; -- Uses default 7-day retention VACUUM sales RETAIN 168 HOURS; -- Explicit 7-day retention VACUUM sales RETAIN 0 HOURS DRY RUN; -- Preview what would be deleted

Critical exam trap: You cannot VACUUM data younger than 7 days (168 hours) by default. Attempting to do so will throw an error. This protection exists to prevent you from deleting files that active streaming queries or concurrent readers might still be referencing.

To override this protection (not recommended in production):

SET spark.databricks.delta.retentionDurationCheck.enabled = false; VACUUM sales RETAIN 0 HOURS;

3.4 OPTIMIZE and Z-ORDER

Over time, Delta tables accumulate thousands of tiny files from frequent small writes (a problem called the "small file problem"). OPTIMIZE solves this.

-- Compact small files into larger, efficient files (~1GB each) OPTIMIZE sales; -- Compact AND physically co-locate related data OPTIMIZE sales ZORDER BY (store_id);

OPTIMIZE merges many small Parquet files into fewer large files, dramatically improving read performance and reducing metadata overhead.

Z-ORDER goes further: it physically sorts and co-locates rows with the same store_id value within the same files. When a query filters by store_id, Databricks can skip entire files that don't contain the requested value — a technique called data skipping. This can reduce the amount of data scanned by 80–95%.

When to use Z-ORDER: On columns that are frequently used in WHERE clauses, JOIN conditions, or GROUP BY operations. Limit to 1–4 columns for best results.


3.5 Spark SQL Higher-Order Functions

These functions operate on array columns (common in JSON-sourced data) and are directly tested on the exam.

FILTER(array, condition) — Returns a new array containing only elements that satisfy the condition.

-- Return only employees with more than 5 years of experience SELECT FILTER(employees_array, emp -> emp.years_exp > 5) AS senior_employees FROM stores;

TRANSFORM(array, expression) — Applies a transformation to every element in an array and returns a new array of the same length.

-- Apply a 10% raise to every salary in the array SELECT TRANSFORM(salary_array, s -> s * 1.10) AS adjusted_salaries FROM employees;

AGGREGATE(array, zero, merge, finish) — Reduces an array to a single value (similar to reduce in programming).

-- Sum all values in an array SELECT AGGREGATE(scores_array, 0, (acc, x) -> acc + x) AS total_score FROM students;

EXISTS(array, condition) — Returns true if at least one element in the array satisfies the condition.

-- Find stores that have at least one employee with a manager role SELECT EXISTS(employees_array, emp -> emp.role = 'manager') AS has_manager FROM stores;

3.6 Common Table Expressions and Window Functions

CTEs improve readability and are heavily used in multi-step SQL transformations:

WITH monthly_sales AS ( SELECT store_id, DATE_TRUNC('month', sale_date) AS month, SUM(amount) AS total FROM sales GROUP BY store_id, month ), ranked_stores AS ( SELECT *, RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rank FROM monthly_sales ) SELECT * FROM ranked_stores WHERE rank <= 3;

Window functions compute values across a set of rows related to the current row without collapsing them into a single output row:

SELECT store_id, sale_date, amount, SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total, LAG(amount, 1) OVER (PARTITION BY store_id ORDER BY sale_date) AS previous_sale FROM sales;

4 Module 4 of 6
quiz 37 real exam questions

Module 4: Productionizing Pipelines (18%)

Module 4: Productionizing Pipelines (18%)

Once your transformations work, you need to schedule, orchestrate, and automate them reliably. This module covers Databricks Workflows and Delta Live Tables.

4.1 Databricks Workflows (Jobs)

A Workflow is a directed acyclic graph (DAG) of tasks. Each task can be a Notebook, Python script, SQL query, dbt model, or a DLT pipeline.

Task dependency behavior:

Task A (Extract) --> Task B (Transform) --> Task C (Load)
  • If Task A succeeds, Task B runs normally.
  • If Task A fails, Task B is marked as "Upstream Failed" and does not execute.
  • Task C is also marked "Upstream Failed" by cascade.

Repair and Rerun: If Task A fails and you fix the root cause, you can use "Repair Run" to re-execute only the failed task and its downstream dependents — without re-running tasks that already succeeded.

Multi-task flexibility: A single workflow can orchestrate Python notebooks, SQL notebooks, dbt Core projects, and DLT pipelines all together in one DAG.


4.2 Delta Live Tables (DLT)

DLT is a declarative framework for building the Medallion Architecture. Instead of writing imperative code that manages cluster lifecycle, checkpoints, and dependencies manually, you declare what you want each table to contain and DLT handles the rest.

Syntax comparison:

# Bronze layer — Streaming Live Table (incremental, append-only) @dlt.table(comment="Raw ingestion from cloud storage") def bronze_orders(): return ( spark.readStream .format("cloudFiles") .option("cloudFiles.format", "json") .load("/raw/orders/") ) # Silver layer — Streaming Live Table with quality expectations @dlt.table(comment="Cleaned and validated orders") @dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL") def silver_orders(): return dlt.read_stream("bronze_orders").where("amount > 0") # Gold layer — Live Table (full recompute, materialized view) @dlt.table(comment="Aggregated daily revenue by store") def gold_daily_revenue(): return ( dlt.read("silver_orders") .groupBy("store_id", "order_date") .agg({"amount": "sum"}) )

Key distinction:

Table Type SQL Syntax Python Decorator Behavior
Streaming Live Table CREATE STREAMING LIVE TABLE @dlt.table + readStream Incremental, processes new data only
Live Table CREATE LIVE TABLE @dlt.table + read Full recompute on every pipeline run

Execution modes:

Mode Cost Latency Best For
Continuous High (runs 24/7) Near-zero Real-time dashboards, fraud detection
Triggered Low (runs on schedule) Minutes Nightly batch pipelines, cost-sensitive workloads

lock_open

Module 4 is locked

Unlock all 6 modules with exam traps and cheat sheets.

5 Module 5 of 6
quiz 52 real exam questions

Module 5: Data Governance and Quality (11%)

Module 5: Data Governance and Quality (11%)

Governance is about knowing who can see what data, and quality is about ensuring the data that flows through your pipelines is trustworthy.

5.1 Unity Catalog

Unity Catalog is the centralized, cross-workspace governance layer for Databricks. It provides a single place to manage access control, data lineage, and auditing across all workspaces in an organization.

Three-level namespace hierarchy:

Metastore └── Catalog (e.g., main, dev, prod) └── Schema / Database (e.g., sales_db) └── Table / View / Volume

Granting access — you must grant at every level:

-- Step 1: Grant access to the catalog GRANT USAGE ON CATALOG main TO data_team; -- Step 2: Grant access to the specific database GRANT USAGE ON DATABASE main.sales_db TO data_team; -- Step 3: Grant the actual data permission GRANT SELECT ON TABLE main.sales_db.orders TO data_team;

Exam trap: Granting SELECT on a table is not enough if the team does not also have USAGE on the parent catalog and schema. All three grants are required.

Common privilege types:

Privilege Scope Effect
USAGE Catalog, Schema Allows navigation into the object
SELECT Table, View Allows reading data
MODIFY Table Allows INSERT, UPDATE, DELETE
CREATE Schema Allows creating new tables
ALL PRIVILEGES Any Grants all applicable permissions

5.2 Delta Sharing

Delta Sharing is an open protocol built into Unity Catalog that allows you to share live, read-only data with external consumers — without copying the data and without requiring the recipient to have a Databricks account.

Supported recipients can use: Snowflake, BigQuery, Apache Spark, Pandas, Power BI, and any tool that supports the open Delta Sharing protocol.

Key properties:

  • Data is never copied — the recipient queries your Delta table directly through a secure token
  • You control exactly which tables, schemas, and even row/column subsets are shared
  • Access can be revoked instantly

5.3 DLT Data Quality Expectations

DLT provides three enforcement modes for data quality rules. Choosing the wrong one is a common exam mistake.

Mode 1 — Monitor only (keep bad rows):

@dlt.expect("valid_id", "id IS NOT NULL")
  • Bad rows are written to the table
  • The violation is recorded in the pipeline event log
  • The pipeline continues running
  • Use when: you want visibility into data quality issues without blocking the pipeline

Mode 2 — Drop bad rows:

@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
  • Bad rows are silently dropped and never written
  • Good rows are written normally
  • The pipeline continues running
  • Use when: bad records are expected occasionally and downstream tables should only contain clean data

Mode 3 — Fail the pipeline:

@dlt.expect_or_fail("valid_id", "id IS NOT NULL")
  • Any bad row causes the entire pipeline to stop immediately
  • No data is written for that batch
  • Use when: data quality is non-negotiable (financial transactions, regulatory reporting) and writing bad data would corrupt downstream systems

Decision guide:

Expectation Mode Bad Row Fate Pipeline Fate Use Case
expect Kept, flagged Continues Monitoring and observability
expect_or_drop Dropped Continues Normal data cleaning
expect_or_fail Blocks write Fails immediately Critical, zero-tolerance data

lock_open

Module 5 is locked

Unlock all 6 modules with exam traps and cheat sheets.

6 Module 6 of 6

Quick-Reference Cheat Sheet

Quick-Reference Cheat Sheet

This section consolidates the most exam-critical facts into a single scannable reference.

Architecture

  • Control Plane = Databricks' cloud (UI, notebooks, job scheduler)
  • Data Plane = Your cloud (VMs, S3/ADLS/GCS)
  • Serverless = Compute in Databricks' cloud, instant scale, no cluster management

Cluster Selection Rules

  • Automated job → Job Cluster (cheap, terminates after run)
  • Interactive development → All-Purpose Cluster (expensive, stays alive)
  • Fast cold starts → Cluster Pool (idle VMs on standby)

Ingestion Commands

# Auto Loader — high-volume incremental file ingestion spark.readStream.format("cloudFiles") .option("cloudFiles.format", "json") .option("cloudFiles.schemaLocation", "/schema/path") .option("cloudFiles.schemaEvolutionMode", "addNewColumns") # handle new columns .load("/raw/path") # Streaming trigger — process all available data then stop .trigger(availableNow=True)
-- COPY INTO — SQL-native idempotent batch load COPY INTO target_table FROM '/raw/path' FILEFORMAT = CSV;

Delta Lake Commands

-- Time travel SELECT * FROM table TIMESTAMP AS OF '2026-05-10'; SELECT * FROM table VERSION AS OF 5; RESTORE TABLE table TO VERSION AS OF 5; DESCRIBE HISTORY table; -- File management OPTIMIZE table ZORDER BY (column); -- compact files + data skipping VACUUM table RETAIN 168 HOURS; -- delete old files (min 7 days default) -- Cannot VACUUM below 7 days without disabling safety check SET spark.databricks.delta.retentionDurationCheck.enabled = false;

Higher-Order Functions

FILTER(array_col, x -> x.field > value) -- filter array elements TRANSFORM(array_col, x -> x * 1.1) -- apply function to each element AGGREGATE(array_col, 0, (acc, x) -> acc + x) -- reduce array to single value EXISTS(array_col, x -> x.field = 'value') -- boolean: any element matches

Unity Catalog Grants

-- Always grant at catalog, schema, AND table level GRANT USAGE ON CATALOG catalog_name TO role; GRANT USAGE ON DATABASE catalog.schema TO role; GRANT SELECT ON TABLE catalog.schema.table TO role;

DLT Expectations

@dlt.expect("rule_name", "condition") # keep bad rows, log violation @dlt.expect_or_drop("rule_name", "condition") # drop bad rows, continue @dlt.expect_or_fail("rule_name", "condition") # bad row = pipeline failure

DLT Table Types

# Streaming Live Table — incremental, Bronze/Silver @dlt.table def my_table(): return spark.readStream... # Live Table — full recompute, Gold aggregations @dlt.table def my_aggregate(): return dlt.read("upstream_table").groupBy(...)

Git in Databricks Repos

  • Supported: clone, pull, push, commit, branch checkout
  • Not supported in UI: merge conflict resolution, pull requests

VACUUM Retention Rule

  • Default minimum retention: 7 days / 168 hours
  • Attempting to vacuum below this threshold throws an error
  • Override requires explicitly disabling the retention check (not recommended in production)
lock_open

Module 6 is locked

Unlock all 6 modules with exam traps and cheat sheets.

Course Modules

1 / 7

Unlock All Modules

Get full access to all 6 modules

auto_stories More Guides