The Comprehensive Databricks Data Engineer Associate Masterclass

This masterclass covers every major domain tested on the Databricks Data Engineer Associate exam, organized by module weight. Each section combines conceptual explanation with precise syntax, exam traps, and a cheat sheet summary so you can study efficiently and confidently.


Module 1: The Databricks Intelligence Platform (10%)

The foundation of Databricks is its separation of compute from storage — understanding this architecture is essential before anything else makes sense.

1.1 The Two-Plane Architecture

Databricks does not store your data. It processes it. The platform is split into two distinct planes:

Control Plane — Hosted in Databricks' own cloud account. Manages:

  • The web UI and notebook editor
  • Job scheduler and cluster configuration
  • Code you write (notebooks are stored here)

Data Plane — Hosted in your cloud account (AWS, Azure, or GCP). Contains:

  • The virtual machines (clusters) that execute your code
  • Your actual data storage (S3, ADLS, GCS)

Serverless Compute — A newer model where Databricks moves the compute infrastructure entirely into their own environment. You no longer wait 5–10 minutes for a cluster to boot, and you never manage instance types. It scales instantly and is billed per second of use.


1.2 FinOps and Cluster Types

The exam heavily tests your ability to choose the most cost-efficient cluster for a given scenario.

Cluster Type | Use Case                              | Billing Rate        | Key Trait
All-Purpose  | Interactive notebooks, ad-hoc queries | Premium (expensive) | Stays alive between runs
Job Cluster  | Automated pipelines                   | Discounted (cheap)  | Spins up, runs, terminates
Cluster Pool | Reducing cold-start time              | Idle VM cost only   | Drops start time from ~5 min to ~10 sec

Exam Rule: For any question involving automated or scheduled tasks, always select Job Cluster. All-Purpose clusters are for humans interacting with notebooks in real time.

Cluster Pools keep a set of virtual machines powered on but idle. When a job needs to start, it pulls from the pool instead of provisioning new VMs from scratch — dramatically reducing cold-start latency.
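
As an illustration, the sketch below shows how a job cluster can draw its workers from a pool in a Jobs-API-2.1-style cluster definition; the pool ID, runtime version, and worker count are placeholders, not values from this guide.

# Hypothetical job-cluster spec that provisions workers from a pre-warmed pool
job_cluster_spec = {
    "job_clusters": [
        {
            "job_cluster_key": "nightly_etl_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",    # placeholder runtime
                "num_workers": 2,
                "instance_pool_id": "pool-0123456789",  # placeholder pool ID
            },
        }
    ]
}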


1.3 Git Integration and Databricks Repos

Databricks Repos allow you to sync your workspace directly with GitHub, GitLab, or Bitbucket.

Supported operations inside the Databricks UI:

  • Clone a repository
  • Pull latest changes
  • Push commits
  • Create and checkout branches
  • Commit changes

Not supported (exam trap): You cannot resolve merge conflicts or open Pull Requests inside the Databricks UI. These operations must be performed in your Git provider (GitHub, GitLab, etc.).


Module 2: Development and Data Ingestion (30%)

Together with Module 3, this is one of the two most heavily weighted domains. The focus is on the syntax and mechanics of getting raw files into Delta format reliably and incrementally.

2.1 Auto Loader (cloudFiles)

Auto Loader is Databricks' flagship tool for ingesting files from cloud storage incrementally. It is designed to handle millions of files (JSON, CSV, Parquet, Avro, etc.) without reprocessing files it has already seen.

Core mechanism: Auto Loader keeps track of every file it has already ingested. New files are discovered either through directory listing (the default) or through cloud-native file notification services (e.g., AWS SQS + SNS, enabled with cloudFiles.useNotifications), which avoids repeatedly listing large directories; in both modes, files that were already loaded are never reprocessed.

Syntax you must know:

(spark.readStream
    .format("cloudFiles")                              # Auto Loader always uses this format
    .option("cloudFiles.format", "json")               # The actual file format on disk
    .option("cloudFiles.schemaLocation", "/path/to/schema")  # Required for schema inference
    .load("/data/raw_json/")
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("bronze_table")
)

Schema Evolution: If a new column appears in tomorrow's JSON files, the stream will fail rather than silently drop the data. To absorb new columns without permanently breaking the pipeline:

.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

With this mode, Auto Loader evolves the schema of the target table when new columns are detected: the stream stops once so the stored schema can be updated, then picks up the new columns on restart (a scheduled job or DLT pipeline restarts the stream automatically).


2.2 The COPY INTO Command

COPY INTO is the SQL-native alternative to Auto Loader for incremental file loading.

Key property — Idempotency: COPY INTO maintains an internal record of every file it has already processed. Running the same command multiple times will never duplicate data.

COPY INTO target_delta_table
FROM '/path/to/raw/data/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

If you schedule this command to run every hour, it will only load files that arrived since the last run. Files already loaded are permanently tracked and skipped.

Auto Loader vs. COPY INTO — When to use which:

Scenario                                    | Recommended Tool
Millions of files, high-volume ingestion    | Auto Loader (cloudFiles)
Thousands of files, simpler SQL-based setup | COPY INTO
Need schema evolution handling              | Auto Loader
Idempotent batch loads from a known path    | COPY INTO

2.3 Structured Streaming Triggers

A streaming job running 24/7 is expensive. Databricks provides trigger modes to control when and how a stream processes data.

trigger(availableNow=True) — The most tested trigger on the exam.

(df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("silver_table")
)

Behavior: When the job runs, it processes all data that has accumulated since the last checkpoint in multiple micro-batches, then automatically shuts the stream down. It combines the efficiency of batch processing with the incremental awareness of streaming.

trigger(processingTime="5 minutes") — Processes a micro-batch on a fixed interval. The stream stays alive between batches.

trigger(once=True) — Deprecated predecessor to availableNow. It processed all available data in a single micro-batch, then shut down. availableNow scales better because it splits the backlog across multiple micro-batches and respects rate-limit options such as maxFilesPerTrigger.
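
For comparison, a minimal sketch of the other two trigger forms (the table and checkpoint paths are placeholders):

# Stream stays alive and fires a micro-batch every 5 minutes
(df.writeStream
    .trigger(processingTime="5 minutes")
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("silver_table")
)

# Deprecated single-batch form, superseded by availableNow=True
# .trigger(once=True)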


Module 3: Data Processing and Transformations (31%)

The largest module by weight. Covers Delta Lake internals, file management commands, and Spark SQL syntax for complex transformations.

3.1 Delta Lake Internals

Delta Lake is not a separate database engine. It is a storage layer built on top of Parquet files, enhanced by a transaction log.

Physical structure of a Delta table:

/my_delta_table/
    _delta_log/          ← JSON transaction log (the "brain")
        00000.json
        00001.json
        00002.json
    part-00001.parquet
    part-00002.parquet

Every write operation (INSERT, UPDATE, DELETE, MERGE) appends a new JSON entry to the _delta_log. This log is what enables ACID transactions, time travel, and schema enforcement.


3.2 Time Travel

Because every version of the table is recorded in the transaction log, you can query any historical state.

-- Query by timestamp
SELECT * FROM sales TIMESTAMP AS OF '2026-05-10 12:00:00';

-- Query by version number
SELECT * FROM sales VERSION AS OF 5;

-- Restore the entire table to a previous state
RESTORE TABLE sales TO VERSION AS OF 5;

-- View the full version history
DESCRIBE HISTORY sales;
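
The same historical reads are available from the DataFrame API; a minimal sketch, assuming the sales table from the SQL examples above:

# Read a previous version or point-in-time snapshot of the Delta table
df_v5 = spark.read.option("versionAsOf", 5).table("sales")
df_ts = spark.read.option("timestampAsOf", "2026-05-10 12:00:00").table("sales")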

3.3 VACUUM

Time travel requires keeping old data files on disk. VACUUM permanently deletes data files that are no longer referenced by the current version of the table and are older than the retention threshold.

VACUUM sales;                          -- Uses default 7-day retention
VACUUM sales RETAIN 168 HOURS;         -- Explicit 7-day retention
VACUUM sales RETAIN 0 HOURS DRY RUN;   -- Preview what would be deleted

Critical exam trap: You cannot VACUUM data younger than 7 days (168 hours) by default. Attempting to do so will throw an error. This protection exists to prevent you from deleting files that active streaming queries or concurrent readers might still be referencing.

To override this protection (not recommended in production):

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM sales RETAIN 0 HOURS;

3.4 OPTIMIZE and Z-ORDER

Over time, Delta tables accumulate thousands of tiny files from frequent small writes (the "small file problem"). OPTIMIZE solves this.

-- Compact small files into larger, efficient files (~1GB each)
OPTIMIZE sales;

-- Compact AND physically co-locate related data
OPTIMIZE sales ZORDER BY (store_id);

OPTIMIZE merges many small Parquet files into fewer large files, dramatically improving read performance and reducing metadata overhead.

Z-ORDER goes further: it physically sorts and co-locates rows with the same store_id value within the same files. When a query filters by store_id, Databricks can skip entire files that don't contain the requested value — a technique called data skipping. This can reduce the amount of data scanned by 80–95%.

When to use Z-ORDER: On columns that are frequently used in WHERE clauses, JOIN conditions, or GROUP BY operations. Limit to 1–4 columns for best results.
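
For programmatic pipelines, recent Delta Lake runtimes expose the same operations through the Python API; a minimal sketch, assuming the sales table above:

from delta.tables import DeltaTable

# Compact files and Z-ORDER by the filter column, equivalent to the SQL above
delta_table = DeltaTable.forName(spark, "sales")
delta_table.optimize().executeZOrderBy("store_id")   # use .executeCompaction() for compaction only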


3.5 Spark SQL Higher-Order Functions

These functions operate on array columns (common in JSON-sourced data) and are directly tested on the exam.

FILTER(array, condition) — Returns a new array containing only elements that satisfy the condition.

-- Return only employees with more than 5 years of experience
SELECT FILTER(employees_array, emp -> emp.years_exp > 5) AS senior_employees
FROM stores;

TRANSFORM(array, expression) — Applies a transformation to every element in an array and returns a new array of the same length.

-- Apply a 10% raise to every salary in the array
SELECT TRANSFORM(salary_array, s -> s * 1.10) AS adjusted_salaries
FROM employees;

AGGREGATE(array, zero, merge, finish) — Reduces an array to a single value (similar to reduce in programming).

-- Sum all values in an array
SELECT AGGREGATE(scores_array, 0, (acc, x) -> acc + x) AS total_score
FROM students;

EXISTS(array, condition) — Returns true if at least one element in the array satisfies the condition.

-- Find stores that have at least one employee with a manager role
SELECT EXISTS(employees_array, emp -> emp.role = 'manager') AS has_manager
FROM stores;
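
The same higher-order operations are available in the DataFrame API via pyspark.sql.functions; a minimal sketch, assuming DataFrames named after the tables in the SQL examples:

from pyspark.sql import functions as F

# FILTER and EXISTS over the employees array
stores.select(
    F.filter("employees_array", lambda emp: emp["years_exp"] > 5).alias("senior_employees"),
    F.exists("employees_array", lambda emp: emp["role"] == "manager").alias("has_manager"),
)

# TRANSFORM: apply a 10% raise to every element
employees.select(F.transform("salary_array", lambda s: s * 1.10).alias("adjusted_salaries"))

# AGGREGATE: reduce the array to a sum (the zero value's type should match the elements)
students.select(F.aggregate("scores_array", F.lit(0), lambda acc, x: acc + x).alias("total_score"))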

3.6 Common Table Expressions and Window Functions

CTEs improve readability and are heavily used in multi-step SQL transformations:

WITH monthly_sales AS (
    SELECT store_id, DATE_TRUNC('month', sale_date) AS month, SUM(amount) AS total
    FROM sales
    GROUP BY store_id, month
),
ranked_stores AS (
    SELECT *, RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rank
    FROM monthly_sales
)
SELECT * FROM ranked_stores WHERE rank <= 3;

Window functions compute values across a set of rows related to the current row without collapsing them into a single output row:

SELECT
    store_id,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    LAG(amount, 1) OVER (PARTITION BY store_id ORDER BY sale_date) AS previous_sale
FROM sales;
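
The equivalent DataFrame API uses a Window specification; a minimal sketch, assuming a sales_df DataFrame with the same columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

store_window = Window.partitionBy("store_id").orderBy("sale_date")
running_window = store_window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales_df.select(
    "store_id", "sale_date", "amount",
    F.sum("amount").over(running_window).alias("running_total"),
    F.lag("amount", 1).over(store_window).alias("previous_sale"),
)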

Module 4: Productionizing Pipelines (18%)

Once your transformations work, you need to schedule, orchestrate, and automate them reliably. This module covers Databricks Workflows and Delta Live Tables.

4.1 Databricks Workflows (Jobs)

A Workflow is a directed acyclic graph (DAG) of tasks. Each task can be a Notebook, Python script, SQL query, dbt model, or a DLT pipeline.

Task dependency behavior:

Task A (Extract) --> Task B (Transform) --> Task C (Load)
  • If Task A succeeds, Task B runs normally.
  • If Task A fails, Task B is marked as "Upstream Failed" and does not execute.
  • Task C is also marked "Upstream Failed" by cascade.
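
A minimal sketch of how that chain is declared as Jobs API 2.1 task definitions (task keys and notebook paths are placeholders):

# Task B depends on A, and C depends on B — failures cascade as described above
workflow_tasks = {
    "tasks": [
        {"task_key": "extract",
         "notebook_task": {"notebook_path": "/Pipelines/extract"}},
        {"task_key": "transform",
         "depends_on": [{"task_key": "extract"}],
         "notebook_task": {"notebook_path": "/Pipelines/transform"}},
        {"task_key": "load",
         "depends_on": [{"task_key": "transform"}],
         "notebook_task": {"notebook_path": "/Pipelines/load"}},
    ]
}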

Repair and Rerun: If Task A fails and you fix the root cause, you can use "Repair Run" to re-execute only the failed task and its downstream dependents — without re-running tasks that already succeeded.

Multi-task flexibility: A single workflow can orchestrate Python notebooks, SQL notebooks, dbt Core projects, and DLT pipelines all together in one DAG.


4.2 Delta Live Tables (DLT)

DLT is a declarative framework for building the Medallion Architecture. Instead of writing imperative code that manages cluster lifecycle, checkpoints, and dependencies manually, you declare what you want each table to contain and DLT handles the rest.

Syntax comparison:

import dlt

# Bronze layer — Streaming Live Table (incremental, append-only)
@dlt.table(comment="Raw ingestion from cloud storage")
def bronze_orders():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/orders/")
    )

# Silver layer — Streaming Live Table with quality expectations
@dlt.table(comment="Cleaned and validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return dlt.read_stream("bronze_orders").where("amount > 0")

# Gold layer — Live Table (full recompute, materialized view)
@dlt.table(comment="Aggregated daily revenue by store")
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
            .groupBy("store_id", "order_date")
            .agg({"amount": "sum"})
    )

Key distinction:

Table Type           | SQL Syntax                  | Python Decorator        | Behavior
Streaming Live Table | CREATE STREAMING LIVE TABLE | @dlt.table + readStream | Incremental, processes new data only
Live Table           | CREATE LIVE TABLE           | @dlt.table + read       | Full recompute on every pipeline run

Execution modes:

Mode       | Cost                   | Latency   | Best For
Continuous | High (runs 24/7)       | Near-zero | Real-time dashboards, fraud detection
Triggered  | Low (runs on schedule) | Minutes   | Nightly batch pipelines, cost-sensitive workloads

Module 5: Data Governance and Quality (11%)

Governance is about knowing who can see what data, and quality is about ensuring the data that flows through your pipelines is trustworthy.

5.1 Unity Catalog

Unity Catalog is the centralized, cross-workspace governance layer for Databricks. It provides a single place to manage access control, data lineage, and auditing across all workspaces in an organization.

Three-level namespace hierarchy:

Metastore
└── Catalog (e.g., main, dev, prod)
    └── Schema / Database (e.g., sales_db)
        └── Table / View / Volume

Granting access — you must grant at every level:

-- Step 1: Grant access to the catalog
GRANT USAGE ON CATALOG main TO data_team;

-- Step 2: Grant access to the specific database
GRANT USAGE ON DATABASE main.sales_db TO data_team;

-- Step 3: Grant the actual data permission
GRANT SELECT ON TABLE main.sales_db.orders TO data_team;

Exam trap: Granting SELECT on a table is not enough if the team does not also have USAGE on the parent catalog and schema. All three grants are required.

Common privilege types:

Privilege      | Scope           | Effect
USAGE          | Catalog, Schema | Allows navigation into the object
SELECT         | Table, View     | Allows reading data
MODIFY         | Table           | Allows INSERT, UPDATE, DELETE
CREATE         | Schema          | Allows creating new tables
ALL PRIVILEGES | Any             | Grants all applicable permissions

5.2 Delta Sharing

Delta Sharing is an open protocol built into Unity Catalog that allows you to share live, read-only data with external consumers — without copying the data and without requiring the recipient to have a Databricks account.

Supported recipients can use: Snowflake, BigQuery, Apache Spark, Pandas, Power BI, and any tool that supports the open Delta Sharing protocol.

Key properties:

  • Data is never copied — the recipient queries your Delta table directly through a secure token
  • You control exactly which tables, schemas, and even row/column subsets are shared
  • Access can be revoked instantly
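
On the provider side, a share is built up with Databricks SQL commands; a minimal sketch run from a Python notebook, with placeholder share and recipient names:

# Create a share, add a table, create an external recipient, and grant read access
spark.sql("CREATE SHARE IF NOT EXISTS quarterly_sales_share")
spark.sql("ALTER SHARE quarterly_sales_share ADD TABLE main.sales_db.orders")
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_analytics")
spark.sql("GRANT SELECT ON SHARE quarterly_sales_share TO RECIPIENT partner_analytics")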

5.3 DLT Data Quality Expectations

DLT provides three enforcement modes for data quality rules. Choosing the wrong one is a common exam mistake.

Mode 1 — Monitor only (keep bad rows):

@dlt.expect("valid_id", "id IS NOT NULL")
  • Bad rows are written to the table
  • The violation is recorded in the pipeline event log
  • The pipeline continues running
  • Use when: you want visibility into data quality issues without blocking the pipeline

Mode 2 — Drop bad rows:

@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
  • Bad rows are silently dropped and never written
  • Good rows are written normally
  • The pipeline continues running
  • Use when: bad records are expected occasionally and downstream tables should only contain clean data

Mode 3 — Fail the pipeline:

@dlt.expect_or_fail("valid_id", "id IS NOT NULL")
  • Any bad row causes the entire pipeline to stop immediately
  • No data is written for that batch
  • Use when: data quality is non-negotiable (financial transactions, regulatory reporting) and writing bad data would corrupt downstream systems

Decision guide:

Expectation Mode | Bad Row Fate  | Pipeline Fate     | Use Case
expect           | Kept, flagged | Continues         | Monitoring and observability
expect_or_drop   | Dropped       | Continues         | Normal data cleaning
expect_or_fail   | Blocks write  | Fails immediately | Critical, zero-tolerance data
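
When a table needs several rules at once, DLT also provides dictionary-based variants of the same three modes; a minimal sketch with example rule names:

import dlt

@dlt.table(comment="Orders that passed all quality rules")
@dlt.expect_all_or_drop({
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "amount > 0",
})
def validated_orders():
    # expect_all and expect_all_or_fail follow the same pattern for the other modes
    return dlt.read_stream("silver_orders")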

Quick-Reference Cheat Sheet

This section consolidates the most exam-critical facts into a single scannable reference.

Architecture

  • Control Plane = Databricks' cloud (UI, notebooks, job scheduler)
  • Data Plane = Your cloud (VMs, S3/ADLS/GCS)
  • Serverless = Compute in Databricks' cloud, instant scale, no cluster management

Cluster Selection Rules

  • Automated job → Job Cluster (cheap, terminates after run)
  • Interactive development → All-Purpose Cluster (expensive, stays alive)
  • Fast cold starts → Cluster Pool (idle VMs on standby)

Ingestion Commands

# Auto Loader — high-volume incremental file ingestion
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schema/path")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # handle new columns
    .load("/raw/path"))

# Streaming trigger — process all available data then stop
.trigger(availableNow=True)
-- COPY INTO — SQL-native idempotent batch load
COPY INTO target_table FROM '/raw/path' FILEFORMAT = CSV;

Delta Lake Commands

-- Time travel
SELECT * FROM table TIMESTAMP AS OF '2026-05-10';
SELECT * FROM table VERSION AS OF 5;
RESTORE TABLE table TO VERSION AS OF 5;
DESCRIBE HISTORY table;

-- File management
OPTIMIZE table ZORDER BY (column);   -- compact files + data skipping
VACUUM table RETAIN 168 HOURS;       -- delete old files (min 7 days default)

-- Cannot VACUUM below 7 days without disabling safety check
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

Higher-Order Functions

FILTER(array_col, x -> x.field > value)       -- filter array elements
TRANSFORM(array_col, x -> x * 1.1)            -- apply function to each element
AGGREGATE(array_col, 0, (acc, x) -> acc + x)  -- reduce array to single value
EXISTS(array_col, x -> x.field = 'value')     -- boolean: any element matches

Unity Catalog Grants

-- Always grant at catalog, schema, AND table level
GRANT USAGE ON CATALOG catalog_name TO role;
GRANT USAGE ON DATABASE catalog.schema TO role;
GRANT SELECT ON TABLE catalog.schema.table TO role;

DLT Expectations

@dlt.expect("rule_name", "condition")           # keep bad rows, log violation
@dlt.expect_or_drop("rule_name", "condition")   # drop bad rows, continue
@dlt.expect_or_fail("rule_name", "condition")   # bad row = pipeline failure

DLT Table Types

# Streaming Live Table — incremental, Bronze/Silver
@dlt.table
def my_table():
    return spark.readStream...

# Live Table — full recompute, Gold aggregations
@dlt.table
def my_aggregate():
    return dlt.read("upstream_table").groupBy(...)

Git in Databricks Repos

  • Supported: clone, pull, push, commit, branch checkout
  • Not supported in UI: merge conflict resolution, pull requests

VACUUM Retention Rule

  • Default minimum retention: 7 days / 168 hours
  • Attempting to vacuum below this threshold throws an error
  • Override requires explicitly disabling the retention check (not recommended in production)