The Comprehensive Databricks Data Engineer Associate Masterclass

This masterclass covers every major domain tested on the Databricks Data Engineer Associate exam, organized by module weight. Each section combines conceptual explanation with precise syntax, exam traps, and a cheat sheet summary so you can study efficiently and confidently.


Module 1: The Databricks Intelligence Platform (10%)

The foundation of Databricks is its separation of compute from storage — understanding this architecture is essential before anything else makes sense.

1.1 The Two-Plane Architecture

Databricks does not store your data. It processes it. The platform is split into two distinct planes:

Control Plane — Hosted in Databricks' own cloud account. Manages:

  • The web UI and notebook editor
  • Job scheduler and cluster configuration
  • Code you write (notebooks are stored here)

Data Plane — Hosted in your cloud account (AWS, Azure, or GCP). Contains:

  • The virtual machines (clusters) that execute your code
  • Your actual data storage (S3, ADLS, GCS)

Serverless Compute — A newer model where Databricks moves the compute infrastructure entirely into their own environment. You no longer wait 5–10 minutes for a cluster to boot, and you never manage instance types. It scales instantly and is billed per second of use.


1.2 FinOps and Cluster Types

The exam heavily tests your ability to choose the most cost-efficient cluster for a given scenario.

Cluster Type | Use Case                              | Billing Rate        | Key Trait
All-Purpose  | Interactive notebooks, ad-hoc queries | Premium (expensive) | Stays alive between runs
Job Cluster  | Automated pipelines                   | Discounted (cheap)  | Spins up, runs, terminates
Cluster Pool | Reducing cold-start time              | Idle VM cost only   | Drops start time from ~5 min to ~10 sec

Exam Rule: For any question involving automated or scheduled tasks, always select Job Cluster. All-Purpose clusters are for humans interacting with notebooks in real time.

Cluster Pools keep a set of virtual machines powered on but idle. When a job needs to start, it pulls from the pool instead of provisioning new VMs from scratch — dramatically reducing cold-start latency.
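
As an illustration, the sketch below shows how a job cluster can draw its workers from a pool in a Jobs-API-2.1-style cluster definition; the pool ID, runtime version, and worker count are placeholders, not values from this guide.

# Hypothetical job-cluster spec that provisions workers from a pre-warmed pool
job_cluster_spec = {
    "job_clusters": [
        {
            "job_cluster_key": "nightly_etl_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",    # placeholder runtime
                "num_workers": 2,
                "instance_pool_id": "pool-0123456789",  # placeholder pool ID
            },
        }
    ]
}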


1.3 Git Integration and Databricks Repos

Databricks Repos allow you to sync your workspace directly with GitHub, GitLab, or Bitbucket.

Supported operations inside the Databricks UI:

  • Clone a repository
  • Pull latest changes
  • Push commits
  • Create and checkout branches
  • Commit changes

Not supported (exam trap): You cannot resolve merge conflicts or open Pull Requests inside the Databricks UI. These operations must be performed in your Git provider (GitHub, GitLab, etc.).


Module 2: Development and Data Ingestion (30%)

Together with Module 3, this is one of the two most heavily weighted domains. The focus is on the syntax and mechanics of getting raw files into Delta format reliably and incrementally.

2.1 Auto Loader (cloudFiles)

Auto Loader is Databricks' flagship tool for ingesting files from cloud storage incrementally. It is designed to handle millions of files (JSON, CSV, Parquet, Avro, etc.) without reprocessing files it has already seen.

Core mechanism: Auto Loader keeps track of every file it has already ingested. New files are discovered either through directory listing (the default) or through cloud-native file notification services (e.g., AWS SQS + SNS, enabled with cloudFiles.useNotifications), which avoids repeatedly listing large directories; in both modes, files that were already loaded are never reprocessed.

Syntax you must know:

(spark.readStream
    .format("cloudFiles")                              # Auto Loader always uses this format
    .option("cloudFiles.format", "json")               # The actual file format on disk
    .option("cloudFiles.schemaLocation", "/path/to/schema")  # Required for schema inference
    .load("/data/raw_json/")
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("bronze_table")
)

Schema Evolution: If a new column appears in tomorrow's JSON files, the stream will fail rather than silently drop the data. To absorb new columns without permanently breaking the pipeline:

.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

With this mode, Auto Loader evolves the schema of the target table when new columns are detected: the stream stops once so the stored schema can be updated, then picks up the new columns on restart (a scheduled job or DLT pipeline restarts the stream automatically).


2.2 The COPY INTO Command

COPY INTO is the SQL-native alternative to Auto Loader for incremental file loading.

Key property — Idempotency: COPY INTO maintains an internal record of every file it has already processed. Running the same command multiple times will never duplicate data.

COPY INTO target_delta_table
FROM '/path/to/raw/data/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

If you schedule this command to run every hour, it will only load files that arrived since the last run. Files already loaded are permanently tracked and skipped.

Auto Loader vs. COPY INTO — When to use which:

Scenario                                    | Recommended Tool
Millions of files, high-volume ingestion    | Auto Loader (cloudFiles)
Thousands of files, simpler SQL-based setup | COPY INTO
Need schema evolution handling              | Auto Loader
Idempotent batch loads from a known path    | COPY INTO

2.3 Structured Streaming Triggers

A streaming job running 24/7 is expensive. Databricks provides trigger modes to control when and how a stream processes data.

trigger(availableNow=True) — The most tested trigger on the exam.

(df.writeStream
    .trigger(availableNow=True)
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("silver_table")
)

Behavior: When the job runs, it processes all data that has accumulated since the last checkpoint in multiple micro-batches, then automatically shuts the stream down. It combines the efficiency of batch processing with the incremental awareness of streaming.

trigger(processingTime="5 minutes") — Processes a micro-batch on a fixed interval. The stream stays alive between batches.

trigger(once=True) — Deprecated predecessor to availableNow. It processed all available data in a single micro-batch, then shut down. availableNow scales better because it splits the backlog across multiple micro-batches and respects rate-limit options such as maxFilesPerTrigger.
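
For comparison, a minimal sketch of the other two trigger forms (the table and checkpoint paths are placeholders):

# Stream stays alive and fires a micro-batch every 5 minutes
(df.writeStream
    .trigger(processingTime="5 minutes")
    .option("checkpointLocation", "/path/to/checkpoint")
    .table("silver_table")
)

# Deprecated single-batch form, superseded by availableNow=True
# .trigger(once=True)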


Module 3: Data Processing and Transformations (31%)

The largest module by weight. Covers Delta Lake internals, file management commands, and Spark SQL syntax for complex transformations.

3.1 Delta Lake Internals

Delta Lake is not a separate database engine. It is a storage layer built on top of Parquet files, enhanced by a transaction log.

Physical structure of a Delta table:

/my_delta_table/
    _delta_log/          ← JSON transaction log (the "brain")
        00000.json
        00001.json
        00002.json
    part-00001.parquet
    part-00002.parquet

Every write operation (INSERT, UPDATE, DELETE, MERGE) appends a new JSON entry to the _delta_log. This log is what enables ACID transactions, time travel, and schema enforcement.


3.2 Time Travel

Because every version of the table is recorded in the transaction log, you can query any historical state.

-- Query by timestamp
SELECT * FROM sales TIMESTAMP AS OF '2026-05-10 12:00:00';

-- Query by version number
SELECT * FROM sales VERSION AS OF 5;

-- Restore the entire table to a previous state
RESTORE TABLE sales TO VERSION AS OF 5;

-- View the full version history
DESCRIBE HISTORY sales;
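
The same historical reads are available from the DataFrame API; a minimal sketch, assuming the sales table from the SQL examples above:

# Read a previous version or point-in-time snapshot of the Delta table
df_v5 = spark.read.option("versionAsOf", 5).table("sales")
df_ts = spark.read.option("timestampAsOf", "2026-05-10 12:00:00").table("sales")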

3.3 VACUUM

Time travel requires keeping old data files on disk. VACUUM permanently deletes data files that are no longer referenced by the current version of the table and are older than the retention threshold.

VACUUM sales;                          -- Uses default 7-day retention
VACUUM sales RETAIN 168 HOURS;         -- Explicit 7-day retention
VACUUM sales RETAIN 0 HOURS DRY RUN;   -- Preview what would be deleted

Critical exam trap: You cannot VACUUM data younger than 7 days (168 hours) by default. Attempting to do so will throw an error. This protection exists to prevent you from deleting files that active streaming queries or concurrent readers might still be referencing.

To override this protection (not recommended in production):

SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM sales RETAIN 0 HOURS;

3.4 OPTIMIZE and Z-ORDER

Over time, Delta tables accumulate thousands of tiny files from frequent small writes (the "small file problem"). OPTIMIZE solves this.

-- Compact small files into larger, efficient files (~1GB each)
OPTIMIZE sales;

-- Compact AND physically co-locate related data
OPTIMIZE sales ZORDER BY (store_id);

OPTIMIZE merges many small Parquet files into fewer large files, dramatically improving read performance and reducing metadata overhead.

Z-ORDER goes further: it physically sorts and co-locates rows with the same store_id value within the same files. When a query filters by store_id, Databricks can skip entire files that don't contain the requested value — a technique called data skipping. This can reduce the amount of data scanned by 80–95%.

When to use Z-ORDER: On columns that are frequently used in WHERE clauses, JOIN conditions, or GROUP BY operations. Limit to 1–4 columns for best results.
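
For programmatic pipelines, recent Delta Lake runtimes expose the same operations through the Python API; a minimal sketch, assuming the sales table above:

from delta.tables import DeltaTable

# Compact files and Z-ORDER by the filter column, equivalent to the SQL above
delta_table = DeltaTable.forName(spark, "sales")
delta_table.optimize().executeZOrderBy("store_id")   # use .executeCompaction() for compaction only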


3.5 Spark SQL Higher-Order Functions

These functions operate on array columns (common in JSON-sourced data) and are directly tested on the exam.

FILTER(array, condition) — Returns a new array containing only elements that satisfy the condition.

-- Return only employees with more than 5 years of experience
SELECT FILTER(employees_array, emp -> emp.years_exp > 5) AS senior_employees
FROM stores;

TRANSFORM(array, expression) — Applies a transformation to every element in an array and returns a new array of the same length.

-- Apply a 10% raise to every salary in the array
SELECT TRANSFORM(salary_array, s -> s * 1.10) AS adjusted_salaries
FROM employees;

AGGREGATE(array, zero, merge, finish) — Reduces an array to a single value (similar to reduce in programming).

-- Sum all values in an array
SELECT AGGREGATE(scores_array, 0, (acc, x) -> acc + x) AS total_score
FROM students;

EXISTS(array, condition) — Returns true if at least one element in the array satisfies the condition.

-- Find stores that have at least one employee with a manager role
SELECT EXISTS(employees_array, emp -> emp.role = 'manager') AS has_manager
FROM stores;
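
The same higher-order operations are available in the DataFrame API via pyspark.sql.functions; a minimal sketch, assuming DataFrames named after the tables in the SQL examples:

from pyspark.sql import functions as F

# FILTER and EXISTS over the employees array
stores.select(
    F.filter("employees_array", lambda emp: emp["years_exp"] > 5).alias("senior_employees"),
    F.exists("employees_array", lambda emp: emp["role"] == "manager").alias("has_manager"),
)

# TRANSFORM: apply a 10% raise to every element
employees.select(F.transform("salary_array", lambda s: s * 1.10).alias("adjusted_salaries"))

# AGGREGATE: reduce the array to a sum (the zero value's type should match the elements)
students.select(F.aggregate("scores_array", F.lit(0), lambda acc, x: acc + x).alias("total_score"))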

3.6 Common Table Expressions and Window Functions

CTEs improve readability and are heavily used in multi-step SQL transformations:

WITH monthly_sales AS (
    SELECT store_id, DATE_TRUNC('month', sale_date) AS month, SUM(amount) AS total
    FROM sales
    GROUP BY store_id, month
),
ranked_stores AS (
    SELECT *, RANK() OVER (PARTITION BY month ORDER BY total DESC) AS rank
    FROM monthly_sales
)
SELECT * FROM ranked_stores WHERE rank <= 3;

Window functions compute values across a set of rows related to the current row without collapsing them into a single output row:

SELECT
    store_id,
    sale_date,
    amount,
    SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
    LAG(amount, 1) OVER (PARTITION BY store_id ORDER BY sale_date) AS previous_sale
FROM sales;
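
The equivalent DataFrame API uses a Window specification; a minimal sketch, assuming a sales_df DataFrame with the same columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

store_window = Window.partitionBy("store_id").orderBy("sale_date")
running_window = store_window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales_df.select(
    "store_id", "sale_date", "amount",
    F.sum("amount").over(running_window).alias("running_total"),
    F.lag("amount", 1).over(store_window).alias("previous_sale"),
)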

Module 4: Productionizing Pipelines (18%)

Once your transformations work, you need to schedule, orchestrate, and automate them reliably. This module covers Databricks Workflows and Delta Live Tables.

4.1 Databricks Workflows (Jobs)

A Workflow is a directed acyclic graph (DAG) of tasks. Each task can be a Notebook, Python script, SQL query, dbt model, or a DLT pipeline.

Task dependency behavior:

Task A (Extract) --> Task B (Transform) --> Task C (Load)
  • If Task A succeeds, Task B runs normally.
  • If Task A fails, Task B is marked as "Upstream Failed" and does not execute.
  • Task C is also marked "Upstream Failed" by cascade.
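
A minimal sketch of how that chain is declared as Jobs API 2.1 task definitions (task keys and notebook paths are placeholders):

# Task B depends on A, and C depends on B — failures cascade as described above
workflow_tasks = {
    "tasks": [
        {"task_key": "extract",
         "notebook_task": {"notebook_path": "/Pipelines/extract"}},
        {"task_key": "transform",
         "depends_on": [{"task_key": "extract"}],
         "notebook_task": {"notebook_path": "/Pipelines/transform"}},
        {"task_key": "load",
         "depends_on": [{"task_key": "transform"}],
         "notebook_task": {"notebook_path": "/Pipelines/load"}},
    ]
}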

Repair and Rerun: If Task A fails and you fix the root cause, you can use "Repair Run" to re-execute only the failed task and its downstream dependents — without re-running tasks that already succeeded.

Multi-task flexibility: A single workflow can orchestrate Python notebooks, SQL notebooks, dbt Core projects, and DLT pipelines all together in one DAG.


4.2 Delta Live Tables (DLT)

DLT is a declarative framework for building the Medallion Architecture. Instead of writing imperative code that manages cluster lifecycle, checkpoints, and dependencies manually, you declare what you want each table to contain and DLT handles the rest.

Syntax comparison:

import dlt

# Bronze layer — Streaming Live Table (incremental, append-only)
@dlt.table(comment="Raw ingestion from cloud storage")
def bronze_orders():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/orders/")
    )

# Silver layer — Streaming Live Table with quality expectations
@dlt.table(comment="Cleaned and validated orders")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def silver_orders():
    return dlt.read_stream("bronze_orders").where("amount > 0")

# Gold layer — Live Table (full recompute, materialized view)
@dlt.table(comment="Aggregated daily revenue by store")
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
            .groupBy("store_id", "order_date")
            .agg({"amount": "sum"})
    )

Key distinction:

Table Type           | SQL Syntax                  | Python Decorator        | Behavior
Streaming Live Table | CREATE STREAMING LIVE TABLE | @dlt.table + readStream | Incremental, processes new data only
Live Table           | CREATE LIVE TABLE           | @dlt.table + read       | Full recompute on every pipeline run

Execution modes:

Mode       | Cost                   | Latency   | Best For
Continuous | High (runs 24/7)       | Near-zero | Real-time dashboards, fraud detection
Triggered  | Low (runs on schedule) | Minutes   | Nightly batch pipelines, cost-sensitive workloads

Module 5: Data Governance and Quality (11%)

Governance is about knowing who can see what data, and quality is about ensuring the data that flows through your pipelines is trustworthy.

5.1 Unity Catalog

Unity Catalog is the centralized, cross-workspace governance layer for Databricks. It provides a single place to manage access control, data lineage, and auditing across all workspaces in an organization.

Three-level namespace hierarchy:

Metastore
└── Catalog (e.g., main, dev, prod)
    └── Schema / Database (e.g., sales_db)
        └── Table / View / Volume

Granting access — you must grant at every level:

-- Step 1: Grant access to the catalog
GRANT USAGE ON CATALOG main TO data_team;

-- Step 2: Grant access to the specific database
GRANT USAGE ON DATABASE main.sales_db TO data_team;

-- Step 3: Grant the actual data permission
GRANT SELECT ON TABLE main.sales_db.orders TO data_team;

Exam trap: Granting SELECT on a table is not enough if the team does not also have USAGE on the parent catalog and schema. All three grants are required.

Common privilege types:

Privilege      | Scope           | Effect
USAGE          | Catalog, Schema | Allows navigation into the object
SELECT         | Table, View     | Allows reading data
MODIFY         | Table           | Allows INSERT, UPDATE, DELETE
CREATE         | Schema          | Allows creating new tables
ALL PRIVILEGES | Any             | Grants all applicable permissions

5.2 Delta Sharing

Delta Sharing is an open protocol built into Unity Catalog that allows you to share live, read-only data with external consumers — without copying the data and without requiring the recipient to have a Databricks account.

Supported recipients can use: Snowflake, BigQuery, Apache Spark, Pandas, Power BI, and any tool that supports the open Delta Sharing protocol.

Key properties:

  • Data is never copied — the recipient queries your Delta table directly through a secure token
  • You control exactly which tables, schemas, and even row/column subsets are shared
  • Access can be revoked instantly
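
On the provider side, a share is built up with Databricks SQL commands; a minimal sketch run from a Python notebook, with placeholder share and recipient names:

# Create a share, add a table, create an external recipient, and grant read access
spark.sql("CREATE SHARE IF NOT EXISTS quarterly_sales_share")
spark.sql("ALTER SHARE quarterly_sales_share ADD TABLE main.sales_db.orders")
spark.sql("CREATE RECIPIENT IF NOT EXISTS partner_analytics")
spark.sql("GRANT SELECT ON SHARE quarterly_sales_share TO RECIPIENT partner_analytics")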

5.3 DLT Data Quality Expectations

DLT provides three enforcement modes for data quality rules. Choosing the wrong one is a common exam mistake.

Mode 1 — Monitor only (keep bad rows):

@dlt.expect("valid_id", "id IS NOT NULL")
  • Bad rows are written to the table
  • The violation is recorded in the pipeline event log
  • The pipeline continues running
  • Use when: you want visibility into data quality issues without blocking the pipeline

Mode 2 — Drop bad rows:

@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
  • Bad rows are silently dropped and never written
  • Good rows are written normally
  • The pipeline continues running
  • Use when: bad records are expected occasionally and downstream tables should only contain clean data

Mode 3 — Fail the pipeline:

@dlt.expect_or_fail("valid_id", "id IS NOT NULL")
  • Any bad row causes the entire pipeline to stop immediately
  • No data is written for that batch
  • Use when: data quality is non-negotiable (financial transactions, regulatory reporting) and writing bad data would corrupt downstream systems

Decision guide:

Expectation Mode | Bad Row Fate  | Pipeline Fate     | Use Case
expect           | Kept, flagged | Continues         | Monitoring and observability
expect_or_drop   | Dropped       | Continues         | Normal data cleaning
expect_or_fail   | Blocks write  | Fails immediately | Critical, zero-tolerance data
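
When a table needs several rules at once, DLT also provides dictionary-based variants of the same three modes; a minimal sketch with example rule names:

import dlt

@dlt.table(comment="Orders that passed all quality rules")
@dlt.expect_all_or_drop({
    "valid_order_id": "order_id IS NOT NULL",
    "positive_amount": "amount > 0",
})
def validated_orders():
    # expect_all and expect_all_or_fail follow the same pattern for the other modes
    return dlt.read_stream("silver_orders")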

Quick-Reference Cheat Sheet

This section consolidates the most exam-critical facts into a single scannable reference.

Architecture

  • Control Plane = Databricks' cloud (UI, notebooks, job scheduler)
  • Data Plane = Your cloud (VMs, S3/ADLS/GCS)
  • Serverless = Compute in Databricks' cloud, instant scale, no cluster management

Cluster Selection Rules

  • Automated job → Job Cluster (cheap, terminates after run)
  • Interactive development → All-Purpose Cluster (expensive, stays alive)
  • Fast cold starts → Cluster Pool (idle VMs on standby)

Ingestion Commands

# Auto Loader — high-volume incremental file ingestion
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/schema/path")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")  # handle new columns
    .load("/raw/path"))

# Streaming trigger — process all available data then stop
.trigger(availableNow=True)
-- COPY INTO — SQL-native idempotent batch load
COPY INTO target_table FROM '/raw/path' FILEFORMAT = CSV;

Delta Lake Commands

-- Time travel
SELECT * FROM table TIMESTAMP AS OF '2026-05-10';
SELECT * FROM table VERSION AS OF 5;
RESTORE TABLE table TO VERSION AS OF 5;
DESCRIBE HISTORY table;

-- File management
OPTIMIZE table ZORDER BY (column);   -- compact files + data skipping
VACUUM table RETAIN 168 HOURS;       -- delete old files (min 7 days default)

-- Cannot VACUUM below 7 days without disabling safety check
SET spark.databricks.delta.retentionDurationCheck.enabled = false;

Higher-Order Functions

FILTER(array_col, x -> x.field > value)       -- filter array elements
TRANSFORM(array_col, x -> x * 1.1)            -- apply function to each element
AGGREGATE(array_col, 0, (acc, x) -> acc + x)  -- reduce array to single value
EXISTS(array_col, x -> x.field = 'value')     -- boolean: any element matches

Unity Catalog Grants

-- Always grant at catalog, schema, AND table level
GRANT USAGE ON CATALOG catalog_name TO role;
GRANT USAGE ON DATABASE catalog.schema TO role;
GRANT SELECT ON TABLE catalog.schema.table TO role;

DLT Expectations

@dlt.expect("rule_name", "condition")           # keep bad rows, log violation
@dlt.expect_or_drop("rule_name", "condition")   # drop bad rows, continue
@dlt.expect_or_fail("rule_name", "condition")   # bad row = pipeline failure

DLT Table Types

# Streaming Live Table — incremental, Bronze/Silver
@dlt.table
def my_table():
    return spark.readStream...

# Live Table — full recompute, Gold aggregations
@dlt.table
def my_aggregate():
    return dlt.read("upstream_table").groupBy(...)

Git in Databricks Repos

  • Supported: clone, pull, push, commit, branch checkout
  • Not supported in UI: merge conflict resolution, pull requests

VACUUM Retention Rule

  • Default minimum retention: 7 days / 168 hours
  • Attempting to vacuum below this threshold throws an error
  • Override requires explicitly disabling the retention check (not recommended in production)