The Databricks Certified Data Engineer Associate exam tests whether you can work effectively on the Databricks platform. Beyond service names, you need to know best practices for Delta Lake, Unity Catalog, Spark performance, and orchestration. This guide covers the patterns that show up most often.
## Delta Lake Best Practices

Delta Lake is the default storage layer. The exam expects you to prefer Delta tables over raw files or unmanaged folders in almost every scenario.

Why Delta over raw files:

| Delta Lake Benefit | What It Means |
|---|---|
| ACID transactions | Reliable reads and writes, even with concurrent access |
| Schema enforcement | Prevents bad data from being written |
| Schema evolution | Add columns without breaking existing pipelines |
| Time travel | Query previous versions of data |
| Audit history | Track all changes via the transaction log |

The exam pattern: "A team stores curated data as Parquet files in cloud storage. They need ACID transactions and time travel." Move to Delta Lake. Raw Parquet files don't provide transaction semantics.

OPTIMIZE and Z-ORDER:

- OPTIMIZE compacts small files into larger ones for faster reads.
- Z-ORDER co-locates related data within files so that more files can be skipped at query time.
- Run OPTIMIZE after large writes or when query performance degrades.

VACUUM: Removes data files that the table no longer references and that are older than the retention period. The default retention is 7 days (168 hours). Setting retention too low risks breaking time travel.
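The maintenance commands above can be sketched as small Python helpers that build the SQL text. This is a minimal sketch, not an official API: the table name `main.sales.orders`, the column `customer_id`, and the helper names are hypothetical, and the 168-hour floor in `vacuum_sql` is a guard chosen for this example to mirror the 7-day default retention. On a cluster, each string would be passed to `spark.sql(...)`.

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # 7-day default retention, expressed in hours


def optimize_sql(table, zorder_cols=()):
    """Build an OPTIMIZE statement, optionally with a ZORDER BY clause."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += " ZORDER BY (" + ", ".join(zorder_cols) + ")"
    return stmt


def vacuum_sql(table, retain_hours=DEFAULT_RETENTION_HOURS):
    """Build a VACUUM statement.

    Refusing anything below the default is this sketch's own safety choice.
    """
    if retain_hours < DEFAULT_RETENTION_HOURS:
        raise ValueError("retention below 7 days can break time travel")
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


def time_travel_sql(table, version):
    """Build a query that reads an earlier version of the table."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


statements = [
    optimize_sql("main.sales.orders", ["customer_id"]),
    vacuum_sql("main.sales.orders"),
    time_travel_sql("main.sales.orders", 3),
]
for stmt in statements:
    print(stmt)  # on Databricks: spark.sql(stmt)
```

Keeping the retention check in one helper makes it harder to shorten retention by accident, which is the failure mode the exam likes to probe.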
## Unity Catalog Governance

Unity Catalog is the central governance layer. The exam tests it heavily:

| Concept | What It Is | Exam Rule |
|---|---|---|
| Catalog | Top-level container (usually maps to a business unit or environment) | Organize by team or project |
| Schema | Container within a catalog (database equivalent) | Group related tables |
| Managed table | Databricks-managed storage and lifecycle | Prefer for most use cases |
| External table | Customer-managed storage location | Use when data must stay in a specific location |
| Volume | Storage for non-tabular data (files, models) | Use instead of DBFS for files |

Managed vs External tables:

- Managed: Databricks controls the storage. Dropping the table deletes the data.
- External: You control the storage location. Dropping the table only removes the metadata.

The exam pattern: "A company wants governed tables with centralized access control and lineage tracking." Unity Catalog managed tables. Not raw files in DBFS.

Security in Unity Catalog:

- GRANT/REVOKE at the catalog, schema, or table level (column-level restrictions come from column masks or dynamic views)
- Row filters restrict which rows a user or group can see
- Column masks mask column values based on user/group
- ABAC (Attribute-Based Access Control) using tags and policies

Least privilege pattern: Grant USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on specific tables. Not "grant all privileges on everything."
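The least privilege pattern can be sketched as a helper that emits only the grants a read-only group needs. Treat this as an illustration under assumptions: the catalog `main`, schema `sales`, table names, and the `analysts` group are hypothetical, and the helper itself is not part of any Databricks library. Unity Catalog names the traversal privileges `USE CATALOG` and `USE SCHEMA`.

```python
def least_privilege_grants(catalog, schema, tables, principal):
    """Return GRANT statements that give read access to the listed tables only."""
    grants = [
        # The principal must be able to traverse the catalog and the schema...
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`",
    ]
    # ...and gets SELECT on each named table, nothing broader.
    for table in tables:
        grants.append(
            f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{principal}`"
        )
    return grants


grants = least_privilege_grants("main", "sales", ["orders", "customers"], "analysts")
for stmt in grants:
    print(stmt)  # on Databricks: spark.sql(stmt)
```

Granting to a group rather than to individual users keeps the grant list short and makes access reviews easier.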
## Spark Performance Tuning

The exam tests Spark performance diagnosis through Spark UI metrics:

| Metric | Problem | Fix |
|---|---|---|
| Data skew | One task processes most of the data | Repartition, use salting, or AQE skew join handling |
| Excessive shuffle | Too much data moved between stages | Use broadcast joins, reduce data before joining |
| Spill to disk | Partitions too large for executor memory | Increase executor memory or increase the partition count so each partition is smaller |
| Long GC time | Too much object creation/reclamation | Cache selectively, reduce shuffle |
| Many small files | Inefficient reads | OPTIMIZE (compaction) |

Broadcast joins: When one table is small (below the auto-broadcast threshold, 10 MB by default), broadcast it to all executors. This avoids shuffling the large table. The exam uses this as the fix for "join is slow and causes large shuffle."

Adaptive Query Execution (AQE): Spark's runtime optimization. Handles skew joins, coalesces shuffle partitions, and can change join strategies at runtime. The exam tests whether you know AQE exists and what it does.

Caching strategy: Cache DataFrames that are accessed multiple times within a notebook. Don't cache everything: excess caching wastes memory.
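To see why salting helps with data skew, here is a pure-Python simulation rather than Spark code. It is a sketch under assumptions: `zlib.crc32` stands in for the shuffle partitioner, the key names are made up, and the salt is assigned round-robin so the run is deterministic (in Spark the salt is typically random, and the small side of the join is replicated once per salt value).

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8


def partition_for(key):
    """Stable stand-in for a shuffle partitioner: hash the key, mod partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def largest_partition(keys):
    """Row count of the busiest partition, i.e. the slowest task."""
    return max(Counter(partition_for(k) for k in keys).values())


def salt_large_side(keys):
    """Append a salt to each key; round-robin here so the result is deterministic."""
    return [f"{key}#{i % NUM_SALTS}" for i, key in enumerate(keys)]


def replicate_small_side(keys):
    """Copy every small-side key once per salt so salted keys still find a match."""
    return {f"{key}#{salt}" for key in keys for salt in range(NUM_SALTS)}


# 90% of the rows share one key, which is the skew described in the table.
fact_keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]
dim_keys = set(fact_keys)

salted_fact = salt_large_side(fact_keys)
salted_dim = replicate_small_side(dim_keys)

print("largest partition before salting:", largest_partition(fact_keys))
print("largest partition after salting: ", largest_partition(salted_fact))
print("all salted rows still join:", all(k in salted_dim for k in salted_fact))
```

The trade-off is visible in `replicate_small_side`: the small side grows by a factor of `NUM_SALTS`, so salting only pays off when the skewed side dominates the cost.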
## Lakeflow Jobs

Lakeflow Jobs is the current name for the orchestration feature previously called Databricks Jobs. The exam tests:

| Feature | What It Does |
|---|---|
| Task dependencies | Task B runs after Task A completes |
| Scheduling | Cron-based or continuous |
| Triggers | File arrival triggers, table update triggers |
| Retries | Automatic retry on failure |
| Parameters | Pass values between tasks or from job config |
| Notifications | Alert on success/failure |

The exam pattern: "A company needs a transformation to run after new files arrive in cloud storage." Lakeflow Job with a file arrival trigger. Not a scheduled job that runs every hour.

CI/CD with Databricks Asset Bundles:

1. Define jobs, notebooks, and configurations as code
2. Store them in a Git repository
3. Use branches for dev, test, and prod
4. Use pull requests for code review
5. Deploy bundles to different environments

The exam pattern: "A team wants to promote a transformation from dev to production with version control and code reviews." Git Folders + branches + pull requests + Asset Bundles. Not "manual notebook copy."
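Task dependencies form a directed graph, and the scheduler runs a task only once everything it depends on has finished. The sketch below resolves that order locally with Python's standard `graphlib` module (Python 3.9+). The job name and task keys are hypothetical; the `task_key` and `depends_on` field names follow the shape used in Databricks job definitions, but this is a local illustration, not a call to the Jobs API.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition, deliberately listed out of order.
job = {
    "name": "nightly_orders",
    "tasks": [
        {"task_key": "publish", "depends_on": [{"task_key": "transform"}]},
        {"task_key": "transform", "depends_on": [{"task_key": "ingest"}]},
        {"task_key": "ingest"},
    ],
}


def run_order(job_spec):
    """Return task keys in an order that respects every depends_on edge."""
    # Map each task to the set of tasks that must finish before it starts.
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job_spec["tasks"]
    }
    # static_order() raises graphlib.CycleError if the dependencies loop.
    return list(TopologicalSorter(graph).static_order())


order = run_order(job)
print(order)  # ['ingest', 'transform', 'publish']
```

Declaring dependencies instead of relying on list order is what lets the same definition be stored as code and deployed unchanged to each environment.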
## Ingestion Patterns

| Method | Use When | Avoid When |
|---|---|---|
| COPY INTO | Incremental file loads from cloud storage, simple file tracking | Continuous ingestion, complex schema evolution |
| Auto Loader | Scalable file ingestion with checkpointing, schema inference | Simple one-off loads |
| Lakeflow Connect | Managed connectors for SaaS, databases | Sources not supported by connectors |
| JDBC | Direct database connections | Large-scale bulk loads (use Auto Loader instead) |

The exam pattern: "A company needs to incrementally load CSV files from S3 into a Delta table, tracking which files have been processed." COPY INTO or Auto Loader. For simple incremental loads, COPY INTO. For schema evolution and checkpointing, Auto Loader.
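The two file-based options can be compared side by side in a short sketch. The table name, bucket path, and state path are hypothetical, and both helpers are illustrative wrappers rather than Databricks APIs. `copy_into_sql` only builds SQL text, while `start_auto_loader` expects a live `spark` session on a Databricks cluster and is defined here without being called.

```python
def copy_into_sql(target_table, source_path, file_format="CSV", header=True):
    """Build a COPY INTO statement; COPY INTO skips files it has already loaded."""
    header_value = "true" if header else "false"
    return (
        f"COPY INTO {target_table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {file_format}\n"
        f"FORMAT_OPTIONS ('header' = '{header_value}')"
    )


def start_auto_loader(spark, source_path, target_table, state_path):
    """Start an Auto Loader stream that processes the available files, then stops."""
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        # Where the inferred schema and its later changes are tracked.
        .option("cloudFiles.schemaLocation", f"{state_path}/schema")
        .option("header", "true")
        .load(source_path)
    )
    return (
        stream.writeStream
        # The checkpoint records which files have already been ingested.
        .option("checkpointLocation", f"{state_path}/checkpoint")
        .trigger(availableNow=True)
        .toTable(target_table)
    )


print(copy_into_sql("main.sales.orders_raw", "s3://example-bucket/orders/"))
```

Both approaches avoid reprocessing files: COPY INTO tracks loaded files against the target table, while Auto Loader keeps that state in the checkpoint location.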