Domain 1 โ Developing and Optimizing dbt Models
Concepts
This is the highest-yield domain. You must know how dbt turns modular SQL files into a dependency graph and production data objects.
Key resource types:
| Resource |
What it represents |
Common exam signal |
| Model |
A SQL or Python transformation managed by dbt |
Build clean staging/intermediate/mart layers |
| Source |
Raw data object loaded outside dbt |
Use when referring to raw tables |
| Seed |
Static CSV version-controlled in the project |
Small lookup/mapping/reference data |
| Snapshot |
Point-in-time history of mutable source records |
SCD-style history where source overwrites changes |
| Macro |
Reusable Jinja logic |
Repeated SQL pattern or generated logic |
| Test |
Data assertion |
Validate uniqueness, not null, relationships, accepted values, business rules |
| Exposure |
Downstream asset such as BI dashboard, notebook, ML job |
Show downstream dependency and ownership |
| Package |
External reusable dbt project |
Shared macros/models/tests from dbt Hub or Git |
ref() and source()
ref() is for dbt models. source() is for raw objects loaded outside dbt.
| Need |
Use |
Why |
A mart depends on stg_orders |
{{ ref('stg_orders') }} |
Creates DAG dependency and environment-aware relation name |
A staging model reads raw stripe.payments |
{{ source('stripe', 'payments') }} |
Documents raw dependency and supports source tests/freshness |
A model directly queries analytics_prod.stg_orders |
Replace with ref() |
Hard-coding breaks lineage and environment portability |
A model directly queries raw.shopify.orders |
Replace with source() |
Raw dependencies belong in source YAML |
Exam trap: If the question says models build in the wrong order, do not choose a scheduler workaround. dbt order comes from ref() dependencies.
Materializations
| Materialization |
Use when |
Avoid when |
Exam trap |
| View |
Logic should stay lightweight and always query fresh upstream data |
Heavy repeated dashboard queries need fast performance |
View does not store results; it can push cost to query time |
| Table |
Model is expensive to compute and queried often |
Data changes frequently and full rebuild is too expensive |
Table rebuilds entire relation each run |
| Incremental |
Large table, small new/changed subset per run |
Large percentage updates each run or logic cannot isolate changes |
Requires correct filter and unique key/strategy where needed |
| Ephemeral |
Reusable intermediate logic not queried directly |
Many downstream refs create repeated SQL; business users need to query it |
Ephemeral is inlined as CTE, not created as a database object |
| Seed |
Small static CSV controlled in git |
Large dynamic data or frequently updated operational data |
Seeds are not an ingestion system |
| Snapshot |
Track historical changes in mutable source records |
You only need latest state or immutable event data |
Snapshot is not a materialization for performance |
Incremental models
Use incremental when a model has many rows and only a small subset is added or changed each run.
Core reasoning:
is_incremental() gates logic that should only run on incremental runs.
- The SQL must be valid for both full-refresh and incremental runs.
- Use a reliable event/update timestamp or high-water mark.
- Use
unique_key and an incremental strategy when records can update.
- Use
--full-refresh when logic changes require rebuilding historical rows.
- Schema changes may require
on_schema_change handling or full refresh depending on the warehouse and change type.
Common bad answers:
- โIncremental models are always rebuilt.โ False.
- โIncremental models are always best for large tables.โ Not if most rows change each run.
- โYou never need
is_incremental().โ Usually false for selective processing.
- โUse ephemeral to make a large dashboard model faster.โ Usually wrong; ephemeral can duplicate heavy SQL.
Sources
A source maps to a raw data location, commonly database + schema, with tables underneath. Use sources to centralize raw object naming and document external dependencies.
Good source YAML includes:
- source name;
- database/schema where needed;
- tables;
- source and column descriptions;
- tests on raw data assumptions;
- freshness where timeliness matters;
- loaded timestamp field for freshness checks.
Exam trap: If multiple raw tables are in the same database/schema, they are usually one source with multiple tables, not multiple sources.
Modularity and DRY SQL
Good dbt modeling decomposes SQL into layers:
| Layer |
Typical purpose |
Typical materialization |
| Staging |
Clean, rename, cast, standardize one source |
View, sometimes ephemeral/table |
| Intermediate |
Reusable transformations and joins |
Ephemeral, view, table depending on cost |
| Mart |
Business-facing facts/dimensions |
Table/incremental for performance |
Use macros for reusable logic patterns, not for hiding business-critical model lineage. Use models when you need DAG visibility and testable transformation steps.
Jinja, variables, and environment config
| Feature |
Use case |
Trap |
var() |
Project variables provided in dbt_project.yml or CLI |
Do not store secrets in vars |
env_var() |
Environment-specific values and secrets |
Must be available in runtime environment |
target |
Branch logic by target/profile |
Avoid excessive target-specific model logic that makes behavior hard to test |
config() |
Model-level configs in SQL |
Remember config precedence |
dbt_project.yml |
Default project/folder configs |
Bad indentation or wrong resource path breaks expectations |
Python models
Python models are used for transformations easier in Python than SQL, such as advanced data science-style transformations. Exam reasoning remains dbt-native:
- They are still models in the DAG.
- They can use
dbt.ref() and dbt.source().
- They are not a replacement for simple SQL transformations.
- Support depends on the adapter/platform.
Git workflow
Expected git skills:
- create feature branches;
- commit changes;
- pull from main/head branch to stay updated;
- resolve conflicts;
- open pull requests;
- use CI before merging.
Exam trap: If the question says your branch is behind main, the correct general action is to pull/reconcile with the head branch, not manually copy files or merge straight into production.
Patterns
- Raw table reference โ define source + use
source().
- Model-to-model dependency โ use
ref().
- Repeated business logic โ refactor into staging/intermediate models or macros.
- Expensive dashboard model โ table or incremental.
- Append-only high-volume data โ incremental.
- Static mapping table โ seed.
- Source overwrites values but history needed โ snapshot.
- Direct warehouse object permissions required โ grants.
Traps
- Choosing a materialization only because it is โfasterโ without considering freshness, cost, and query pattern.
- Using hard-coded schema names in dbt models.
- Making every model a table, which increases rebuild time/storage.
- Making every staging model ephemeral, which can duplicate heavy SQL downstream.
- Treating seeds as ingestion for operational data.
- Using macros where a model would provide better lineage and testing.
- Forgetting that a table model fully rebuilds by default.