Cert-Pass
Azure | May 30, 2026 | 8 min read

DP-700 Tutorial: Build a Microsoft Fabric Lakehouse

Learn how to build a complete Microsoft Fabric lakehouse solution with pipelines, notebooks, and semantic models. Hands-on DP-700 tutorial for data engineers.

So you are studying for the DP-700 and want to get your hands dirty with Microsoft Fabric. If you searched for a DP-700 tutorial, you're in the right place. Reading documentation is fine, but building something real is what cements the knowledge. Let us walk through creating a complete lakehouse solution from scratch: ingestion, transformation, orchestration, and serving.

What We Are Building

We will build a customer analytics lakehouse that ingests CSV files from Azure Blob Storage, transforms raw data into curated Delta tables using a PySpark notebook orchestrated by a pipeline, and serves the results through a semantic model for Power BI consumption.

This scenario covers the most common DP-700 exam patterns: lakehouses, pipelines, notebooks, Dataflows Gen2, and semantic models.

Step 1: Create Your Workspace and Lakehouse

Every Fabric solution starts with a workspace. Create a new workspace in the Fabric portal (app.fabric.microsoft.com) and assign it to the appropriate capacity. Inside the workspace, create a Lakehouse called "CustomerAnalytics."

When you create a lakehouse, Fabric automatically provisions:

  • A OneLake-backed storage area (files and tables in Delta format)
  • A SQL analytics endpoint for T-SQL queries
  • A default schema for table creation

Upload your raw customer CSV files to the lakehouse Files folder. In the Fabric portal, navigate to your lakehouse, click "Upload files," and place them in a folder called raw/customers/.

The bronze layer pattern puts raw files exactly where they land, without transformation. This gives you a replayable source of truth.

| Fabric Item | Best For | When to Choose |
|---|---|---|
| Lakehouse | Raw files, PySpark transformations, Delta tables | Code-first data engineering with Spark |
| Warehouse | Relational analytics, T-SQL, dimensional models | SQL-based analytics serving Power BI |
| Notebook | Complex PySpark/SQL transformations | Custom logic, iterative development |
| Pipeline | Orchestration, scheduling, multi-step workflow | Coordinating ingestion and transformation |
| Dataflow Gen2 | Low-code Power Query transformations | Business analyst-driven ETL |
| Eventhouse | Real-time analytics on streaming data | KQL queries on event/telemetry data |

Step 2: Build a Pipeline for Ingestion Orchestration

Pipelines are Fabric's orchestration engine. They handle scheduling, dependencies, parameters, and activity sequencing. Here is how to build one for our scenario.

Create a new Data Pipeline called "Ingest and Transform Customer Data." Add these activities:

Activity 1: Get Metadata. Point this at the raw/customers/ folder to list available files. This makes the pipeline dynamic: it will pick up new files automatically.

Activity 2: ForEach loop. Loop over each file from the Get Metadata output. Inside the loop, add:

Activity 3: Notebook. Call a PySpark notebook (we will create it in the next step) that reads the raw CSV, cleans it, and writes to a Delta table in the Tables area of the lakehouse.

Configure the pipeline to run daily using a schedule trigger. Set parameters for the source folder path so you can reuse this pipeline for different data sources.
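
If it helps to picture the control flow, here is a plain-Python analogy of the three activities. It is not Fabric pipeline code (pipelines are configured in the portal); `list_files` and `run_pipeline` are hypothetical names standing in for Get Metadata and the ForEach loop, and the `transform_file` callback stands in for the Notebook activity.

```python
from pathlib import Path
from typing import Callable


def list_files(source_folder: Path) -> list[str]:
    """Stand-in for Get Metadata: return the file names found in a folder."""
    return sorted(p.name for p in source_folder.iterdir() if p.is_file())


def run_pipeline(source_folder: Path, transform_file: Callable[[Path], None]) -> list[str]:
    """Stand-in for ForEach + Notebook: call the transform once per file."""
    processed = []
    for name in list_files(source_folder):
        transform_file(source_folder / name)  # the Notebook activity would run here
        processed.append(name)
    return processed
```

Because the folder is an argument rather than a constant, the same flow can be pointed at another source, which is what the pipeline parameter buys you.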

Step 3: Write the Transformation Notebook

Notebooks in Fabric support PySpark, SQL, and Scala. For most DP-700 scenarios, PySpark is the default choice because it handles complex transformations, schema evolution, and file format conversion.

Create a new Notebook called "Transform Customer Data." Here is the logic it should implement:

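Read raw data: Use Spark to read CSV files from the lakehouse Files area. With the lakehouse attached as the notebook's default, Spark accepts the relative path Files/raw/customers/. The mounted path /lakehouse/default/Files/raw/customers/ is the one to use from non-Spark libraries such as pandas.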
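
A minimal sketch of the read step, assuming the lakehouse is attached as the notebook's default lakehouse (so Spark can resolve the relative path `Files/raw/customers/`) and that the CSV files have a header row. `read_raw_customers` is a hypothetical helper name; in a Fabric notebook the `spark` session is already defined for you.

```python
def read_raw_customers(spark, path="Files/raw/customers/"):
    """Read every CSV under `path` into one Spark DataFrame.

    `spark` is the SparkSession that the notebook provides.
    The header row is an assumption about the source files.
    """
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # let Spark guess types (costs an extra pass)
        .csv(path)
    )
```

In a notebook cell you would call it as `raw_df = read_raw_customers(spark)`. For production loads, an explicit schema is usually safer than inference.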

Apply transformations:

  • Standardize column names (lowercase, underscores instead of spaces)
  • Parse date fields into proper timestamp types
  • Handle missing values: fill numeric nulls with 0 and string nulls with "Unknown"
  • Deduplicate records based on customer ID
  • Add audit columns: ingestion_timestamp, source_file_name
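
The transformation list above could look roughly like this in PySpark. Treat it as a sketch under assumptions: the column names `signup_date` and `customer_id` are hypothetical, and `to_snake_case` and `clean_customers` are illustrative helper names, not part of any Fabric API.

```python
import re


def to_snake_case(name: str) -> str:
    """'Customer ID' -> 'customer_id': trim, replace non-alphanumerics with '_', lowercase."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def clean_customers(df, date_cols=("signup_date",), key="customer_id"):
    """Apply the cleaning steps to a Spark DataFrame of raw customer rows."""
    # Imported here so to_snake_case stays usable without a Spark installation.
    from pyspark.sql import functions as F

    # Audit column: record which file each row came from.
    df = df.withColumn("source_file_name", F.input_file_name())
    # Standardize column names.
    df = df.toDF(*[to_snake_case(c) for c in df.columns])
    # Parse date fields into timestamps.
    for col in date_cols:
        df = df.withColumn(col, F.to_timestamp(col))
    # fillna only touches columns whose type matches the fill value.
    df = df.fillna(0).fillna("Unknown")
    # Keep one row per customer (which duplicate survives is not deterministic).
    df = df.dropDuplicates([key])
    # Audit column: when this run processed the row.
    return df.withColumn("ingestion_timestamp", F.current_timestamp())
```

If you need a specific duplicate to win (for example the most recent one), replace `dropDuplicates` with a window function ordered by a modification timestamp.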

Write curated output: Write the cleansed data as a Delta table in the lakehouse Tables area. Use merge logic (upsert) instead of overwrite so the notebook can be re-run without losing existing data.
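
Here is one way the upsert could be written with the Delta Lake Python API. It assumes the target table already exists (the first load can create it with `saveAsTable`) and that `customer_id` is the key; the table name `customers` and both helper names are illustrative.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the join predicate for a Delta MERGE, e.g. 't.customer_id = s.customer_id'."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_customers(spark, updates_df, table_name="customers", keys=("customer_id",)):
    """Upsert `updates_df` into an existing Delta table, matching rows by key."""
    # Imported here so merge_condition can be used without Delta Lake installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forName(spark, table_name)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()     # existing customer: overwrite with the new values
        .whenNotMatchedInsertAll()  # new customer: insert the row
        .execute()
    )
```

Re-running the notebook with the same input leaves the table unchanged, because matched rows are updated in place rather than appended again.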

The Bronze/Silver/Gold pattern is the architectural backbone here. Raw data lands in the bronze layer (files), gets cleansed into the silver layer (Delta tables), and gets aggregated into the gold layer (star schema for BI). The DP-700 tests whether you know which transformation engine to use when: notebooks for complex code logic, Dataflows Gen2 for low-code scenarios, and T-SQL for relational transformations in warehouses.

Step 4: Implement Incremental Loading

Full reloads work for small datasets, but enterprise data grows. The DP-700 tests incremental loading techniques heavily.

Watermark pattern: Store the last processed timestamp or ID in a watermark table. Each pipeline run queries the watermark, processes only records newer than the watermark, then updates the watermark value.
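
The read, filter, update cycle is easier to see in a small runnable example. This sketch uses an in-memory SQLite database as a stand-in for both the source system and the watermark table; the table and column names are made up for illustration, and in Fabric the same steps would be spread across pipeline activities or notebook cells.

```python
import sqlite3


def load_increment(conn: sqlite3.Connection) -> list[tuple]:
    """Return source rows newer than the stored watermark, then advance it."""
    (last_seen,) = conn.execute(
        "SELECT last_modified FROM watermark WHERE table_name = 'customers'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, name, modified_at FROM source_customers "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,),
    ).fetchall()
    if rows:
        # Advance the watermark only after the new rows were read successfully.
        conn.execute(
            "UPDATE watermark SET last_modified = ? WHERE table_name = 'customers'",
            (rows[-1][2],),
        )
        conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_customers (id INTEGER, name TEXT, modified_at TEXT);
    CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_modified TEXT);
    INSERT INTO watermark VALUES ('customers', '2024-01-01T00:00:00');
    INSERT INTO source_customers VALUES
        (1, 'Ada',   '2023-12-31T09:00:00'),
        (2, 'Grace', '2024-01-02T10:00:00');
""")
first_run = load_increment(conn)   # only the row modified after the watermark
second_run = load_increment(conn)  # nothing new since the first run
```

The ISO-8601 timestamps are stored as text, which works here because that format sorts chronologically as a string.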

Change data capture: If your source system supports CDC, use it to identify insert, update, and delete operations. Apply each operation to the target Delta table accordingly.
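
To make "apply each operation accordingly" concrete, here is a pure-Python sketch that replays CDC events against a dictionary keyed by customer ID. The event shape (`op`, `id`, `row`) is an assumption for illustration; real CDC feeds differ by source system.

```python
def apply_changes(target: dict, changes: list[dict]) -> dict:
    """Apply ordered CDC events to a target keyed by customer ID."""
    for change in changes:
        key, op = change["id"], change["op"]
        if op in ("insert", "update"):
            # Both end in the same state: this key now holds these values.
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)  # tolerate deletes for keys that are already gone
        else:
            raise ValueError(f"unknown CDC operation: {op!r}")
    return target


target = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
changes = [
    {"op": "update", "id": 1, "row": {"name": "Ada L."}},
    {"op": "delete", "id": 2},
    {"op": "insert", "id": 3, "row": {"name": "Edsger"}},
]
result = apply_changes(target, changes)
```

Against a Delta table, the same three branches map onto the clauses of a single MERGE: update when matched, delete when matched and flagged as deleted, insert when not matched.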

Shortcut vs. copy: If the source data does not need transformation, consider a OneLake shortcut instead of copying data. Shortcuts reference data in place without duplication, which reduces storage costs and keeps the data fresh. The exam likes scenarios where a shortcut is a better answer than a pipeline copy.

Step 5: Configure Security and Governance

Enterprise Fabric solutions require proper security boundaries. The DP-700 tests multiple security layers:

Workspace roles: Assign Admin, Member, Contributor, or Viewer roles at the workspace level. Only workspace admins can manage workspace settings.

Item permissions: Share individual items (lakehouses, notebooks, pipelines) with specific users, independent of workspace role.

Row-level security (RLS): Define row-level filters in the SQL analytics endpoint using T-SQL. For example, restrict sales data to each salesperson's region.

Column-level security: Block access to specific columns containing sensitive data like social security numbers or salary information.

OneLake security: Control access at the folder and file level within the lakehouse. This is the most granular security layer in Fabric.

Sensitivity labels: Apply Microsoft Purview sensitivity labels to classify data. Labels propagate across Fabric and downstream Power BI reports, giving you consistent data governance.

Step 6: Monitor and Optimize Performance

Once your solution runs in production, monitoring is critical. The DP-700 devotes an entire domain to monitoring and optimization.

Pipeline monitoring: Check the pipeline run history for failures, duration trends, and activity-level diagnostics. Failed activities show error messages in the output JSON. Common issues include permission errors, schema drift in source files, and timeout limits.

Notebook performance: Use Spark job metrics to identify slow stages, data skew, and excessive shuffle operations. If a PySpark job is slow, try:

  • Repartitioning data to match the cluster size
  • Caching intermediate DataFrames that are reused
  • Avoiding wide transformations on large datasets
  • Enabling Spark adaptive query execution

Lakehouse table maintenance: Delta tables require periodic maintenance. Run OPTIMIZE to compact small files into larger ones, and VACUUM to remove old files beyond the retention threshold. These operations improve query performance significantly.
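
Both commands can be issued as Spark SQL from a notebook. The sketch below assumes a table named `customers` and a 168-hour retention window, which mirrors Delta Lake's default of seven days; the helper names are illustrative.

```python
def maintenance_statements(table: str, retain_hours: int = 168) -> list[str]:
    """Spark SQL statements for routine Delta maintenance on one table."""
    return [
        f"OPTIMIZE {table}",                            # compact small files into larger ones
        f"VACUUM {table} RETAIN {retain_hours} HOURS",  # delete unreferenced files older than the window
    ]


def run_maintenance(spark, table: str, retain_hours: int = 168) -> None:
    """Run the statements with the notebook's SparkSession."""
    for statement in maintenance_statements(table, retain_hours):
        spark.sql(statement)
```

Calling `run_maintenance(spark, "customers")` on a schedule keeps file counts in check. Keep in mind that a shorter retention window also shortens how far back time travel can reach.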

Warehouse query tuning: If you move data to a Fabric warehouse for relational analytics, use query execution plans to identify missing statistics, poor join strategies, and table scans that should be seeks.

Step 7: Set Up Deployment Pipelines

Moving from development to production requires controlled promotion. Fabric deployment pipelines let you promote items (lakehouses, notebooks, pipelines, semantic models) from dev to test to production workspaces with comparison views and deployment rules.

Connect your dev workspace to a Git repository (Azure DevOps or GitHub) for version control. Each Fabric item type maps to a specific file format in the repository, enabling pull request reviews and rollback capabilities. The DP-700 tests whether you know when to use deployment pipelines (environment promotion) versus Git integration (source control and collaboration). They complement each other.

FAQ

What is the difference between a lakehouse and a warehouse in Fabric?

A lakehouse handles files and Spark-based transformations using PySpark, SQL, and Scala. A warehouse is optimized for T-SQL analytics and dimensional modeling. Use a lakehouse for raw and curated data engineering; use a warehouse for relational analytics serving Power BI.

When should I use a notebook versus a Dataflow Gen2?

Use a notebook for complex transformations requiring PySpark, custom logic, or iterative development. Use a Dataflow Gen2 for low-code, Power Query-based transformations that business analysts can maintain.

How do incremental loads work in Fabric pipelines?

Incremental loads use watermark tables, timestamps, or change data capture to process only new or changed records. The pipeline reads the watermark, fetches new data, transforms it, writes it, then updates the watermark.

What is the Bronze/Silver/Gold pattern?

Bronze: raw data as-is from source. Silver: cleansed, validated, deduplicated data. Gold: business-ready aggregations and star schemas optimized for reporting.

How do I handle schema drift in source files?

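Use schema drift support in Dataflows Gen2, or in PySpark use the mergeSchema option, which applies when reading Parquet files and when writing to Delta tables (a plain CSV read does not merge schemas). Define default values for missing columns to prevent pipeline failures.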
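
Defining defaults for missing columns could be done along these lines. This is a sketch, not a Fabric feature: the expected-schema dictionary, its column names, and both helper names are assumptions for illustration.

```python
def missing_columns(actual: list[str], expected: dict) -> dict:
    """Return {column: default} for every expected column absent from the file."""
    return {name: default for name, default in expected.items() if name not in actual}


def align_to_schema(df, expected: dict):
    """Add missing expected columns to a Spark DataFrame, filled with their defaults."""
    # Imported here so missing_columns works without a Spark installation.
    from pyspark.sql import functions as F

    for name, default in missing_columns(df.columns, expected).items():
        df = df.withColumn(name, F.lit(default))
    # Fixed column order; columns not in `expected` are dropped (a design choice).
    return df.select(*expected)
```

For example, `align_to_schema(raw_df, {"customer_id": 0, "name": "Unknown", "country": "Unknown"})` would add a `country` column of "Unknown" to a file that arrived without one.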

Cert-Pass Editorial Team

Cloud certification experts helping IT professionals pass their exams with confidence.