Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, alignment with participant profiles, and success criteria
- High-level migration approaches and risk considerations
- Setting up workspaces, repositories, and lab datasets
Day 1 — Migration Fundamentals and Architecture
- Core Lakehouse concepts, Delta Lake overview, and Databricks architecture
- Differences between symmetric multiprocessing (SMP) and massively parallel processing (MPP), and their implications for migration
- Medallion (Bronze→Silver→Gold) design and an overview of Unity Catalog
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure into a notebook
- Mapping temporary tables and cursors to DataFrame transformations
- Validation and comparison with the original output
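One way the validation step in this lab might look is sketched below: the rows returned by the original stored procedure and by the migrated notebook are compared as multisets, so row order does not matter but duplicate counts do. The helper name `diff_rows` and the sample rows are hypothetical and serve only as illustration; in the lab the rows would come from collected query results.

```python
from collections import Counter


def diff_rows(expected, actual):
    """Compare two result sets as multisets of rows, ignoring row order."""
    exp = Counter(map(tuple, expected))
    act = Counter(map(tuple, actual))
    return {
        # Rows (with multiplicity) present in the legacy output only.
        "missing": sorted((exp - act).elements()),
        # Rows (with multiplicity) present in the migrated output only.
        "unexpected": sorted((act - exp).elements()),
    }


# Hypothetical sample data: the legacy output contains a duplicated row
# that the migrated output has lost, plus one extra migrated row.
legacy_rows = [(1, "EU", 120.0), (2, "US", 80.5), (2, "US", 80.5)]
migrated_rows = [(2, "US", 80.5), (1, "EU", 120.0), (3, "APAC", 10.0)]

report = diff_rows(legacy_rows, migrated_rows)
print(report)
```

Sorting the differences assumes the rows hold mutually comparable values; columns containing nulls would need a custom sort key.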
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, the transaction (commit) log, versioning, and time travel
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution
- OPTIMIZE, VACUUM, Z-ordering (ZORDER BY), partitioning, and storage tuning techniques
Day 2 Lab — Incremental Ingestion & Optimization
- Implementing Auto Loader ingestion and MERGE workflows
- Applying OPTIMIZE, Z-ordering (ZORDER BY), and VACUUM; validating results
- Measuring read/write performance improvements
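The upsert behaviour that the MERGE workflow in this lab relies on can be illustrated without a cluster. The sketch below is plain Python, not the Delta Lake API: it models "update when matched, insert when not matched" on lists of dictionaries. The function `merge_upsert` and the sample tables are hypothetical, and resolving duplicate keys within a batch as last-write-wins is an assumption of this sketch.

```python
def merge_upsert(target, updates, key):
    """Return a new table: rows matched on `key` are updated, others inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        # Matched keys are overwritten column by column; new keys are inserted.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical target table and incoming incremental batch.
target = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
batch = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]

result = merge_upsert(target, batch, "id")
print(result)

# Re-applying the same batch changes nothing, which is the property that
# makes a retried incremental load safe.
rerun = merge_upsert(result, batch, "id")
print(rerun == result)
```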
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, and JSON/array handling
- Interpreting the Spark UI, DAGs, shuffles, stages, tasks, and diagnosing bottlenecks
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactoring a heavy SQL process into optimized Spark SQL
- Using Spark UI traces to identify and fix skew and shuffle issues
- Benchmarking before/after scenarios and documenting tuning steps
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies
- Transforming loops and cursors into vectorized DataFrame operations
- Modularization, UDFs/pandas UDFs, widgets, and creating reusable libraries
Day 4 Lab — Refactoring Procedural Scripts
- Refactoring a procedural ETL script into modular PySpark notebooks
- Introducing parametrization, unit-style tests, and reusable functions
- Code review and application of best-practice checklists
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling
- Designing incremental Medallion pipelines with quality rules and schema validation
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assembling a Bronze→Silver→Gold pipeline orchestrated with Workflows
- Implementing logging, auditing, retries, and automated validations
- Running the full pipeline, validating outputs, and preparing deployment notes
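The logging and retry bullet in this lab could be prototyped along the following lines. This is a minimal sketch in plain Python under stated assumptions: `run_with_retries`, the step name `bronze_ingest`, and the simulated `flaky_step` are all hypothetical, and a scheduled job would more likely combine in-code logging like this with the retry settings of the orchestrator itself.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(step, name, max_attempts=3, base_delay=0.0):
    """Run `step`, logging every attempt and retrying with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = step()
            log.info("step=%s attempt=%d status=ok", name, attempt)
            return result
        except Exception as exc:
            log.warning(
                "step=%s attempt=%d status=failed error=%s", name, attempt, exc
            )
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Simulated step that fails twice before succeeding (illustration only).
calls = {"n": 0}


def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"


outcome = run_with_retries(flaky_step, "bronze_ingest")
print(outcome, calls["n"])
```

The delay defaults to zero so the sketch runs instantly; a non-zero `base_delay` gives waits of `base_delay`, `2 * base_delay`, and so on between attempts.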
Operationalization, Governance, and Production Readiness
- Unity Catalog governance, lineage, and access-control best practices
- Cost management, cluster sizing, autoscaling, and job concurrency patterns
- Deployment checklists, rollback strategies, and runbook creation
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned
- Gap analysis, recommended follow-up activities, and handoff of training materials
- References, further learning paths, and support options
Requirements
- A solid understanding of data engineering concepts
- Experience with SQL and stored procedures (Synapse / SQL Server)
- Familiarity with ETL orchestration concepts (ADF or similar tools)
Audience
- Technology managers with a background in data engineering
- Data engineers migrating procedural OLAP logic to Lakehouse patterns
- Platform engineers responsible for driving Databricks adoption