Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDD)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies
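
As a rough illustration of the transformation/action distinction listed above, the following sketch uses plain Python rather than Spark itself. The `LazyDataset` class and its methods are hypothetical names made up for this example; they only mimic the idea that transformations record work (a simple lineage) while actions trigger the computation. It is a toy model under those assumptions, not how Spark is implemented.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a Spark class).

    Transformations only record what should happen; an action runs
    the recorded steps over the source data.
    """

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection, e.g. a list
        self._steps = tuple(steps)   # recorded (kind, function) pairs

    # Transformations: return a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps as lazy iterators over the source.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: force evaluation and return a concrete result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually touches


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = list(calls)  # still empty: no action has run yet
result = even_squares.collect()    # the action triggers the whole chain
```

In this sketch `square` is not called until `collect()` runs, which loosely mirrors how Spark defers work until an action is invoked.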

Foundational Skills via Databricks (Hands-on Workshop):

  • Exercises with the RDD API
  • Basic transformation and action functions
  • PairRDDs
  • Join operations
  • Strategies for caching
  • Exercises with the DataFrame API
  • Spark SQL
  • DataFrame operations: select, filter, group, and sort
  • User-Defined Functions (UDF)
  • Exploration of the DataFrame API
  • Streaming capabilities

Deployment in AWS (Hands-on Workshop):

  • Overview of AWS Glue
  • Comparing Amazon EMR and AWS Glue
  • Example job implementations in both environments
  • Analyzing pros and cons

Additional Topics:

  • Introduction to Apache Airflow for orchestration

Requirements

  • Programming skills (preferably in Python or Scala)
  • Basic knowledge of SQL

Duration: 21 hours
