Course Outline

Introduction:

  • Apache Spark within the Hadoop Ecosystem
  • Brief overview of Python and Scala

Core Concepts (Theory):

  • Architecture
  • Resilient Distributed Datasets (RDDs)
  • Transformations and Actions
  • Stages, Tasks, and Dependencies

Mastering the Basics via Databricks (Hands-on Workshop):

  • RDD API exercises
  • Fundamental action and transformation functions
  • PairRDDs
  • Join operations
  • Caching strategies
  • DataFrame API exercises
  • Spark SQL
  • DataFrame operations: select, filter, group, and sort
  • User-Defined Functions (UDFs)
  • Introduction to the Dataset API
  • Streaming capabilities

Understanding Deployment via AWS (Hands-on Workshop):

  • Fundamentals of AWS Glue
  • Differences between Amazon EMR and AWS Glue
  • Sample jobs on both platforms
  • Analysis of pros and cons for each service

Additional Topics:

  • Introduction to Apache Airflow orchestration

Requirements

  • Programming proficiency (preferably in Python or Scala)
  • Foundational knowledge of SQL

Duration: 21 hours
