Course Outline
Introduction
- History and core concepts of Hadoop
- The Hadoop ecosystem
- Overview of distributions
- High-level architecture
- Common misconceptions about Hadoop
- Hadoop challenges (hardware and software)
- Labs: Discussion of participants' Big Data projects and challenges
Planning and Installation
- Selecting software and Hadoop distributions
- Cluster sizing and growth planning
- Hardware and network selection
- Rack topology considerations
- Installation procedures
- Multi-tenancy management
- Directory structure and log management
- Benchmarking techniques
- Labs: Installing the cluster and running performance benchmarks
HDFS Operations
- Core concepts (horizontal scaling, replication, data locality, rack awareness)
- Nodes and daemons (NameNode, Secondary NameNode, HA Standby NameNode, DataNode)
- Health monitoring strategies
- Command-line and browser-based administration
- Expanding storage and replacing defective drives
- Labs: Getting familiar with the HDFS command line
Data Ingestion
- Using Flume for log and data ingestion into HDFS
- Using Sqoop to import data from SQL databases into HDFS and export it back to SQL
- Data warehousing with Hive
- Transferring data between clusters using distcp
- Integrating S3 as a complement to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and using Flume and Sqoop
MapReduce Operations and Administration
- Evolution of parallel computing: Comparing HPC with Hadoop administration
- Managing MapReduce cluster loads
- Nodes and daemons (JobTracker, TaskTracker)
- Guided tour of the MapReduce UI
- MapReduce configuration
- Job configuration
- Strategies for optimizing MapReduce
- Preventing issues: Guidance for developers
- Labs: Executing MapReduce examples
YARN: New Architecture and Capabilities
- YARN design objectives and implementation architecture
- Key components: ResourceManager, NodeManager, ApplicationMaster
- Installing YARN
- Job scheduling within YARN
- Labs: Investigating job scheduling mechanisms
Advanced Topics
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, and upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop High Availability (HA)
- Hadoop Federation
- Securing the cluster with Kerberos
- Labs: Configuring monitoring systems
Optional Tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Cloudera distribution environment (CDH5).
- Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are performed using the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
Requirements
- Familiarity with basic Linux system administration
- Basic scripting proficiency
Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained during the course.
Lab Environment
Zero Installation Required: Students do not need to install Hadoop software on their personal machines. A fully functional Hadoop cluster will be provided for use during the session.
Participants must have the following tools:
- An SSH client (Linux and Mac systems come with built-in SSH clients; for Windows, PuTTY is recommended)
- A web browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.
21 Hours
Testimonials (1)
Hands on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already