Course Outline
Introduction
- History and core concepts of Hadoop
- The Hadoop Ecosystem
- Hadoop Distributions
- High-level architecture
- Common Hadoop myths
- Challenges of Hadoop (hardware and software)
- Labs: Discussing your Big Data projects and challenges
Planning and installation
- Choosing software and Hadoop distributions
- Sizing the cluster and planning for future growth
- Selecting hardware and network configurations
- Rack topology
- Installation procedures
- Multi-tenancy
- Directory structures and logs
- Benchmarking
- Labs: Installing the cluster and running performance benchmarks
HDFS operations
- Core concepts: horizontal scaling, replication, data locality, and rack awareness
- Nodes and daemons: NameNode, Secondary NameNode, HA Standby NameNode, DataNode
- Health monitoring
- Command-line and browser-based administration
- Adding storage and replacing defective drives
- Labs: Getting familiar with the HDFS command line
Data ingestion
- Using Flume to ingest logs and other data into HDFS
- Using Sqoop for importing data from SQL databases to HDFS, as well as exporting back to SQL
- Implementing Hadoop data warehousing with Hive
- Copying data between clusters (distcp)
- Utilizing S3 as a complement to HDFS
- Best practices and architectures for data ingestion
- Labs: Setting up and utilizing Flume and Sqoop
MapReduce operations and administration
- Parallel computing prior to MapReduce: comparing HPC with Hadoop administration
- Managing MapReduce cluster loads
- Nodes and daemons: JobTracker and TaskTracker
- Walkthrough of the MapReduce UI
- MapReduce configuration
- Job configuration
- Optimizing MapReduce performance
- Ensuring robustness in MapReduce: guidance for programmers
- Labs: Running MapReduce examples
YARN: New architecture and capabilities
- Design goals and implementation architecture of YARN
- New components: ResourceManager, NodeManager, and ApplicationMaster
- Installing YARN
- Job scheduling under YARN
- Labs: Investigating job scheduling
Advanced topics
- Hardware monitoring
- Cluster monitoring
- Adding and removing servers, and upgrading Hadoop
- Backup, recovery, and business continuity planning
- Oozie job workflows
- Hadoop High Availability (HA)
- HDFS Federation
- Securing your cluster with Kerberos
- Labs: Setting up monitoring
Optional tracks
- Cloudera Manager for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Cloudera distribution environment (CDH5).
- Ambari for cluster administration, monitoring, and routine tasks; installation and usage. In this track, all exercises and labs are conducted within the Ambari cluster manager and Hortonworks Data Platform (HDP 2.0).
Requirements
- Proficiency in basic Linux system administration
- Basic scripting skills
Prior knowledge of Hadoop and distributed computing is not required; both are introduced and explained during the course.
Lab environment
Zero Install: There is no need to install Hadoop software on your personal machines! A functional Hadoop cluster will be provided for student use.
Students will need the following:
- An SSH client (Linux and Mac systems come with SSH clients pre-installed; for Windows, PuTTY is recommended)
- A browser to access the cluster. We recommend using Firefox with the FoxyProxy extension installed.
21 Hours
Testimonials (1)
Hands-on exercises. Class should have been 5 days, but the 3 days helped to clear up a lot of questions that I had from working with NiFi already.