Hadoop Developer

Hadoop is an open-source, Java-based programming framework that enables the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of hardware nodes handling thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure. This approach lowers the risk of catastrophic system failure and unexpected data loss.


  • Software developers
  • ETL developers
  • Fresh Graduates

Apache Hadoop

module 1 : Hadoop Installation & Setup

  • Hadoop 2.x Cluster Architecture
  • Federation and High Availability
  • Typical Production Cluster setup
  • Hadoop Cluster Modes
  • Common Hadoop Shell Commands
  • Hadoop 2.x Configuration Files
  • Cloudera Single node cluster
  • Hive
  • Pig
  • Sqoop
  • Flume
  • Scala
  • Spark

module 2 : Understanding HDFS & MapReduce

  • Introduction to Big Data & Hadoop
  • What is Big Data and where does Hadoop fit in?
  • Two important Hadoop ecosystem components, namely MapReduce and HDFS
  • In-depth Hadoop Distributed File System – Replications, Block Size,
    Secondary Name node
  • Hadoop with High Availability
  • In-depth YARN – Resource Manager, Node Manager.
    • Hands-on Exercise –
    • Working with HDFS
    • Replicating the data
    • Determining block size
    • Familiarizing with NameNode and DataNode.
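The replication and block-size bullets above boil down to simple arithmetic. A minimal Python sketch, assuming the Hadoop 2.x defaults of 128 MB blocks and a replication factor of 3:

```python
import math

def hdfs_block_layout(file_size, block_size=128 * 1024 * 1024, replication=3):
    """Return (block count, stored replicas, size of the last block) for a file."""
    blocks = math.ceil(file_size / block_size)
    last_block = file_size - (blocks - 1) * block_size
    return blocks, blocks * replication, last_block

# A 1 GB file under the defaults: 8 blocks, 24 replicas stored across the cluster.
print(hdfs_block_layout(1024 ** 3))  # (8, 24, 134217728)
```

Replication multiplies storage cost, not block count: the NameNode tracks 8 blocks, while the DataNodes together hold 24 block replicas.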

module 3 : Deep Dive in MapReduce

  • Detailed understanding of the working of MapReduce
  • The mapping and reducing process
  • The working of Driver
  • Combiners
  • Partitioners
  • Input Formats
  • Output Formats
  • Shuffle and Sort
    • Hands-on Exercise
    • The detailed methodology for writing the Word Count Program in MapReduce
    • Writing custom partitioner
    • MapReduce with Combiner
    • Local Job Runner Mode
    • Unit Test, ToolRunner
    • Map Side Join
    • Reduce Side Join
    • Using Counters
    • Joining two datasets using Map-Side Join Vs Reduce-Side Join
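The map, shuffle/sort, and reduce phases listed above can be simulated locally. A hedged Python sketch of the Word Count flow with a combiner doing map-side pre-aggregation; this models the data flow only, not Hadoop's actual execution engine:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line.
    for word in line.lower().split():
        yield word, 1

def combiner(pairs):
    # Combiner: pre-aggregate counts on the map side to cut shuffle traffic.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts.items()

def reducer(word, values):
    # Reduce phase: sum all counts that arrived for one key.
    return word, sum(values)

def word_count(lines):
    # Shuffle and sort: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, n in combiner(mapper(line)):
            grouped[word].append(n)
    return {word: reducer(word, values)[1] for word, values in sorted(grouped.items())}

print(word_count(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```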

module 4 : Hadoop Administration – Multi-Node Cluster Setup using Amazon EC2

  • Create a four-node Hadoop cluster setup
  • Running the MapReduce Jobs on the Hadoop cluster
  • Successfully running the MapReduce code
  • Working with the Cloudera Manager setup.
    • Hands-on Exercise
    • The method to build a multi-node Hadoop cluster using Amazon EC2
    • Working with the Cloudera Manager.

module 5 : Hadoop Administration – Cluster Configuration

  • Hadoop configuration
  • Importance of Hadoop configuration file
  • Various parameters and values of configuration
  • HDFS parameters and MapReduce parameters
  • Setting up the Hadoop environment
  • Include and Exclude configuration files
  • Administration and maintenance of Name node
  • Data node directory structures and files
  • File system image and Edit log
    • Hands-on Exercise
    • The method to do performance tuning of MapReduce program.
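Every Hadoop *-site.xml configuration file shares one flat XML shape. A small Python sketch that generates such a file; the parameter values here are illustrative only, since real tuning depends on the cluster:

```python
import xml.etree.ElementTree as ET

# Illustrative values only; real tuning depends on the cluster's hardware.
params = {
    "dfs.replication": "3",        # copies kept per block
    "dfs.blocksize": "134217728",  # 128 MB block size, in bytes
}

def to_hadoop_xml(props):
    # Hadoop *-site.xml files are a flat list of <property><name>/<value> pairs.
    conf = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")

print(to_hadoop_xml(params))
```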

module 6 : Hadoop Administration – Maintenance, Monitoring and Troubleshooting

  • Introduction to the Checkpoint Procedure
  • Namenode failure and how to ensure the recovery procedure
  • Safe Mode
  • Metadata and Data backup
  • Various potential problems and solutions
  • What to look for
  • How to add and remove nodes
    • Hands-on Exercise
    • Ensuring MapReduce filesystem recovery for various scenarios
    • JMX monitoring of the Hadoop cluster
    • How to use the logs and stack traces for monitoring and troubleshooting
    • Using the Job Scheduler for scheduling jobs in the same cluster
    • Getting the MapReduce job submission flow
    • FIFO Scheduler
    • Getting to know the Fair Scheduler and its configuration.
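The difference between the FIFO and Fair Schedulers can be sketched as two orderings of the same task set. A toy Python model; real YARN scheduling also weighs queues, resources, and data locality:

```python
from collections import deque

def fifo_order(jobs):
    # FIFO Scheduler: every task of the oldest job runs before the next job starts.
    return [(name, t) for name, tasks in jobs for t in range(tasks)]

def fair_order(jobs):
    # Fair Scheduler (round-robin sketch): each running job gets one task per round.
    pending = {name: deque(range(tasks)) for name, tasks in jobs}
    order = []
    while any(pending.values()):
        for name, _ in jobs:
            if pending[name]:
                order.append((name, pending[name].popleft()))
    return order

jobs = [("etl", 3), ("adhoc", 2)]
print(fifo_order(jobs))  # etl's 3 tasks first, then adhoc's 2
print(fair_order(jobs))  # tasks interleave: etl, adhoc, etl, adhoc, etl
```

Under FIFO a long ETL job starves the short ad-hoc job; the fair policy lets both make progress.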

module 8 : Hadoop Application Testing

  • Why testing is important
  • Unit testing & Integration testing
  • Performance Testing Diagnostics
  • Nightly QA test
  • Benchmark and end to end tests
  • Functional testing
  • Release certification testing
  • Security testing
  • Scalability testing
  • Commissioning and decommissioning of DataNodes testing
  • Reliability testing
  • Release testing

module 9 : Roles and Responsibilities of Hadoop Testing Professional

  • Understanding the Requirement
  • Preparation of the Testing Estimation
  • Test Cases
  • Test Data
  • Testbed creation
  • Test Execution
  • Defect Reporting
  • Defect Retest
  • Daily Status report delivery
  • Test completion
  • ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs/files/records etc)
  • Using Sqoop/Flume, which includes but is not limited to data verification
  • Reconciliation
  • User Authorization and Authentication testing (Groups, Users, Privileges etc)
  • Report defects to the development team or manager and driving them to closure
  • Consolidate all the defects and create defect reports
  • Validating new features and issues in Core Hadoop.

module 10 : MRUnit Framework for Testing MapReduce Programs

  • Report defects to the development team or manager and driving them to closure
  • Consolidate all the defects and create defect reports
  • Using the MRUnit framework for unit testing of MapReduce programs.
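MRUnit itself is a Java library; as an analogous sketch, the same idea, driving map and reduce functions with known input and asserting exact output, looks like this in Python's unittest (the word-count functions are illustrative stand-ins for code under test):

```python
import unittest

def wordcount_mapper(line):
    # The map function under test: emit a (word, 1) pair per word.
    return [(w, 1) for w in line.split()]

def wordcount_reducer(key, values):
    # The reduce function under test: sum the counts for one key.
    return key, sum(values)

class WordCountTest(unittest.TestCase):
    # MRUnit-style tests: feed known records in, assert exact key/value pairs out.
    def test_mapper(self):
        self.assertEqual(wordcount_mapper("hadoop hive hadoop"),
                         [("hadoop", 1), ("hive", 1), ("hadoop", 1)])

    def test_reducer(self):
        self.assertEqual(wordcount_reducer("hadoop", [1, 1]), ("hadoop", 2))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```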

module 11 : Unit Testing

  • Automation testing using Oozie
  • Data validation using the QuerySurge tool.

module 12 : Test Execution

  • The test plan for HDFS upgrade
  • Test automation and result

module 13 : Test Plan Strategy and writing Test Cases for testing Hadoop Application

  • How to test installation and configuration

Apache HBase

HBase is an open-source, non-relational, distributed database modeled after Google’s Bigtable and written in Java. It provides a fault-tolerant way of storing large quantities of sparse data. HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper. Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro, or Thrift gateway APIs. HBase is a column-oriented key-value data store and has been widely adopted because of its lineage with Hadoop and HDFS. HBase runs on top of HDFS and is well-suited for fast read and write operations on large datasets with high throughput and low input/output latency.

module 14 : HBase Overview

  • Getting started with HBase
  • Core concepts of HBase
  • Understanding HBase with an Example

module 15 : Architecture of NoSQL

  • Why HBase?
  • Where to use HBase?
  • What is NoSQL?

module 16 : HBase Data Modeling

  • HDFS vs. HBase
  • HBase Use Cases
  • Data Modeling HBase
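The HBase data model is essentially a sorted, versioned map: row key, column family, qualifier, timestamp, value. A toy Python model of one table, illustrative only; real HBase adds regions, write-ahead logs, and sorted on-disk storage:

```python
from collections import defaultdict

class ToyHBaseTable:
    # Logical layout: row key -> "family:qualifier" -> timestamp -> value.
    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, value, ts):
        # Column families are fixed at table creation; qualifiers are free-form.
        assert family in self.families, "unknown column family"
        self.rows[row][f"{family}:{qualifier}"][ts] = value

    def get(self, row, family, qualifier):
        # Reads return the newest cell version, as HBase does by default.
        cell = self.rows[row].get(f"{family}:{qualifier}", {})
        return cell[max(cell)] if cell else None

t = ToyHBaseTable(["info"])
t.put("user1", "info", "city", "Pune", ts=1)
t.put("user1", "info", "city", "Delhi", ts=2)
print(t.get("user1", "info", "city"))  # Delhi, the latest version wins
```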

module 17 : HBase Cluster Components

  • HBase Architecture
  • Main components of HBase Cluster

module 18 : HBase API and Advanced Operations

  • HBase Shell
  • HBase API
  • Primary Operations
  • Advanced Operations

module 19 : Integration of Hive with HBase

  • Create a Table and Insert Data into it
  • Integration of Hive with HBase
  • Load Utility

module 20 : File Loading with the Load Utility

  • Putting a folder into the VM
  • File loading with the load utility

module 21 : Apache Pig

  • Apache Pig introduction and its various features
  • Various data types and schema in Pig
  • Available functions in Pig
  • Pig Bags
  • Tuples

Apache Hive

Apache Hive is data warehouse software built on top of Apache Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

module 22 : Introduction to Hive

  • Introducing Hadoop Hive
  • The detailed architecture of Hive
  • Comparing Hive with Pig and RDBMS
  • Working with Hive Query Language
  • Creation of databases and tables, group by and other clauses, and the various types of Hive tables
  • HCatalog
  • Storing the Hive Results
  • Hive partitioning and Buckets
  • Indexing in Hive
  • The Map-side Join in Hive
  • Working with complex data types
  • Hive User-defined Functions
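Hive partitioning maps each partition value to a directory under the table's warehouse path, so queries that filter on the partition column can skip whole directories. A small Python sketch of that layout rule; the warehouse path and table name here are hypothetical:

```python
def partition_path(warehouse, table, partitions):
    # Hive lays out one directory per partition key=value pair under the table
    # directory, e.g. .../sales/year=2023/month=01.
    parts = [f"{key}={value}" for key, value in partitions]
    return "/".join([warehouse, table] + parts)

p = partition_path("/user/hive/warehouse", "sales", [("year", "2023"), ("month", "01")])
print(p)  # /user/hive/warehouse/sales/year=2023/month=01
```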

module 23 : Apache Impala

  • Introduction to Impala
  • Comparing Hive with Impala
  • The detailed architecture of Impala

module 24 : Apache Flume

  • Introduction to Flume and its Architecture

Apache Kafka

Apache Kafka is an open-source stream-processing software platform written in Scala and Java. It aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a “massively scalable pub/sub message queue architected as a distributed transaction log”, making it highly valuable for enterprise infrastructures that process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream-processing library.
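The "distributed transaction log" at Kafka's core can be modeled as an append-only list with offsets. A toy single-partition sketch; real Kafka adds partitioning, replication, retention, and consumer groups:

```python
class ToyTopicPartition:
    # Sketch of Kafka's storage model: an append-only log in which every record
    # gets a monotonically increasing offset.
    def __init__(self):
        self.log = []

    def produce(self, record):
        self.log.append(record)
        return len(self.log) - 1  # the new record's offset

    def consume(self, offset):
        # Reads are non-destructive: any consumer can re-read from any offset.
        return self.log[offset:]

p = ToyTopicPartition()
p.produce("order-created")   # offset 0
p.produce("order-shipped")   # offset 1
print(p.consume(0))  # ['order-created', 'order-shipped']
print(p.consume(1))  # ['order-shipped']
```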

module 25 : What is Kafka – An Introduction

  • Understanding what Apache Kafka is
  • The various components and use cases of Kafka
  • Implementing Kafka on a single node.

module 26 : Multi-Broker Kafka Implementation

  • Learning about the Kafka terminology
  • Deploying single node Kafka with independent Zookeeper
  • Adding replication in Kafka
  • Working with Partitioning and Brokers
  • Understanding Kafka consumers
  • Kafka Writes terminology
  • Various failure handling scenarios in Kafka.

module 27 : Multi-Node Cluster Setup

  • Introduction to multi-node cluster setup in Kafka
  • Various administration commands
  • Leadership balancing and partition rebalancing
  • Graceful shutdown of Kafka Brokers and tasks
  • Working with the Partition Reassignment Tool
  • Cluster expansion
  • Assigning Custom Partitions
  • Removing a Broker and improving the Replication Factor of Partitions.
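Partition reassignment amounts to recomputing a partition-to-broker mapping. A round-robin Python sketch of the placement idea; the real reassignment tool also migrates replicas and preserves the data:

```python
def assign_partitions(partitions, brokers):
    # Round-robin placement: spread topic partitions evenly across brokers.
    return {p: brokers[i % len(brokers)] for i, p in enumerate(partitions)}

before = assign_partitions(range(6), ["broker1", "broker2", "broker3"])
# Removing broker3 forces its partitions onto the remaining brokers:
after = assign_partitions(range(6), ["broker1", "broker2"])
print(before[2], "->", after[2])  # broker3 -> broker1
```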

module 28 : Integrate Flume with Kafka

  • Understanding the need for Kafka Integration
  • Successfully integrating it with Apache Flume
  • Steps in the integration of Flume with Kafka as a Source.

module 29 : Kafka API

  • Detailed understanding of the Kafka and Flume Integration
  • Deploying Kafka as a Sink and as a Channel
  • Introduction to PyKafka API
  • Setting up the PyKafka Environment.

module 30 : Producers & Consumers

  • Connecting Kafka using PyKafka
  • Writing your own Kafka Producers and Consumers
  • Writing a random JSON Producer
  • Writing a Consumer to read the messages on a topic
  • Writing and working with a File Reader Producer
  • Writing a Consumer to store topics data into a file.
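PyKafka needs a running broker, so as a local stand-in, here is a hedged Python sketch of the producer/consumer exercises above against an in-memory topic. The InMemoryTopic class is hypothetical; with PyKafka you would obtain a real topic and use its get_producer() and get_simple_consumer():

```python
import json
import random

class InMemoryTopic:
    # Local stand-in for a Kafka topic so the pattern runs without a broker.
    def __init__(self):
        self.messages = []

    def produce(self, payload):
        # Kafka producers send bytes; we keep the same convention here.
        self.messages.append(payload)

def random_json_producer(topic, n, rng):
    # Mirrors the "random JSON Producer": serialize a dict and publish it.
    for i in range(n):
        record = {"id": i, "value": rng.randint(0, 100)}
        topic.produce(json.dumps(record).encode("utf-8"))

def consume_all(topic):
    # Mirrors the consumer: deserialize every message on the topic.
    return [json.loads(m) for m in topic.messages]

topic = InMemoryTopic()
random_json_producer(topic, 3, random.Random(42))
print(consume_all(topic))
```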

Course Features

  • Duration 50 hours
  • Skill level All levels
  • Language English
  • Assessments Yes
