Hadoop Introduction

Why is Hadoop needed?

  • Social network and e-commerce websites track customer behavior on the site and serve relevant information and products.
  • Any global bank today has more than 100 million customers doing billions of transactions every month.

Traditional systems find it difficult to cope with this scale at the required pace in a cost-efficient manner.

This is where big data platforms come in. In this article, we introduce you to the mesmerizing world of Hadoop. Hadoop comes in handy when we deal with enormous data. It may not make the process faster, but it gives us the capability to use parallel processing to handle big data. In short, Hadoop gives us the capability to deal with the complexities of high volume, velocity, and variety of data (popularly known as the 3Vs).



Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop is a complete ecosystem of open source projects that provides us the framework to deal with big data.


Pre-installation setup

Step 1: Plan your cluster













The NameNode:

  • Manages the file system namespace.
  • Regulates clients' access to files.
  • Executes file system operations such as renaming, closing, and opening files and directories.



The DataNodes:

  • Perform read-write operations on the file system, as per client requests.
  • Perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.

Step 2: Hadoop requirements

You must make sure that the minimum hardware and software requirements are met.

Hardware requirements

Before you install Hadoop, you must make sure that minimum hardware requirements are met.

Minimum hardware requirements for the namenode node:

  • 25GB free disk space.
  • 2GB of physical memory (RAM).
  • 2 statically configured Ethernet interfaces.

Minimum hardware requirements for datanode nodes:

  • 25 GB free disk space.
  • 2 GB of physical memory (RAM).
  • 2 statically configured Ethernet interfaces.

Software requirements

One of the following operating systems is required:

  • Red Hat Enterprise Linux (RHEL) 6.5 x86 (64-bit).
  • CentOS 6.8 x86 (64-bit).


Step 4: Hadoop environment

Configure the Interfaces

Each node in the cluster needs two Network Interface Cards (NICs); these are commonly created while installing the operating system.



[root@node01 ~]# ifconfig eth0

eth0      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:2A
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20c:29ff:fec6:ca2a/64 Scope:Link
          RX packets:84 errors:0 dropped:0 overruns:0 frame:0
          TX packets:43 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7891 (7.7 KiB)  TX bytes:8067 (7.8 KiB)

[root@node01 ~]# ifconfig eth1

eth1      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:34
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::20c:29ff:fec6:ca34/64 Scope:Link
          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2651 (2.5 KiB)  TX bytes:636 (636.0 b)


Network configuration

Here we assign IP addresses to all the interfaces.

#vim /etc/sysconfig/network-scripts/ifcfg-eth0















#vim /etc/sysconfig/network-scripts/ifcfg-eth1













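The ifcfg files above would typically contain static settings along the following lines. This is an illustrative sketch only: all addresses are placeholders, not values taken from this cluster.

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 (example; addresses are placeholders)
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
GATEWAY=192.168.1.1
```

After editing, restart networking (e.g. `service network restart`) so the new settings take effect.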

Stop the Firewall

#service iptables stop

#chkconfig iptables off

#iptables -F

#service ip6tables stop

#chkconfig ip6tables off

#ip6tables -F

#service NetworkManager stop

#chkconfig NetworkManager off


Step 5: Setup Hostname and FQDN

In this step we assign hostnames to all nodes and create Fully Qualified Domain Names (FQDNs).

  • Get the nodes properly on the network.
  • Ensure that the hostnames and IP addresses are correct and correctly recorded in DNS entries and/or hosts entries.
  • Edit the /etc/hosts file on all master and slave servers.


[root@node01 ~]# cat /etc/hosts

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
            namenode01.unixadmin.in    namenode01
            namenode02.unixadmin.in    namenode02
            datanode01.unixadmin.in    datanode01
            datanode02.unixadmin.in    datanode02



Step 6: Configure SSH auto login

  • SSH setup is required to perform cluster operations such as starting and stopping the distributed daemons. To authenticate the different users of Hadoop, a public/private key pair must be generated for the Hadoop user and shared with the other nodes.
  • Verify node-to-node rsh/ssh communication using the nodes' hostnames as well as their fully qualified domain names (FQDN), including rsh/ssh to self.


[root@namenode01 ~]# ssh-keygen

Generating public/private rsa key pair.

Enter file in which to save the key (/root/.ssh/id_rsa):

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /root/.ssh/id_rsa.

Your public key has been saved in /root/.ssh/id_rsa.pub.

The key fingerprint is:

10:a9:10:7e:50:3c:db:f6:10:43:17:87:de:9b:18:cc root@bridge.unixadmin.in

The key's randomart image is:

+--[ RSA 2048]----+
|  o+..o.oo.      |
| ...o +o..       |
|  ...=.* .       |
|   .o +.E .      |
|     . oSo o     |
|        o o      |
|                 |
|                 |
|                 |
+-----------------+



[root@namenode01 ~]# ssh-copy-id localhost

The authenticity of host 'localhost (::1)' can't be established.

RSA key fingerprint is cf:a1:ab:1c:be:a6:2b:ba:94:64:db:df:bd:52:a5:67.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

root@localhost's password:

Now try logging into the machine, with "ssh 'localhost'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.


[root@namenode01 ~]# ssh-copy-id

root@'s password:

Now try logging into the machine, with "ssh ''", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.


Java Installation:

While installing Java from the RPM file I ran into issues many times. I then found a better way: installing Java from the archive on the Oracle site. Using the steps below I have installed Java successfully many times without any issues, and we can also easily install multiple versions of Java if required.

Downloading the latest Java archive:

[root@namenode01 ~]# wget "http://download.oracle.com/otn-pub/java/jdk/8u101-b14/jdk-8u101-linux-x64.tar.gz"

# tar -xvzf jdk-8u101-linux-x64.tar.gz

Install Java with alternatives:

After extracting the Java archive, we just need to set the newer version of Java as the default using alternatives.

[root@namenode01 ~]# cd /app/jdk1.8.0_101/


[root@namenode01 ~]# alternatives --install /usr/bin/java java /app/jdk1.8.0_101/bin/java 2

[root@namenode01 ~]# alternatives --config java


There are 4 programs which provide 'java'.

Selection    Command


*  1           /opt/jdk1.7.0_71/bin/java

 + 2           /opt/jdk1.8.0_45/bin/java

   3           /opt/jdk1.8.0_91/bin/java

   4           /app/jdk1.8.0_101/bin/java


Enter to keep the current selection[+], or type selection number: 4


At this point Java 8 has been successfully installed on your system. We also recommend setting up the javac and jar command paths using alternatives:

[root@namenode01 ~]# alternatives --install /usr/bin/jar jar /app/jdk1.8.0_101/bin/jar 2

[root@namenode01 ~]# alternatives --install /usr/bin/javac javac /app/jdk1.8.0_101/bin/javac 2

[root@namenode01 ~]# alternatives --set jar /app/jdk1.8.0_101/bin/jar

[root@namenode01 ~]# alternatives --set javac /app/jdk1.8.0_101/bin/javac


Check Installed Java Version:

[root@namenode01 ~]# java -version

java version "1.8.0_101"

Java(TM) SE Runtime Environment (build 1.8.0_101-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.101-b14, mixed mode)


Configuring Environment Variables:


  • Setup JAVA_HOME variable


[root@namenode01 ~]# vim /root/.bashrc

export JAVA_HOME=/app/jdk1.8.0_101


  • Setup JRE_HOME Variable
[root@namenode01 ~]# vim /root/.bashrc

export JRE_HOME=/app/jdk1.8.0_101/jre


  • Setup PATH Variable
[root@namenode01 ~]# vim /root/.bashrc

export PATH=$PATH:/app/jdk1.8.0_101/bin:/app/jdk1.8.0_101/jre/bin
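Taken together, the Java-related entries appended to /root/.bashrc look like the following sketch. The paths assume the JDK archive was extracted to /app/jdk1.8.0_101, as used elsewhere in this guide; adjust them to your own layout.

```shell
# Java environment variables for /root/.bashrc
# (assumes the JDK was extracted to /app/jdk1.8.0_101)
export JAVA_HOME=/app/jdk1.8.0_101
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin

# Quick sanity check that the variables expand as expected
echo "$JAVA_HOME"
echo "$JRE_HOME"
```

Deriving JRE_HOME and PATH from JAVA_HOME keeps all three in sync if the JDK is ever moved or upgraded.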


Installation of hadoop-2.7.3

Download the latest available Hadoop version from its official site, on the Hadoop namenode only.

Create a directory and extract the Hadoop tar file into it using the tar -xvzf command.

[root@namenode01 ~]# mkdir /app

[root@namenode01 ~]# cd /app

[root@namenode01 ~]# wget http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

[root@namenode01 ~]# tar -xvzf hadoop-2.7.3.tar.gz


Setup environment variables:

  • Set the environment variables in the /root/.bashrc configuration file.
[root@namenode01 ~]# vim /root/.bashrc


export HADOOP_HOME=/app/hadoop-2.7.3





export HADOOP_CONF_DIR=/app/hadoop-2.7.3/etc/hadoop





Export the variables:

  • Now apply all the changes to the currently running system by sourcing the /root/.bashrc configuration file.
[root@namenode01 ~]# source /root/.bashrc
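A consolidated view of the Hadoop entries in /root/.bashrc is sketched below. Only HADOOP_HOME and HADOOP_CONF_DIR appear above; extending PATH with the bin and sbin directories is a common convention (an assumption here) so that commands like start-dfs.sh resolve without full paths.

```shell
# Hadoop environment variables for /root/.bashrc
export HADOOP_HOME=/app/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# PATH addition is a common convention, not shown in the original steps
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# Sanity check that the variables expand as expected
echo "$HADOOP_HOME"
echo "$HADOOP_CONF_DIR"
```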


Edit configuration files:


          core-site.xml is the configuration file in Hadoop where you keep all your HDFS-related configuration.

E.g.: the namenode host and port, the local directory where namenode-related data is saved, etc.

[root@namenode01 ~]# vim core-site.xml








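A minimal core-site.xml for this cluster might look like the following sketch. The URI hdfs://namenode01.unixadmin.in:9000 is an assumption based on the hostnames configured above; the port is illustrative.

```xml
<configuration>
  <!-- Default file system URI; host/port are illustrative assumptions -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode01.unixadmin.in:9000</value>
  </property>
</configuration>
```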
  • The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes. Here we can configure hdfs-site.xml to specify the default block replication and permission checking on HDFS.
  • The actual number of replications can also be specified when a file is created.
  • The default is used if replication is not specified at create time.
[root@namenode01 ~]#vim hdfs-site.xml




<name>dfs.namenode.name.dir</name>













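An hdfs-site.xml along these lines would cover the properties discussed above. The replication factor and storage directories are illustrative assumptions, not values taken from this cluster.

```xml
<configuration>
  <!-- Default block replication; 2 is an illustrative choice for a 2-datanode cluster -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- NameNode metadata directory (illustrative path) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///app/hadoop-2.7.3/hdfs/namenode</value>
  </property>
  <!-- DataNode block storage directory (illustrative path) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///app/hadoop-2.7.3/hdfs/datanode</value>
  </property>
</configuration>
```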

  • The mapred-site.xml file contains the configuration settings for the MapReduce daemons:
    • The job tracker
    • The task-tracker
[root@namenode01 ~]#vim mapred-site.xml







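In Hadoop 2.x the usual minimal mapred-site.xml simply points MapReduce at YARN, as sketched below (the file is typically created by copying mapred-site.xml.template if it does not exist):

```xml
<configuration>
  <!-- Run MapReduce jobs on YARN rather than the classic framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```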

YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, restarting them in case of failure, and other housekeeping. Just like HDFS, it has two main components:

  • A ResourceManager, which keeps track of the cluster resources, and
  • A NodeManager on each of the nodes, which communicates with the ResourceManager and sets up containers for the execution of tasks.



 [root@namenode01 ~]# vim yarn-site.xml























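A yarn-site.xml along the following lines would wire the NodeManagers to the ResourceManager. The ResourceManager hostname namenode01.unixadmin.in is an assumption based on the hostnames configured earlier.

```xml
<configuration>
  <!-- Auxiliary service needed for MapReduce shuffle on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- ResourceManager host (assumed to run on the namenode here) -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode01.unixadmin.in</value>
  </property>
</configuration>
```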

HDFS Operations

Format namenode:

Format the configured HDFS file system: open the namenode (HDFS server) and execute the following command.

[root@namenode01 ~]# hdfs namenode -format


Now run start-dfs.sh script:

After formatting the namenode, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.

[root@namenode01 ~]# start-dfs.sh

Now run start-yarn.sh script:

After starting the DFS, start YARN. The following command will start the ResourceManager as well as the NodeManagers on the cluster.

[root@namenode01 ~]# start-yarn.sh

