HADOOP

Hadoop Introduction

Why hadoop is needed?

  • Social network and e-commerce websites track customer behaviour on the site and then serve relevant information and products.
  • Any global bank today has more than 100 million customers doing billions of transactions every month.

Traditional systems find it difficult to cope with this scale at the required pace in a cost-efficient manner.

This is where big data platforms come to help. In this article, we introduce you to the mesmerizing world of Hadoop. Hadoop comes in handy when we deal with enormous data. It may not make the process faster, but it gives us the capability to use parallel processing to handle big data. In short, Hadoop gives us the capability to deal with the complexities of high volume, velocity and variety of data (popularly known as the 3Vs).

 

Introduction:

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

Hadoop is a complete ecosystem of open source projects that provides the framework to deal with big data.

 

 Pre-installation setup

Step 1: Plan your cluster

  • Namenode01
  • Namenode02
  • Namenode03
  • Secondarynamenode
  • Datanode01
  • Datanode02
  • Datanode03
  • Datanode04
  • Datanode05
  • Datanode06

 

Namenode:

  • Manages the file system namespace.
  • Regulates client access to files.
  • Executes file system operations such as renaming, closing, and opening files and directories.

 

Datanode:

  • Datanodes perform read-write operations on the file system, as per client requests.
  • They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode (see the example after this list).
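
Once the cluster built in the following steps is up and running, this division of labour can be seen directly: the namenode answers the metadata query while the datanodes hold the actual blocks. A small illustration (test.txt is just an example file name):

[root@namenode01 ~]# hdfs dfs -put test.txt /

[root@namenode01 ~]# hdfs fsck /test.txt -files -blocks -locations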

Step 2: Hadoop requirements.

You must make sure that the minimum hardware and software requirements are met.

Hardware requirements

Before you install Hadoop, make sure that the minimum hardware requirements below are met on every node; a quick check follows the lists.

Minimum hardware requirements for the namenode:

  • 25 GB free disk space.
  • 2 GB of physical memory (RAM).
  • 2 statically configured Ethernet interfaces.

Minimum hardware requirements for the datanodes:

  • 25 GB free disk space.
  • 2 GB of physical memory (RAM).
  • 2 statically configured Ethernet interfaces.
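
A quick way to confirm these figures on each node before going further (standard Linux commands, shown here on the namenode):

[root@namenode01 ~]# df -h

[root@namenode01 ~]# free -m

[root@namenode01 ~]# ifconfig -a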

Software requirements

One of the following operating systems is required; the installed release can be verified as shown after the list.

  • Red Hat Enterprise Linux (RHEL) 6.5 x86_64 (64-bit).
  • CentOS 6.8 x86_64 (64-bit).
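
The installed release and architecture can be checked as follows:

[root@namenode01 ~]# cat /etc/redhat-release

[root@namenode01 ~]# uname -m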

 

Step 4: Hadoop environment

Configure the Interfaces

Every node in the cluster needs two Network Interface Cards (NICs), which are usually created while installing the operating system; their current settings can be checked with ifconfig, as shown below.

#ifcfg-eth0

 

[root@node01 ~]# ifconfig eth0

eth0      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:2A

inet addr:192.168.1.21  Bcast:192.168.1.255  Mask:255.255.255.0

inet6 addr: fe80::20c:29ff:fec6:ca2a/64 Scope:Link

UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

RX packets:84 errors:0 dropped:0 overruns:0 frame:0

TX packets:43 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:7891 (7.7 KiB)  TX bytes:8067 (7.8 KiB)

 

#ifcfg-eth1

 

[root@node01 ~]# ifconfig eth1

eth1      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:34

inet addr:192.168.2.21  Bcast:192.168.2.255  Mask:255.255.255.0

inet6 addr: fe80::20c:29ff:fec6:ca34/64 Scope:Link

UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

RX packets:38 errors:0 dropped:0 overruns:0 frame:0

TX packets:10 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1000

RX bytes:2651 (2.5 KiB)  TX bytes:636 (636.0 b)

 

Network configuration

Here we assign static IP addresses to all interfaces by editing their ifcfg files.

#vim /etc/sysconfig/network-scripts/ifcfg-eth0

 

DEVICE=eth0

HWADDR=00:0c:29:c6:ca:2a

TYPE=Ethernet

UUID=f9ca22db-53c2-415f-9b20-979176564aa9

ONBOOT=no

NM_CONTROLLED=yes

BOOTPROTO=none

IPADDR=192.168.1.21

NETMASK=255.255.255.0

IPV6INIT=no

USERCTL=no

 

 

#vim /etc/sysconfig/network-scripts/ifcfg-eth1

 

DEVICE=eth1

HWADDR=00:0c:29:c6:ca:34

TYPE=Ethernet

UUID=b65b93be-7086-40bf-8ddc-3515464de764

ONBOOT=no

NM_CONTROLLED=yes

BOOTPROTO=none

IPADDR=192.168.2.21

NETMASK=255.255.255.0

IPV6INIT=no

USERCTL=no
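
After saving the ifcfg files, restart the network service so the addresses take effect and verify them (a minimal check, assuming the standard SysV init scripts of RHEL/CentOS 6):

[root@node01 ~]# service network restart

[root@node01 ~]# ifconfig eth0 | grep "inet addr"

[root@node01 ~]# ifconfig eth1 | grep "inet addr"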

 

Stop the Firewall

#service iptables stop

#chkconfig iptables off

#iptables -F

#service ip6tables stop

#chkconfig ip6tables off

#ip6tables -F

#service NetworkManager stop

#chkconfig NetworkManager off
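
You can confirm that the firewall is stopped and disabled on boot (both commands should report the service as stopped/off):

[root@node01 ~]# service iptables status

[root@node01 ~]# chkconfig --list iptables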

 

Step 5: Setup Hostname and FQDN

In this step we assign hostnames to all nodes and create the Fully Qualified Domain Names (FQDNs).

  • Get the nodes properly on the network.
  • Ensure that the hostnames and IP addresses are correct and correctly recorded in the DNS entries and/or in the hosts entries.
  • Edit the /etc/hosts file on all master and slave servers.

 

[root@node01 ~]# cat /etc/hosts

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

 

 

 

192.168.1.31    namenode01.unixadmin.in        namenode01

192.168.1.32    namenode02.unixadmin.in        namenode02

192.168.1.33    datanode01.unixadmin.in          datanode01

192.168.1.34    datanode02.unixadmin.in          datanode02
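
The hostname itself is set persistently on RHEL/CentOS 6 in /etc/sysconfig/network; a sketch for namenode01 (repeat on every node with its own name):

[root@namenode01 ~]# vim /etc/sysconfig/network

HOSTNAME=namenode01.unixadmin.in

[root@namenode01 ~]# hostname namenode01.unixadmin.in

[root@namenode01 ~]# hostname -f

namenode01.unixadmin.in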

 

 

Step 6: Configure SSH auto login

  • SSH setup is required to perform different operations on the cluster, such as starting and stopping the distributed daemons. To authenticate the different Hadoop users, a public/private key pair is generated for the Hadoop user and shared with the other nodes.
  • Verify node-to-node ssh communication using the nodes' hostnames as well as their fully qualified domain names (FQDN), including ssh to self.

 

[root@namenode01 ~]# ssh-keygen

Generating public/private rsa key pair.

Enter file in which to save the key (/root/.ssh/id_rsa):

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /root/.ssh/id_rsa.

Your public key has been saved in /root/.ssh/id_rsa.pub.

The key fingerprint is:

10:a9:10:7e:50:3c:db:f6:10:43:17:87:de:9b:18:cc root@bridge.unixadmin.in

The key’s randomart image is:

+--[ RSA 2048]----+

|  o+..o.oo.      |

| …o +o..       |

|  …=.* .       |

|   .o +.E .      |

|     . oSo o     |

|        o o      |

|                 |

|                 |

|                 |

+-----------------+

 

[root@namenode01 ~]# ssh-copy-id localhost

The authenticity of host ‘localhost (::1)’ can’t be established.

RSA key fingerprint is cf:a1:ab:1c:be:a6:2b:ba:94:64:db:df:bd:52:a5:67.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added ‘localhost’ (RSA) to the list of known hosts.

root@localhost’s password:

Now try logging into the machine, with “ssh ‘localhost'”, and check in:

 

.ssh/authorized_keys

 

to make sure we haven’t added extra keys that you weren’t expecting.

 

[root@namenode01 ~]# ssh-copy-id 192.168.1.21

root@192.168.1.21’s password:

Now try logging into the machine, with “ssh ‘192.168.1.21’”, and check in:

 

.ssh/authorized_keys

 

to make sure we haven’t added extra keys that you weren’t expecting.
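
The same key can be copied to the remaining nodes and the passwordless login verified in one pass; a sketch using the hostnames recorded in /etc/hosts above (extend the list to every node in your plan):

[root@namenode01 ~]# for host in namenode01 namenode02 datanode01 datanode02; do ssh-copy-id root@$host; ssh $host hostname -f; done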

 

Java Installation:

While installing Java from the RPM file I ran into issues several times, so I found a better way: installing Java from the archive on Oracle's site. Using the steps below I have installed Java successfully many times without facing any issues, and multiple versions of Java can also be installed easily if required.

Download the latest Java archive:

[root@namenode01 ~]# wget http://download.oracle.com/otn-pub/java/jdk/8u101-b14/jdk-8u101-linux-x64.tar.gz

# tar -xvzf jdk-8u101-linux-x64.tar.gz

Install java with alternatives:

After extracting the Java archive, we just need to set the newer version of Java as the default using alternatives.

[root@namenode01 ~]# cd /opt/jdk1.8.0_101/

 

[root@namenode01 ~]# alternatives --install /usr/bin/java java /opt/jdk1.8.0_101/bin/java 2

[root@namenode01 ~]# alternatives --config java

 

There are 4 programs which provide 'java'.

Selection    Command

-----------------------------------------------

*  1           /opt/jdk1.7.0_71/bin/java

+ 2           /opt/jdk1.8.0_45/bin/java

3           /opt/jdk1.8.0_91/bin/java

4           /opt/jdk1.8.0_101/bin/java

 

Enter to keep the current selection[+], or type selection number: 4

 

At this point Java 8 has been successfully installed on your system. We also recommend setting up the javac and jar command paths using alternatives.

[root@namenode01 ~]# alternatives --install /usr/bin/jar jar /app/jdk1.8.0_101/bin/jar 2

[root@namenode01 ~]# alternatives --install /usr/bin/javac javac /app/jdk1.8.0_101/bin/javac 2

[root@namenode01 ~]# alternatives --set jar /app/jdk1.8.0_101/bin/jar

[root@namenode01 ~]# alternatives --set javac /app/jdk1.8.0_101/bin/javac

 

Check Installed Java Version:

[root@namenode01 ~]# java -version

java version "1.8.0_101"

Java(TM) SE Runtime Environment (build 1.8.0_101-b14)

Java HotSpot(TM) 64-Bit Server VM (build 25.101-b14, mixed mode)

 

Configuring Environment Variables:

 

  • Setup JAVA_HOME variable

 

[root@namenode01 ~]# vim /root/.bashrc

export JAVA_HOME=/app/jdk1.8.0_101

 

  • Setup JRE_HOME Variable
[root@namenode01 ~]# vim /root/.bashrc

export JRE_HOME=/app/jdk1.8.0_101/jre

 

  • Setup PATH Variable
[root@namenode01 ~]# vim /root/.bashrc

export PATH=$PATH:/app/jdk1.8.0_101/bin:/app/jdk1.8.0_101/jre/bin
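
  • Reload .bashrc and confirm the variables are visible in the current shell (a quick sanity check):
[root@namenode01 ~]# source /root/.bashrc

[root@namenode01 ~]# echo $JAVA_HOME

[root@namenode01 ~]# which java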

 

Installation of hadoop-2.7.3

Download the latest available Hadoop version from its official site, on the namenode only.

Create a directory, download the Hadoop tarball into it, and extract it using the tar -xvzf command.

[root@namenode01 ~]# mkdir /app

[root@namenode01 ~]# cd /app

[root@namenode01 ~]# wget http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

[root@namenode01 ~]# tar -xvzf hadoop-2.7.3.tar.gz

 

Setup environment variables:

  • Set the environment variables in the /root/.bashrc configuration file.
[root@namenode01 ~]# vim /root/.bashrc

 

export HADOOP_HOME=/app/hadoop-2.7.3

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=/app/hadoop-2.7.3/etc/hadoop

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

  

Export the variables

  • Now apply all the changes to the current running system by reloading the /root/.bashrc configuration file.
[root@namenode01 ~]# source /root/.bashrc
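
If the variables are in effect, the hadoop command is now on the PATH and reports the installed release (a quick check):

[root@namenode01 ~]# hadoop version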

 

Edit configuration files:

core-site.xml:

          core-site.xml is the configuration file in Hadoop where you keep your HDFS-related settings.

For example: the namenode host and port, the local directory where namenode-related data is saved, and so on.

[root@namenode01 ~]# vim core-site.xml

 

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

</property>
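
Note that the <property> element above sits inside the <configuration> element of core-site.xml, and on a multi-node cluster the namenode's hostname is normally used instead of localhost. The effective value can be read back with hdfs getconf; fs.defaultFS is the current name of the deprecated fs.default.name key used above (a verification sketch):

[root@namenode01 ~]# hdfs getconf -confKey fs.defaultFS

hdfs://localhost:9000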

 

hdfs-site.xml:

  • The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes. Here we configure hdfs-site.xml to specify the namenode and datanode storage directories and the default block replication.
  • The actual number of replications can also be specified when a file is created.
  • The default is used if replication is not specified at creation time.
[root@namenode01 ~]# vim hdfs-site.xml

 

 

<property>

<name>dfs.namenode.name.dir</name>

<value>file:/app/hadoop-2.7.3/hdfs/name</value>

</property>

 

<property>

<name>dfs.datanode.data.dir</name>

<value>file:/app/hadoop-2.7.3/hdfs/data</value>

</property>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>
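
It is safest to create the local directories named above before formatting and starting the daemons; a minimal sketch (the name directory on the namenode, the data directory on every datanode):

[root@namenode01 ~]# mkdir -p /app/hadoop-2.7.3/hdfs/name

[root@datanode01 ~]# mkdir -p /app/hadoop-2.7.3/hdfs/data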

 

mapred-site.xml:

  • The mapred-site.xml file contains the configuration settings for the MapReduce daemons (the file is first created from its template, as shown below):
    • the JobTracker
    • the TaskTracker
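
In Hadoop 2.7.3 this file is not shipped ready-made; it is created from the bundled template in the configuration directory (assuming $HADOOP_CONF_DIR as exported earlier) and then edited:

[root@namenode01 ~]# cd $HADOOP_CONF_DIR

[root@namenode01 ~]# cp mapred-site.xml.template mapred-site.xml
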
[root@namenode01 ~]# vim mapred-site.xml

 

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

 

yarn-site.xml:

YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, and restarting them in case of failure, among other housekeeping. Just like HDFS, it also has two main components:

  • A ResourceManager, which keeps track of the cluster resources, and
  • A NodeManager on each of the nodes, which communicates with the ResourceManager and sets up containers for the execution of tasks.

 

 

 [root@namenode01 ~]# vim yarn-site.xml

 

<property>

<name>yarn.resourcemanager.resource-tracker.address</name>

<value>192.168.1.41:8025</value>

</property>

<property>

<name>yarn.resourcemanager.scheduler.address</name>

<value>192.168.1.41:8030</value>

</property>

<property>

<name>yarn.resourcemanager.address</name>

<value>192.168.1.41:8050</value>

</property>

 

<property>

<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>

<value>org.apache.hadoop.mapred.ShuffleHandler</value>

</property>

<property>

<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>

<value>0</value>

</property>

 

HDFS Operations

Format namenode:

Format the configured HDFS file system: log in to the namenode (the HDFS server) and execute the following command.

[root@namenode01 ~]# hdfs namenode -format

 

Now run start-dfs.sh script:

After formatting the namenode, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.

[root@namenode01 ~]# start-dfs.sh

Now run start-yarn.sh script:

After starting the DFS, start YARN. The following command will start the ResourceManager as well as the NodeManagers on the cluster nodes.

[root@namenode01 ~]# start-yarn.sh
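
Once both scripts have completed, the running daemons and the registered nodes can be verified (jps comes with the JDK installed earlier; the report commands query the namenode and the ResourceManager):

[root@namenode01 ~]# jps

[root@namenode01 ~]# hdfs dfsadmin -report

[root@namenode01 ~]# yarn node -list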

 
