Hadoop Introduction
Why is Hadoop needed?
- Social networking and e-commerce websites track customer behaviour on the site and then serve relevant information and products.
- Any global bank today has more than 100 million customers doing billions of transactions every month.
Traditional systems find it difficult to cope with this scale at the required pace in a cost-efficient manner.
This is where big data platforms come in. In this article, we introduce you to the world of Hadoop. Hadoop comes in handy when we deal with enormous data. It may not make each individual process faster, but it gives us the capability to use parallel processing to handle big data. In short, Hadoop gives us the capability to deal with the complexities of high volume, velocity and variety of data (popularly known as the 3Vs).
Introduction:
Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Hadoop is a complete ecosystem of open source projects that provides us with a framework to deal with big data.
Pre-installation setup
Step 1: Plan your cluster
- Namenode01
- Namenode02
- Namenode03
- Secondarynamenode
- Datanode01
- Datanode02
- Datanode03
- Datanode04
- Datanode05
- Datanode06
Namenode:
- Manages the file system namespace.
- Regulates clients' access to files.
- It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode:
- Datanodes perform read/write operations on the file system, as per client requests.
- They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode (a small command sketch follows this list).
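To make these roles concrete, here is a small, hedged sketch of HDFS client commands that can be run once the cluster is up later in this guide (paths and file names are illustrative): namespace operations such as -mkdir and -ls are answered by the namenode, while the blocks of a file written with -put are stored and replicated by the datanodes.
# create a directory in the HDFS namespace (metadata handled by the namenode)
[root@namenode01 ~]# hdfs dfs -mkdir /data
# upload a local file; its blocks are written to and replicated across the datanodes
[root@namenode01 ~]# hdfs dfs -put sample.txt /data/
# list the directory; the listing is served from the namenode's metadata
[root@namenode01 ~]# hdfs dfs -ls /data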
Step 2: Hadoop requirements.
You must make sure that the minimum hardware and software requirements are met.
Hardware requirements
Before you install Hadoop, make sure that the following minimum hardware requirements are met.
Minimum hardware requirements for the namenode node:
- 25GB free disk space.
- 2GB of physical memory (RAM).
- 2 statically configured Ethernet interfaces.
Minimum hardware requirements for datanode nodes:
- 25 GB free disk space.
- 2 GB of physical memory (RAM).
- 2 statically configured Ethernet interfaces.
Software requirements
One of the following operating systems is required:
- Red Hat Enterprise Linux (RHEL) 6.5 x86 (64-bit).
- CentOS 6.8 x86 (64-bit).
Step 4: Hadoop environment
Configure the Interfaces
Each node in the cluster needs two Network Interface Cards (NICs), so the two interfaces are normally configured while installing the operating system.
#ifcfg-eth0
[root@node01 ~]# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:2A
          inet addr:192.168.1.21  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fec6:ca2a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:84 errors:0 dropped:0 overruns:0 frame:0
          TX packets:43 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7891 (7.7 KiB)  TX bytes:8067 (7.8 KiB)
#ifcfg-eth1
[root@node01 ~]# ifconfig eth1
eth1      Link encap:Ethernet  HWaddr 00:0C:29:C6:CA:34
          inet addr:192.168.2.21  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fec6:ca34/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:38 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2651 (2.5 KiB)  TX bytes:636 (636.0 b)
Network configuration
Here we assign a static IP address to each interface; after editing, the network service is restarted as shown below the two files.
#vim /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
HWADDR=00:0c:29:c6:ca:2a
TYPE=Ethernet
UUID=f9ca22db-53c2-415f-9b20-979176564aa9
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR=192.168.1.21
NETMASK=255.255.255.0
IPV6INIT=no
USERCTL=no
#vim /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
HWADDR=00:0c:29:c6:ca:34
TYPE=Ethernet
UUID=b65b93be-7086-40bf-8ddc-3515464de764
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=none
IPADDR=192.168.2.21
NETMASK=255.255.255.0
IPV6INIT=no
USERCTL=no
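After editing the interface files, restart the network service so the new addresses take effect and confirm them (commands shown for RHEL/CentOS 6):
[root@node01 ~]# service network restart
[root@node01 ~]# ip addr show eth0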
Stop the Firewall
#service iptables stop
#chkconfig iptables off
#iptables -F
#service ip6tables stop
#chkconfig ip6tables off
#ip6tables -F
#service NetworkManager stop
#chkconfig NetworkManager off
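Optionally, confirm that the firewall is stopped and will not start at boot (a quick check, not strictly required):
#service iptables status
#chkconfig --list iptables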
Step 5: Setup Hostname and FQDN
In this step we assign hostnames to all nodes and create Fully Qualified Domain Names (FQDNs); a sketch for setting the hostname itself follows the hosts file below.
- Get the nodes properly on the network.
- Ensure that the hostnames and IP addresses are correct and correctly recorded in DNS entries and/or in hosts entries.
- Edit the /etc/hosts file on all master and slave servers.
[root@node01 ~]# cat /etc/hosts
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.31    namenode01.unixadmin.in    namenode01
192.168.1.32    namenode02.unixadmin.in    namenode02
192.168.1.33    datanode01.unixadmin.in    datanode01
192.168.1.34    datanode02.unixadmin.in    datanode02
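The hostname itself is set on each node as well (RHEL/CentOS 6 style, shown for namenode01 as an example), and a quick ping against an entry from the hosts file confirms that name resolution works:
[root@namenode01 ~]# hostname namenode01.unixadmin.in
[root@namenode01 ~]# vim /etc/sysconfig/network
HOSTNAME=namenode01.unixadmin.in
[root@namenode01 ~]# ping -c 2 datanode01.unixadmin.in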
Step 6: Configure SSH auto login
- SSH setup is required to perform different operations on the cluster, such as starting and stopping the distributed daemons. To authenticate the Hadoop users, a public/private key pair is generated for the hadoop user and the public key is shared with the other nodes.
- Verify node-to-node SSH communication using the node's hostname as well as its fully qualified domain name (FQDN), including SSH to the node itself; a verification sketch follows the commands below.
[root@namenode01 ~]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
10:a9:10:7e:50:3c:db:f6:10:43:17:87:de:9b:18:cc root@bridge.unixadmin.in
The key's randomart image is:
+--[ RSA 2048]----+
|  o+..o.oo.      |
|   ...o +o..     |
|   ...=.* .      |
|    .o +.E .     |
|     . oSo o     |
|        o o      |
|                 |
|                 |
|                 |
+-----------------+
[root@namenode01 ~]# ssh-copy-id localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is cf:a1:ab:1c:be:a6:2b:ba:94:64:db:df:bd:52:a5:67.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
root@localhost's password:
Now try logging into the machine, with "ssh 'localhost'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
[root@namenode01 ~]# ssh-copy-id 192.168.1.21
root@192.168.1.21's password:
Now try logging into the machine, with "ssh '192.168.1.21'", and check in:

  .ssh/authorized_keys

to make sure we haven't added extra keys that you weren't expecting.
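Once the key has been copied, a quick hedged check that passwordless login works (repeat for every node in the cluster):
[root@namenode01 ~]# ssh localhost hostname
[root@namenode01 ~]# ssh 192.168.1.21 hostname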
Java Installation:
During installation of Java using the RPM file I ran into issues many times. I found it more reliable to install Java from the Oracle archive; using the steps below I have installed Java successfully many times without facing any issues, and multiple versions of Java can also be installed easily if required.
Download the latest Java archive:
[root@namenode01 ~]# wget "http://download.oracle.com/otn-pub/java/jdk/8u101-b14/jdk-8u101-linux-x64.tar.gz"
# tar -xvzf jdk-8u101-linux-x64.tar.gz
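The remaining Java steps reference the JDK under /opt; assuming the archive was extracted in the current directory as above, move it there first (the target directory is this guide's assumption — adjust if you keep the JDK elsewhere):
[root@namenode01 ~]# mv jdk1.8.0_101 /opt/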
Install java with alternatives:
After extracting the Java archive, we just need to set the newer version of Java as the default using alternatives.
[root@namenode01 ~]# cd /opt/jdk1.8.0_101/
[root@namenode01 ~]# alternatives --install /usr/bin/java java /opt/jdk1.8.0_101/bin/java 2
[root@namenode01 ~]# alternatives --config java
There are 4 programs which provide 'java'.

  Selection    Command
-----------------------------------------------
*  1           /opt/jdk1.7.0_71/bin/java
 + 2           /opt/jdk1.8.0_45/bin/java
   3           /opt/jdk1.8.0_91/bin/java
   4           /opt/jdk1.8.0_101/bin/java

Enter to keep the current selection[+], or type selection number: 4
At this point, Java 8 has been successfully installed on your system. We also recommend setting up the javac and jar command paths using alternatives.
[root@namenode01 ~]# alternatives --install /usr/bin/jar jar /opt/jdk1.8.0_101/bin/jar 2
[root@namenode01 ~]# alternatives --install /usr/bin/javac javac /opt/jdk1.8.0_101/bin/javac 2
[root@namenode01 ~]# alternatives --set jar /opt/jdk1.8.0_101/bin/jar
[root@namenode01 ~]# alternatives --set javac /opt/jdk1.8.0_101/bin/javac
Check Installed Java Version:
[root@namenode01 ~]# java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b14, mixed mode)
Configuring Environment Variables:
- Setup JAVA_HOME variable
[root@namenode01 ~]# vim /root/.bashrc
export JAVA_HOME=/opt/jdk1.8.0_101
- Setup JRE_HOME Variable
[root@namenode01 ~]# vim /root/.bashrc
export JRE_HOME=/opt/jdk1.8.0_101/jre
- Setup PATH Variable
[root@namenode01 ~]# vim /root/.bashrc
export PATH=$PATH:/opt/jdk1.8.0_101/bin:/opt/jdk1.8.0_101/jre/bin
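A quick, hedged check that the variables resolve as expected after reloading the file:
[root@namenode01 ~]# source /root/.bashrc
[root@namenode01 ~]# echo $JAVA_HOME
/opt/jdk1.8.0_101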
Installation of hadoop-2.7.3
Download the latest available Hadoop version from its official site, on the namenode only.
Create a directory, download the Hadoop tarball into it, and extract it with the tar -xvzf command.
[root@namenode01 ~]# mkdir /app
[root@namenode01 ~]# cd /app
[root@namenode01 ~]# wget http://apache.claz.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
[root@namenode01 ~]# tar -xvzf hadoop-2.7.3.tar.gz
Setup environment variables:
- Set the environment variables in the /root/.bashrc configuration file.
[root@namenode01 ~]# vim /root/.bashrc
export HADOOP_HOME=/app/hadoop-2.7.3
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=/app/hadoop-2.7.3/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Export the variables
- Now apply all the changes to the currently running shell by sourcing the /root/.bashrc configuration file.
[root@namenode01 ~]# source /root/.bashrc
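If the variables are picked up correctly, the Hadoop binaries are now on the PATH; a quick check (the first line of output should report the release):
[root@namenode01 ~]# hadoop version
Hadoop 2.7.3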
Edit configuration files:
Core-site.xml:
core-site.xml is a configuration file in Hadoop where you keep your HDFS-wide settings, e.g. the namenode host and port and the local directories where namenode-related data is saved. The configuration files edited below live under $HADOOP_CONF_DIR, i.e. /app/hadoop-2.7.3/etc/hadoop. (On a multi-node cluster the filesystem URI is normally the namenode's hostname rather than localhost.)
[root@namenode01 ~]# vim core-site.xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
Hdfs-site.xml:
- The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes. Here we configure hdfs-site.xml to specify the default block replication and permission checking on HDFS.
- The actual number of replications can also be specified when a file is created; an example is shown after the configuration below.
- The default is used if replication is not specified at creation time.
[root@namenode01 ~]#vim hdfs-site.xml
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/app/hadoop-2.7.3/hdfs/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/app/hadoop-2.7.3/hdfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
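As mentioned above, the replication factor can also be set per file rather than relying on the configured default; a hedged sketch (file and directory names are illustrative):
# write a file with a replication factor of 1 instead of the configured default of 2
[root@namenode01 ~]# hdfs dfs -D dfs.replication=1 -put sample.txt /data/sample.txt
# change the replication factor of an existing file and wait for it to complete
[root@namenode01 ~]# hdfs dfs -setrep -w 2 /data/sample.txt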
Mapred-site.xml
- The mapred-site.xml file contains the configuration settings for the MapReduce framework. In Hadoop 2, MapReduce runs on YARN rather than the classic JobTracker and TaskTracker daemons, which is what we configure here.
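In Hadoop 2.7.3 this file does not exist by default; it is usually created from the template shipped in the configuration directory before editing:
[root@namenode01 ~]# cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml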
[root@namenode01 ~]#vim mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Yarn-site.xml:
YARN is the component responsible for allocating containers to run tasks, coordinating the execution of those tasks, and restarting them in case of failure, among other housekeeping. Just like HDFS, it has two main components:
- A ResourceManager, which keeps track of the cluster resources, and
- A NodeManager on each of the nodes, which communicates with the ResourceManager and sets up containers for the execution of tasks.
[root@namenode01 ~]# vim yarn-site.xml
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>192.168.1.41:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>192.168.1.41:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>192.168.1.41:8050</value>
</property>
<!-- the aux-services entry registers the MapReduce shuffle service referenced by the class below -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0</value>
</property>
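In a multi-node setup, the slaves file normally lists the datanode hostnames and the same Hadoop directory (including this configuration) is copied to every datanode. A hedged sketch, using the hostnames from the hosts file earlier in this guide:
[root@namenode01 ~]# vim /app/hadoop-2.7.3/etc/hadoop/slaves
datanode01
datanode02
[root@namenode01 ~]# ssh datanode01 "mkdir -p /app"
[root@namenode01 ~]# scp -r /app/hadoop-2.7.3 datanode01:/app/
[root@namenode01 ~]# ssh datanode02 "mkdir -p /app"
[root@namenode01 ~]# scp -r /app/hadoop-2.7.3 datanode02:/app/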
HDFS Operations
Format namenode:
Format the configured HDFS file system. On the namenode (the HDFS master), execute the following command.
[root@namenode01 ~]# hdfs namenode -format
Now run start-dfs.sh script:
After formatting the namenode, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
[root@namenode01 ~]# start-dfs.sh
Now run start-yarn.sh script:
After starting DFS, start YARN. The following command will start the ResourceManager as well as the NodeManagers on the cluster.
[root@namenode01 ~]# start-yarn.sh
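To verify that the daemons came up, a few hedged checks (the web UI ports are the Hadoop 2.x defaults):
# list the Java daemons running on this node (expect NameNode, SecondaryNameNode, ResourceManager, ...)
[root@namenode01 ~]# jps
# show the datanodes that have registered with the namenode
[root@namenode01 ~]# hdfs dfsadmin -report
The HDFS web UI is served at http://namenode01:50070 and the YARN ResourceManager UI at http://namenode01:8088.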