Sunday, January 29, 2012

How to run a program on an Apache Hadoop cluster

What is Apache Hadoop? It is a project that develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers.

Setting up Apache Hadoop in practice brings up a lot of problems, because the documentation is sometimes incomplete and some details depend on the operating system you are using. All of our batch members ran into plenty of issues, but in the end Thilina and I got an Apache Hadoop cluster configured within about four hours. I am writing that knowledge up here because it may help you. This tutorial has two parts: setting up SSH connections between the nodes, and configuring Hadoop.

Resources

Ubuntu 11.04,11.10
hadoop-0.20.2.tar.gz
JDK 1.6.0_24

Please do everything in the order given below.

1. Install the ssh and rsync packages on your machine

    $ sudo apt-get install ssh
    $ sudo apt-get install rsync

If you get an error, update your machine's package lists like this and then run the above commands again

    $sudo apt-get update

2. First you need to disable the firewall, because it can block the ports that Hadoop's processes use, which makes the cluster hard to configure.

    $sudo ufw disable 

3. Create a user group as the super user

    $su root
    $sudo groupadd hadoop_user


4. Create the user hadoop and add it to the group you just created

     $sudo useradd --home-dir /home/hadoop --create-home --shell /bin/bash -U hadoop
     $sudo usermod -a -G hadoop_user hadoop 
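
To confirm that the user and the group were created correctly, you can check the account's group membership (just a quick sanity check; the exact list of groups may differ on your machine):

    $id hadoop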


5. Set a password for the new user account (whatever you want); you will be asked to enter it twice

    $passwd hadoop


6. Then check whether the new account works

    $su hadoop

7. Restart the machine and log in to the hadoop account

8. Generate an SSH key pair on the machine

    $ ssh-keygen -t rsa
 
Press Enter at the prompts; this creates a public key and a private key in the .ssh directory of the hadoop account. Repeat the steps above on the master node and on every slave machine.
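
If you want to verify that the keys were generated, list the .ssh directory of the hadoop account; with the default ssh-keygen settings you should see the private key id_rsa and the public key id_rsa.pub:

    $ls -l /home/hadoop/.ssh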



9. Append the public key to the authorized_keys file on the same machine

    $cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
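
Note that SSH refuses key-based logins if the permissions on the .ssh directory or the authorized_keys file are too open, so if passwordless login still fails later it is worth tightening them (a common SSH fix, not something specific to Hadoop):

    $chmod 700 /home/hadoop/.ssh
    $chmod 600 /home/hadoop/.ssh/authorized_keys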

10. Copy the master's public key to every slave node

    $scp /home/hadoop/.ssh/id_rsa.pub IPADDRESS of slave:/home/hadoop/.ssh/master.pub


11. Then log in to each slave node from the master node and run this (it appends the master's public key to the slave's authorized_keys file)

    $cat /home/hadoop/.ssh/master.pub >> /home/hadoop/.ssh/authorized_keys

12. Check that you can now log in to the slave machines, and to localhost, without being asked for a password

    $ssh IPADDRESS of slave
    $ssh localhost


13. Log in to the hadoop account and create a project folder in the hadoop home directory

    $mkdir -p /home/hadoop/project

14. Install Hadoop into the project folder (copy hadoop-0.20.2.tar.gz into the project folder first)
     
    $cd /home/hadoop/project
    $tar -xzvf ./hadoop-0.20.2.tar.gz


15. Set the environment variables in the .profile or .bashrc file

    export JAVA_HOME=/home/hadoop/jdk1.6.0_24    (if you installed the JDK manually to this location)
    export HADOOP_HOME=/home/hadoop/project/hadoop-0.20.2
    export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH 


If you did not install the JDK manually into your home directory, remove the first line above (or point JAVA_HOME at your system JDK instead; an example follows).
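
For example, if the JDK came from the Ubuntu packages, JAVA_HOME would typically point somewhere under /usr/lib/jvm (the exact path below is an assumption; check what is actually installed on your machine):

    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

After editing .profile or .bashrc, log out and back in (or run source ~/.bashrc) so the new variables take effect.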


16. Set the JAVA_HOME environment variable in hadoop/conf/hadoop-env.sh
     Uncomment the line and change the Java home to match your Java installation

    export JAVA_HOME=/home/hadoop/jdk1.6.0_24

17. Configure the Hadoop parameters in the XML files in the conf folder.

Change HADOOP_HOME/conf/core-site.xml. Using the master node's IP address here works fine; if you would rather use a host name, you also need to add the corresponding entries to the /etc/hosts file on every node (see the example after the listing below).

    <configuration>
        <property>
               <name>fs.default.name</name>
                <value>hdfs://192.168.10.2:9000/</value>
        </property>
        <property>
               <name>hadoop.tmp.dir</name>
               <value>/tmp/hadoop-${user.name}</value>
        </property>

   </configuration>
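
If you would rather use host names than raw IP addresses in the configuration, add entries like the following to the /etc/hosts file on every node (the addresses and names here are only examples; use the real addresses of your machines):

    192.168.10.2    master
    192.168.10.3    slave1
    192.168.10.4    slave2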

Change HADOOP_HOME/conf/hdfs-site.xml

      <configuration>
            <property>
                  <name>dfs.replication</name>
                  <value>2</value>
            </property>
            <property>
                <name>dfs.block.size</name>
                <value>128000000</value>
            </property>
            <property>
                <name>dfs.permissions</name>
                <value>true</value>
            </property>
    </configuration>


Change HADOOP_HOME/conf/mapred-site.xml. You can run the job tracker on a different machine if you want; to do that you only need to change the job tracker's IP address here.

    <configuration>
          <property>
             <name>mapred.job.tracker</name>
             <value>hdfs://192.168.10.2:9001</value>
          </property>
          <property>
             <name>mapred.child.java.opts</name>
             <value>-Xmx512m</value>
          </property>

    </configuration> 

After changing the XML files, put the master's IP address in the masters file in the conf folder and the slaves' IP addresses (one per line) in the slaves file in the conf folder; a sample is shown below.
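
For example, with one master and two slaves the two files could look like this (the slave addresses are placeholders; use your own machines' IPs):

    conf/masters:
        192.168.10.2

    conf/slaves:
        192.168.10.3
        192.168.10.4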

18. After configuring the master node you can copy the Hadoop installation to the other machines.

      
    $scp -r /home/hadoop/project IPADDRESS of slave:/home/hadoop/

Run the above command for each slave node, changing the slave's IP address.

19. Now set the environment variables on each slave node in the same way, in the .bashrc or .profile file and in hadoop/conf/hadoop-env.sh.

20. Now log in to the master node as the hadoop account and format the namenode (the command lives in the bin folder, which should be on your PATH)

    $hadoop namenode -format


21. Start the Hadoop daemons

    $start-all.sh
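
To check that the daemons actually started, run the jps command (it ships with the JDK) on each node; on the master you should typically see NameNode, SecondaryNameNode and JobTracker, and on the slaves DataNode and TaskTracker:

    $jps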

22. Run your jar file on the Hadoop cluster

    $hadoop jar /home/hadoop/smscount.jar org.sms.SmsCount /home/hadoop/smscount/input  /home/hadoop/smscount/output 

Before running it, we need to know how to handle HDFS folders:
                 
     Create folder :  $hadoop dfs -mkdir /home/hadoop/smscount/input
     List file and folder :  $hadoop dfs -ls /home/hadoop/smscount/input
     Remove folder :  $hadoop dfs -rmr /home/hadoop/smscount/input
     put file : $hadoop dfs -put /home/hadoop/file1 /home/hadoop/smscount/input
          
When the smscount jar runs, it creates the HDFS output folder automatically. Remember that if you need to run it a second time you must remove the output folder first, because otherwise the job throws an exception.
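
For example, to run the same job a second time you would delete the old output folder first and then submit the jar again (the same commands as above, just in this order):

    $hadoop dfs -rmr /home/hadoop/smscount/output
    $hadoop jar /home/hadoop/smscount.jar org.sms.SmsCount /home/hadoop/smscount/input /home/hadoop/smscount/output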

23. To view the result

    $hadoop dfs -cat /home/hadoop/smscount/output/part-00000
 
24. Stop the Hadoop daemons

    $stop-all.sh


25. If everything is running, you can check the following web interfaces and enjoy it.

    Hadoop Distributed File System (HDFS):
        http://IPADDRESS of namenode:50070

    Hadoop Jobtracker:
        http://IPADDRESS of jobtracker:50030

    Hadoop Tasktracker:
        http://IPADDRESS of tasktracker:50060