Saturday, July 11, 2015

Storm Installation on CentOS


Hi Folks, Storm is one of the best real-time processing systems available today, and companies have started using it at large scale. It is a distributed real-time computation system, similar to Spark.

You might have heard of Storm being integrated into many data pipelines for data processing. One unique feature that sets it apart is that once a topology is started, it runs forever until you kill it.

The basic difference from Hadoop is that Storm is for real-time processing, while Hadoop is for batch processing.



So let's see how we can get it up and running on our system.

Step 1: Download Storm from the official Apache site and unzip it. You will find a couple of folders and the Storm jar.

 $  wget https://github.com/downloads/nathanmarz/storm/storm-0.8.1.zip
 $ unzip storm-0.8.1.zip

[storm@kafka ~]$ ll storm-0.8.1
total 4780
drwxr-xr-x. 2 storm storm    4096 Sep  6  2012 bin
-rw-r--r--. 1 storm storm   19981 Sep  6  2012 CHANGELOG.md
drwxr-xr-x. 2 storm storm    4096 Jun 25 08:19 conf
drwxrwxr-x. 4 storm storm    4096 Jun 25 05:24 data
drwxr-xr-x. 2 storm storm    4096 Sep  6  2012 lib
-rw-r--r--. 1 storm storm   12710 Sep  6  2012 LICENSE.html
drwxr-xr-x. 2 storm storm    4096 Sep  6  2012 log4j
drwxr-xr-x. 2 storm storm    4096 Jun 25 08:21 logs
-rw-------. 1 storm storm   25640 Jun 25 07:32 nohup.out
drwxr-xr-x. 4 storm storm    4096 Sep  6  2012 public
-rw-r--r--. 1 storm storm    3730 Sep  6  2012 README.markdown
-rw-r--r--. 1 storm storm       6 Sep  6  2012 RELEASE
-rw-r--r--. 1 storm storm 4789764 Sep  6  2012 storm-0.8.1.jar


Step 2: Download ZeroMQ from its official site.

$ wget http://download.zeromq.org/zeromq-2.1.7.zip

$ unzip  zeromq-2.1.7.zip

You will see a bunch of files and a makefile; now you need to build it with configure and make.

$ cd  zeromq-2.1.7

$  ./configure && make 

If the build fails, run the commands below to install the required libraries:

sudo yum install libuuid*
sudo yum install uuid-*
sudo yum install gcc-*  
sudo yum install git
sudo yum install libtool*
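
Once the build succeeds, you will most likely also want to install the ZeroMQ libraries system-wide so that JZMQ (next step) can find them; a minimal sketch:

$ sudo make install
$ sudo ldconfig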

Step 3: After building ZeroMQ, we need JZMQ from GitHub.

 $  git clone https://github.com/nathanmarz/jzmq.git
 $  cd jzmq
 $  sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/Makefile.am
 $  ./autogen.sh
 $  ./configure && make install


Step 4: Download ZooKeeper from its official site.

$ wget http://www.webhostingreviewjam.com/mirror/apache/zookeeper/stable/zookeeper-3.4.6.tar.gz
$ tar -xzf zookeeper-3.4.6.tar.gz
$ mkdir zookeeper-3.4.6/data

 Now update conf/zoo.cfg (you can copy conf/zoo_sample.cfg as a starting point) with the data folder and port number.


# absolute path to the data folder created above
dataDir=/home/zookeeper/zookeeper-3.4.6/data
# the port at which the clients will connect
clientPort=2181
 
 
Step 5: Update the Storm configuration file (conf/storm.yaml) with the following entries.

$ vi conf/storm.yaml
########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
    - "192.168.99.141"                      # your IP address
storm.zookeeper.port: 2181
nimbus.host: "192.168.99.141"               # your IP address
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
nimbus.thrift.port: 8627
ui.port: 8772
storm.local.dir: "/home/storm/storm-0.8.1/data"     # your data dir path
java.library.path: "/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
 
Step 6: Start the nimbus, supervisor, and UI services.

Before that, add the lines below to your .profile file:

export STORM_HOME="/home/storm/storm-0.8.1"
export JAVA_HOME="/usr"
export PATH=$STORM_HOME/bin:$JAVA_HOME/bin:$PATH
export ZOOKEEPER_HOME="/home/zookeeper/zookeeper-3.4.6"
export PATH=$ZOOKEEPER_HOME/bin:$PATH
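
Reload the profile so the new variables take effect (assuming a bash login shell):

$ source ~/.profile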
 

Now start the services in background.

$  zkServer.sh start
$ nohup storm nimbus &
$ nohup storm supervisor &
$ nohup storm ui &

Now you can see the services running:

[storm@kafka ~]$ jps
3354 core
3247 nimbus
3440 Jps
3332 supervisor

3083 QuorumPeerMain

 You can view the web UI at http://localhost:8772 (the ui.port we configured above).
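
To verify the cluster end to end, you could also submit a test topology. This assumes you have built the storm-starter examples separately (not covered in this post), so the jar name below is illustrative:

$ storm jar storm-starter-jar-with-dependencies.jar storm.starter.WordCountTopology wordcount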

 

Monday, July 6, 2015

Kafka Implementation on Centos


Hi Folks, here I am going to show how to set up Kafka on a Linux host and make use of it. The Kafka installation is pretty simple and can be done in a matter of a few minutes.

Let's first understand what Kafka is and why we use it, along with some of the basics.

Kafka: It is a distributed messaging system with different components to produce/publish and consume message streams. It is fault tolerant and delivers consistently high throughput, which is very helpful even when you have to process live streams of terabytes of data. So let's look at its architecture.






Zookeeper: As we all know, it is used for maintaining the state of processes and jobs; in this case it maintains and updates the consumed message offsets, stores the broker addresses, and so on. ZooKeeper is required to run Kafka on a machine.

Producer: Producers create topics and send the messages for those topics to a broker for further processing.

Broker: Brokers store the data written by producers and can serve multiple reads and writes at a time.

Consumer: Consumers poll messages from the broker and use them.

Installation of kafka

1. Download Kafka from its official site.

2. Untar it somewhere; I did it in my user's home directory.

$  tar -xvzf kafka_2.10-0.8.2.0.tgz
$  mv kafka_2.10-0.8.2.0   kafka-0.8.2

3. Now you have to start ZooKeeper, which is required for running Kafka, so just run the commands below:

$ cd kafka-0.8.2
$ bin/zookeeper-server-start.sh /home/kafka/kafka-0.8.2/config/zookeeper.properties

Note: you need to edit zookeeper.properties to define the data directory.
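
For example, a minimal zookeeper.properties could look like the snippet below (the dataDir path here is only an assumption; point it at any writable directory):

dataDir=/home/kafka/zookeeper-data
clientPort=2181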

4. Now that ZooKeeper is started, you can start your Kafka server, which is pretty simple:

$ bin/kafka-server-start.sh /home/kafka/kafka-0.8.2/config/server.properties
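
If you want ZooKeeper and the Kafka server to keep running after you close the terminal, you can start them in the background instead, along these lines:

$ nohup bin/zookeeper-server-start.sh config/zookeeper.properties > zookeeper.log 2>&1 &
$ nohup bin/kafka-server-start.sh config/server.properties > kafka.log 2>&1 &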

5. Now that the server is started, we can create a sample topic and publish to the broker, so let's create a sample topic:

$ bin/kafka-topics.sh --create --zookeeper kafka:2181 --replication-factor 1 --partitions 1 --topic test

where kafka:2181 is the ZooKeeper host and port, the replication factor and partition count are both 1, and the topic name is test.

6. Now you can check whether the topic was created with the command below:

$ bin/kafka-topics.sh --list --zookeeper kafka:2181

7. Now there are many ways to feed data, such as the command line or an automated live feed.

You can feed data from the command line like this:

$  bin/kafka-console-producer.sh --broker-list kafka:9092 --topic test

8. Now you can read back whatever data you typed in the command above:

$ bin/kafka-console-consumer.sh --zookeeper kafka:2181 --topic test --from-beginning

The output will be the same words or data you entered in step 7.

9. You can see the details of the topic you created with the command below:

$  bin/kafka-topics.sh --describe --zookeeper kafka:2181 --topic test


That is all for now; please comment if you have any queries or doubts.


Saturday, September 27, 2014

Spark 1.0.x on Yarn

Hi Folks, I have tried to set up Spark on Hadoop 2.x, i.e. with YARN. It is great because with YARN the cluster supports not only MapReduce but Spark and other paradigms as well.

I will write more about Spark in my next blog; for now, let me show you how to set up Spark on a YARN cluster.

1. Create a four-node cluster with HDP 2.x and MRv2 (YARN).

You can follow the same steps to install it on a Mac or on any other Linux system.

Ex.
IP                         Role
192.168.1.101 (Node1)      ActiveNameNode, RM
192.168.1.102 (Node2)      StandbyNameNode, Master, Worker
192.168.1.103 (Node3)      DataNode, Worker
192.168.1.104 (Node4)      DataNode, Worker

2. Spark's common deployment models are Spark on YARN and standalone; both can be used simultaneously.

In Spark we have a master and workers, which run on the cluster to perform tasks on YARN. Here we run the master on Node2 and workers on Node2, Node3, and Node4.

You need to download the correct versions of MRv2 and Spark; they should be compatible with each other, else you will run into compatibility issues.

So download Spark and place it inside any directory; in this case I put it inside the home directory of the mapred user.

$    ls /home/mapred/spark-1.0.2-bin-hadoop2.tar
$    cd /home/mapred/  ;   tar -xvf spark-1.0.2-bin-hadoop2.tar 

3. To deploy this model, you need to modify the spark-env.sh file in the conf directory.
Add the following configuration options:
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoop/lib/native/
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=400M
export SPARK_DRIVER_MEMORY=400M
export SPARK_YARN_APP_NAME="Spark 1.0.2"
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hadoop/lib/hadoop-lzo-0.5.0.jar:/usr/lib/hadoop/lib/
4. Copy the same configuration to all worker nodes and start the services as below.

On the Spark master node (Node2)

 $  $SPARK_HOME/sbin/start-master.sh 

On the DataNode/slave nodes

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://node2:7077 &
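
To confirm that the master and worker daemons came up on each node, a quick check (assuming jps is on the PATH):

$ jps | grep -E 'Master|Worker'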


5. Now you can run the sample Pi program:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1  /home/mapred/spark1.0.2/lib/spark-examples-1.0.2-hadoop2.2.0.jar 10

You can check the result in the YARN ResourceManager web UI.
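
Because the job runs in yarn-cluster mode, the computed Pi value ends up in the driver's log rather than in your terminal. Assuming YARN log aggregation is enabled, you can pull the logs with something like (take the application id from the ResourceManager UI):

$ yarn logs -applicationId <application_id>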



Sunday, September 14, 2014

Single disk issue in Hadoop Cluster.

Hi Folks, recently I performed a simple test on a Hadoop cluster. We have a pretty large cluster with a lot of data; each DataNode has around 24 TB of HDD (12 x 2 TB disks). Let me tell you about a common issue we faced and how we resolved it.

ISSUE: one or two disks get around 90% full while the other disks on the same node are at 50-60%, which triggered continuous alerts. It was becoming a pain for us because it happened frequently, owing to the large size of the cluster.

Resolution: We tried a few tests and finally managed to cope with this situation; let me describe what we did and how we resolved it.

1. I created a text file of about 142 MB with 10,000,000 records and copied it into HDFS.

[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000
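
In case you want to reproduce this, one simple way to generate a file with ten million records and copy it to HDFS (just an illustration; any large text file will do, and the exact size will differ):

$ seq 1 10000000 > file
$ hadoop fs -put file /user/hdfs/file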

2. Set its replication factor to 1 so that only one replica is present on the cluster.

[hdfs@ricks-01 12:13:07 ~]$ hadoop fs -setrep 1 /user/hdfs/file
Replication 1 set: /user/hdfs/file

3. Now run fsck to check the location of its blocks on the DataNodes.

[hdfs@ricks-01 12:14:32 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s):  OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]

4. Check the host where the block is present; here it turned out to be the ricks-04 node.

[hdfs@ricks-01 12:14:38 ~]$ host 17.170.204.86
86.204.170.17.in-addr.arpa domain name pointer ricks-04.

5. Now log in to that node, find the block, and move it to another disk at the same path.

[hdfs@ricks-04 12:15:55 ~]$ find /disk*/hdfs/ -name blk_1073745305
/disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305

[hdfs@ricks-04 12:16:03 ~]$ mv /disk2/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305*  /disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/

6. Now search for the block again to confirm that no other copy is present and that it has been moved to the new location, which is disk5.

[hdfs@ricks-04 12:17:48 ~]$ find /disk*/hdfs/ -name blk_1073745305*
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305
/disk5/hdfs/dfs/dn/current/BP-89257919-1406754396842/current/finalized/blk_1073745305_4481.meta

7. Now run fsck again so that the new location gets registered with the NameNode.

[hdfs@ricks-01 13:02:21 ~]$ hadoop fsck /user/hdfs/file -files -blocks -locations
/user/hdfs/file 148888897 bytes, 2 block(s):  OK
0. BP-89257919-1406754396842:blk_1073745304_4480 len=134217728 repl=1 [17.170.204.86:1004]
1. BP-89257919-1406754396842:blk_1073745305_4481 len=14671169 repl=1 [17.170.204.86:1004]

8. Now run the HDFS command again and check the line count of the file. If it gives the right count, the new location has been registered with the NameNode; if not, wait for some time and try again.

[hdfs@ricks-01 13:21:01 ~]$ hadoop fs -cat /user/hdfs/file | wc -l
10000000


Points to remember
  1. Be extra careful while moving a block from one place to another; you might want to take a backup before moving it.
  2. Make sure that no jobs are running on the node at that point in time; you can stop the TaskTracker before doing this.
  3. You can restart your DataNode service after performing this test.

Hadoop Cluster Disaster Recovery Solution 2/2

Hi Folks, in our last blog we discussed synchronous data replication across clusters, which is pretty expensive in terms of network and performance. Today we will talk about asynchronous data replication, which is less expensive than the previous approach.

So let's start: how we can go about asynchronous data replication, what kind of design we need to set it up, and how we make it work.


In the picture above we can see the design of cross-cluster data replication; let's focus on how it works between two clusters.

  1. When a client is writing a HDFS file, after the file is created, it starts to request a new block. And the primary cluster Active NameNode will allocate a new block and select a list of DataNodes for the client to write to. For the file which needs only asynchronous data replication, no remote DataNode from mirror cluster is selected for the pipeline at Active NameNode.
  2. As usual, upon a successful block allocation, the client will write the block data to the first DataNode in the pipeline, passing along the list of remaining DataNodes.
  3. As usual, the first DataNode will continue to write to the following DataNode in the pipeline until the last. But this time the pipeline doesn’t span to the mirror cluster.
  4. Asynchronously, the mirror cluster Active NameNode will actively schedule replication of data blocks which are not on any of its local DataNodes. As part of heartbeats it will send a MIRROR_REPLICATION_REQUEST containing a batch of blocks to replicate, with target DataNodes selected from the mirror cluster. The mirror cluster doesn't need to be aware of the real block locations in the primary cluster.
  5. As a result of handling the MIRROR_REPLICATION_REQUEST, the primary cluster Active NameNode takes care of selecting block location and schedules the replication command to corresponding source DataNode at primary cluster.
  6. A DataNode will be selected to replicate the data block from one of the DataNodes in primary cluster that hold the block.
  7. As a result of the replication pipeline, the local DataNode can replicate the block to other DataNodes of the mirror cluster.

Asynchronous Namespace Journaling 


Synchronous journaling to remote clusters means more latency and a performance impact. When performance is critical, the admin can configure asynchronous edit log journaling.





  1. As usual, the primary cluster Active NameNode writes the edit logs to the Shared Journal of the primary cluster.
  2. As usual, the primary cluster Standby NameNode tails the edit logs from the Shared Journal of the primary cluster.
  3. The mirror cluster Active NameNode tails the edit logs from the Shared Journal of the primary cluster and applies them to its namespace in memory.
  4. After applying the edit logs to its namespace, the mirror cluster Active NameNode also writes the edit logs to its local Shared Journal.
  5. As usual, the mirror cluster Standby NameNode tails the edit logs from the Shared Journal of the mirror cluster.
Points to remember
  1. Better performance and lower latency than synchronous data replication.
  2. There is a chance of data loss if the primary cluster goes down before asynchronous replication completes.
  3. Suitable when performance is more critical than the data.

Hadoop Cluster Disaster Recovery Solution 1/2

Hi Folks, whenever we think about cluster setup and design, we always think about DR: how can we save our data if the cluster crashes? Today we discuss disaster recovery plans for a Hadoop cluster, the steps we can take, and how far we can go to save our data.

Types of cluster design across data centers

1. Synchronous data replication between clusters
2. Asynchronous data replication between clusters

Let's talk about synchronous data writing between clusters. Here is a pictorial view of the data center design.




  1. When a client is writing a HDFS file, after the file is created, it starts to request a new block. And the Active NameNode of primary cluster will allocate a new block and select a list of DataNodes for the client to write to. By using the new mirror block placement policy, the Active NameNode can guarantee one or more remote DataNodes from the mirror cluster are selected at the end of the pipeline.
  2. The primary cluster Active NameNode knows the available DataNodes of the mirror cluster via heartbeats from mirror cluster’s Active NameNode with the MIRROR_DATANODE_AVAILABLE command. So, latest reported DataNodes will be considered for the mirror cluster pipeline which will be appended to primary cluster pipeline.
  3. As usual, upon a successful block allocation, the client will write the block data to the first DataNode in the pipeline, passing along the list of remaining DataNodes.
  4. As usual, the first DataNode will continue to write to the following DataNode in the pipeline.
  5. The last local DataNode in the pipeline will continue the write to the remote DataNode that follows it.
  6. If more than one remote DataNode is selected, the remote DataNode will continue to write to the following DataNode, which is local to that remote DataNode. We provide flexibility so that users can even configure the mirror cluster replication; based on the configured replication, mirror nodes will be selected.


Synchronous Namespace Journaling 



  1. As usual, the primary cluster Active NameNode writes the edit logs to Shared Journal of the primary cluster.
  2. The primary cluster Active NameNode also writes the edit logs to the mirror cluster Active NameNode by using a new JournalManager.
  3. As usual, the primary cluster Standby NameNode tails the edit logs from Shared Journal of the primary cluster.
  4. The mirror cluster Active NameNode writes the edit logs to Shared Journal of the mirror cluster after applying the edit logs received from the primary cluster.
  5. As usual, the mirror cluster Standby NameNode tails the edit logs from Shared Journal of the mirror cluster. 

Points to Remember

  1. Synchronous data writing is good when the data is very critical and we can't afford to lose consistency at any point in time.
  2. It does increase the latency of HDFS writes, which impacts the performance of the Hadoop cluster.
  3. It requires more network bandwidth and stability to cope with synchronous replication.



Tuesday, April 22, 2014

Linux & Hadoop Unique Commands

Hi Folks,

Today I am going to show you some important commands which you can use for different purposes.

1. Data read and written by a particular process, given its PID:
 cat /proc/$pid/io | grep -wE "read_bytes|write_bytes" | awk -F':' '{print $1 " " $2/(1024*1024) " Mb"}'
2. Delete a large number of files (here all *.gc files):
find . -name "*.gc" -print0 | xargs -0 rm
3. Generate random test data, e.g. a 50 MB file (10 blocks of 5 MB):
dd if=/dev/urandom of=a.log bs=5M count=10 
4. Replace spaces in file names with underscores:
IFS=$'\n';for f in `find .`; do file=$(echo $f | tr [:blank:] '_'); [ -e $f ] && [ ! -e $file ] && mv "$f" $file; done;unset IFS;
5. Difference between fileA and fileB (lines in fileA that are not in fileB):
awk 'BEGIN { while ( getline < "fileB" ) { arr[$0]++ } } { if (!( $0 in arr ) ) { print } }' fileA
6. Print the hostnames of DataNodes from the command line (useful when you have a large number of nodes):
for a in `hadoop dfsadmin -report | grep -i name | awk -F ':' '{print $2}'`; do host $a| awk '{print $5}' | sed 's/.$//g'; done
7. DFS % used on Hadoop nodes:
hadoop dfsadmin -report | grep -A6 Name |  tr '\n' ' ' | tr '-' '\n' | awk '{print substr($2,0,13)" "$29}' 
8. Read an XML file (format: [hdfs|core|mapred]-site.xml) from pattern A to pattern B:
cat $file | sed -n "/A/,/B/p"
9. Convert an XML file (format: [hdfs|core|mapred]-site.xml) to YAML-style key: value pairs:
cat hdfs-site.xml |   grep -e "<name>" -e "<value>" | sed 's/<name>//g;s/<value>//g;s/<\/value>//g;s/<\/name>/:/g' | perl -p -e 's/:\n/:/' 
10. Get the value of a particular parameter from an XML file (format: [hdfs|core|mapred]-site.xml):
 awk -F"[<>]" '/mapred.local.dir/ {getline;print $3;exit}' mapred-site.xml

Hope these are helpful to you :)

Wednesday, March 26, 2014

Common Error in Hadoop - Part 1

Common Error in Hadoop

Error:
10/01/18 10:52:48 INFO mapred.JobClient: Task Id : attempt_201001181020_0002_m_000014_0, Status : FAILED
  java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)


Reason:
1. The log directory might be full; check the number of userlog directories.
2. Check the size of the log directories.

Solution:
1. Increase the open-file ulimit by adding
* hard nofile 10000 to /etc/security/limits.conf
2. Clear some space by deleting old userlog directories.

Error:
The reducer does not start after map completion: the map reaches 100% and the job hangs after that (in pseudo-distributed mode).

Reason:
A problem with the /etc/hosts file.

Solution:
1. Check /etc/hosts and see whether an IP is mapped against the hostname;
if yes, remove it and use the loopback address, which is 127.0.0.1.

Error:
FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/hadoop/mydata/hdfs/
namenode is in an inconsistent state: storage directory does not exist or is not accessible.


Reason:
1. The HDFS directory doesn't exist or doesn't have the correct ownership or permissions.

Solution:
Create the directory if it does not exist and correct the ownership/permissions for the hdfs user.

Error: 
Job initialization failed: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device at

Reason:
1. The log directory of the JobTracker ran out of space.

Solution:
Clear up some space in the log directory.

Error:  
Incompatible namespaceIDS in ...: namenode namespaceID = ..., datanode namespaceID = ...

Reason:
Formatting the NameNode creates a new namespaceID, so the NameNode and the DataNodes become inconsistent.

Solution:
1. Delete the data files in the DataNode's dfs.data.dir directory (default is tmp/dfs/data), or
2. Modify the namespaceID in dfs.data.dir/current/VERSION to match the NameNode's (the error log shows the expected value), or
3. Reassign a new dfs.data.dir directory.
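
For example, option 2 could look roughly like this; the dfs.data.dir path is illustrative, so use your own, stop the DataNode first, and take the expected namespaceID from the error log:

$ sed -i 's/^namespaceID=.*/namespaceID=<namespaceID from the namenode>/' /tmp/dfs/data/current/VERSION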

Error:
The Hadoop cluster is started with start-all.sh, but the slaves always fail to start the DataNode and give the error:
Could only be replicated to 0 nodes, instead of 1


Reason:
The node identification may be duplicated (in my view this is the likely cause). There may also be other reasons; try the solutions below one by one.

Solution:
1. If it is a port-access problem, make sure the required ports are open, such as hdfs://machine1:9000 and the web ports 50030/50070. Run: iptables -I INPUT -p tcp --dport 9000 -j ACCEPT. If you get the error hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused, the DataNode port cannot be reached; adjust iptables accordingly, for example: iptables -I INPUT -s machine1 -p tcp -j ACCEPT
2. There may be firewall restrictions preventing the cluster nodes from communicating with each other. Try turning off the firewall: /etc/init.d/iptables stop
3. Finally, there may not be enough disk space; check with df -al

Error:
The program execution fails with:
Error: java.lang.NullPointerException


Reason:
Null pointer exception,  to ensure that the correct java program. Instantiated before the use of the variable what statement do not like array out of bounds. Inspection procedures.
When the implementation of the program, (various) error, make sure that the
situation:

Solution:
1. Your program compiles correctly.
2. In cluster mode, the data to be processed has been written to HDFS, and the HDFS path is correct.
3. Specify the entry class name of the jar package when executing it (I do not know why it sometimes runs even when you do not specify it).
The correct wording is similar to this:
$ hadoop jar myCount.jar myCount input output
4. The Hadoop DataNode has been started.

Error:
Unrecognized option: -jvm. Could not create the Java virtual machine.

Reason:
The bin/hadoop script under the Hadoop installation directory contains the following piece of shell:

Solution:   
  CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
   if [[ $EUID -eq 0 ]]; then
     HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
   else
     HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
   fi
$EUID is the effective user ID; for root it is 0, which triggers the -jvm option above, so try not to use the root user to operate Hadoop.

Error:
Terminal error message:
ERROR hdfs.DFSClient: Exception closing file /user/hadoop/musicdata.txt: java.io.IOException: All datanodes 10.210.70.82:50010 are bad. Aborting ...

The JobTracker logs show this error information:

Error register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName: getProtocolVersion

And possible warning information:

WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Broken pipe
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_3136320110992216802_1063java.io.IOException: Connection reset by peer
WARN hdfs.DFSClient: Error Recovery for block blk_3136320110992216802_1063 bad datanode [0] 10.210.70.82:50010 put: All datanodes 10.210.70.82:50010 are bad. Aborting ...


Solution:
1. Check whether the disks under the paths in the dfs.data.dir property are full; if so, free up some space and try the hadoop fs -put again.
2. If the disks are not full, check whether they have bad sectors; they may need to be examined.

Error:
The hadoop jar program gets this error message:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.LongWritable

Or something like this:

Status: FAILED java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Solution:
You need to learn the basics of Hadoop and the MapReduce model; see the chapters on Hadoop I/O and on MapReduce types and formats in the "Hadoop: The Definitive Guide" book. If you are eager to solve this problem quickly, I can also give you a quick fix, but skipping the fundamentals is bound to affect your later development:
Ensure the types are consistent:

    ... Extends Mapper ...
    public void map (k1 k, v1 v, OutputCollector output) ...
    ...
    ... Extends Reducer ...
    public void reduce (k2 k, v2 v, OutputCollector output) ...
    ...
    job.setMapOutputKeyClass (k2.class);
    job.setMapOutputValueClass (v2.class);
    job.setOutputKeyClass (k3.class);
    job.setOutputValueClass (v3.class);
    ...

Note the correspondence between the k* and v* types. Again, I recommend the two chapters mentioned above to learn the details of the underlying principles.

Error:
If you hit a datanode error as follows:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /data1/hadoop_data. The directory is already locked.

Reason:
As the error says, the storage directory is locked and cannot be used. Check whether a related Hadoop process (or a leftover process on the slave machine) is still running, using Linux commands such as:

    netstat -nap
    ps aux | grep <related PID>

Solution:
If a related Hadoop process is still running, kill it with the kill command, then rerun start-all.sh.

Error:
If you encounter the following JobTracker error:
ERROR: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Solution:
Modify the /etc/hosts file on the DataNode.
The hosts file format in brief:
Each line is divided into three parts: the network IP address, the hostname or domain name, and the host alias. The detailed steps are as follows:

1.first check the host name:

$ echo -e "`hostname -i`\t`hostname`\t$stn"

where stn is the short name or alias of the hostname.

It will result in something like this:

10.200.187.77             hadoop-datanode          DN

If the output shows the correct IP address and hostname, the hosts file is fine; if the hostname mapping is still wrong, continue fixing the hosts file.
If the shuffle error still appears after that, then (as some users have suggested) try modifying the hdfs-site.xml configuration file and add the following property:
dfs.http.address
*.*.*.*:50070 — do not change the port; only replace the asterisks with the actual IP. Hadoop transfers this information over HTTP, and the port stays the same.

Error:
If you encounter the following JobTracker error:
ERROR: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code *

Reason:
This is the exit code returned by the subprocess that the streaming task launched; the meaning of that error code indicates the details.

Sunday, March 23, 2014

Hadoop Installation (type RPM )

Hi Folks,

Today we are going for the RPM installation of Hadoop. It is as easy as my last Hadoop installation was, so let's try it out.

Requirement
  • Java JDK (download from here)
  • hadoop-0.20.204.0-1.i386.rpm  (Download from here)
Installation

1. Install Java and set JAVA_HOME in /etc/profile with export JAVA_HOME=/usr
sudo ./jdk-6u26-linux-x64-rpm.bin.sh
2. Hadoop RPM installation
sudo rpm -i hadoop-0.20.204.0-1.i386.rpm
3. Setting up Single Node cluster
sudo /usr/sbin/hadoop-setup-single-node.sh 
You will be asked many questions while setting up Hadoop, for example about creating directories and some configuration; answer them with y.

 For a multi-node setup you need to run the commands below.

3. Setting up a Multi-Node Cluster
sudo /usr/sbin/hadoop-setup-conf.sh \
  --namenode-host=hdfs://${namenode}:9000/ \
  --jobtracker-host=${jobtracker}:9001 \
  --conf-dir=/etc/hadoop \
  --hdfs-dir=/var/lib/hadoop/hdfs \
  --namenode-dir=/var/lib/hadoop/hdfs/namenode \
  --mapred-dir=/var/lib/hadoop/mapred \
 --mapreduce-user=mapred \
  --datanode-dir=/var/lib/hadoop/hdfs/data \
  --log-dir=/var/log/hadoop \
  --auto
 Where $namenode and $jobtracker are the hostnames of the respective nodes where you want to run those services; you have to run this command on every node.

4. Now, after installation, you have to format the NameNode and set up HDFS:
sudo /usr/sbin/hadoop-setup-hdfs.sh
5. To start the services, do as below:
  • For single Node
for service in /etc/init.d/hadoop-* ;do sudo  $service  start ; done
  •  For Multinode
    • on Master Node
    sudo  /etc/init.d/hadoop-namenode start
    sudo  /etc/init.d/hadoop-jobtracker start 
    sudo  /etc/init.d/hadoop-secondarynamenode start 
    • on Slave Node
sudo  /etc/init.d/hadoop-datanode start
sudo  /etc/init.d/hadoop-tasktracker start 
6. You can create a user account for yourself on HDFS with the command below:
sudo /usr/sbin/hadoop-create-user.sh -u $USER

Now you can run the word count program as given in the previous post. Please try it out and let me know if you face any issue with this.
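
For reference, a run would look roughly like this. The examples jar path depends on where the RPM placed it, so locate it first, and the HDFS input/output paths below are only placeholders:

rpm -ql hadoop | grep examples
hadoop jar <path-to-hadoop-examples.jar> wordcount /user/$USER/input /user/$USER/output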

Thanks

Thursday, March 20, 2014

Hadoop Installation (CDH4 - Yum installation)


Hi Folks,

Today we are going for the yum installation of CDH4. It's a pretty easy one.

Requirement
  •  Oracle JDK 1.6
  •  CentOS 6.4
Installation

1. Downloading the CDH4 Repo file
sudo wget -O /etc/yum.repos.d/cloudera-cdh4.repo http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
2. Install Cloudera CDH4 (pseudo-distributed configuration)
sudo yum install hadoop-0.20-conf-pseudo
3. Formatting the namenode
sudo -u hdfs hdfs namenode -format
4. Starting HDFS services on the respective nodes

  • Namenode Services on Master Node
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-secondarynamenode start
  • Datanode service on the Master Node (because it's pseudo mode)
sudo service hadoop-hdfs-datanode start
5. Creating HDFS directories on the Master
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user
6. Creating MapReduce directories on the Master node
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chown hdfs:hadoop /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs chown -R mapred /var/lib/hadoop-hdfs/cache/mapred 
 7. Starting MapReduce services on the Master and on the Slaves
  • JobTracker Services on Master Node
     sudo service hadoop-0.20-mapreduce-jobtracker start
  •  TaskTracker Service on master Node
    sudo service hadoop-0.20-mapreduce-tasktracker start
8. Create home directories for users like hdfs and mapred; replace $USER with hdfs and mapred.
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
 9. Update the export in .profile
export HADOOP_HOME=/usr/lib/hadoop
 10. You can check the HDFS directory with:
sudo -u hdfs hadoop fs -ls  /
Try running a sample job with the command below:
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5  10

 NOTE: Please comment if you have any problem with it.

Tuesday, March 18, 2014

Hadoop Installations (Tarball)

Hi Folks,

We have seen that Hadoop can be installed in many ways: RPM, automatic, tarball, yum, etc. In this series of blogs we will do each type of installation one by one.

Let's try the tarball installation.

Requirement 

  • We only require Java installed on the node
  • JAVA_HOME should be set
  • Iptables should be off
  • SELinux should be disabled
  • Ports should be open (9000, 9001, 50010, 50020, 50030, 50060, 50070, 50075, 50090)
Installation

Download the tarball from the official Apache website:

wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz

Untar the archive:

tar -xzvf hadoop-1.0.4.tar.gz

Set up the variables in the user's .profile:

export JAVA_HOME=<PATH TO JDK INSTALLATION>
export HADOOP_HOME=/home/hadoop/project/hadoop-1.0.4
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

Update JAVA_HOME inside hadoop-env.sh at $HADOOP_HOME/conf/hadoop-env.sh.

Configuration

Edit the following files to set the different parameters; these are the minimal configurations for these files.

$HADOOP_HOME/conf/core-site.xml
  • <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://master:9000</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/hdfs-site.xml
  • <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/mapred-site.xml
  • <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>localhost:9001</value>
         </property>
    </configuration>

Update the slaves file at $HADOOP_HOME/conf/slaves; add all the slave entries in this file.

We have to do this on all the nodes to set up the Hadoop cluster. After doing it for all the nodes, we can start the services, once the NameNode has been formatted.

Suppose we have master as the main node, which will act as the Hadoop NameNode; below are the steps we perform on that node.

$HADOOP_HOME/bin/hadoop namenode -format

This will format HDFS, and now we are ready to run the services on all the nodes.

For Master node 

$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode

For SLAVE NODES

$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
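
Alternatively, if passwordless SSH from the master to the slaves is set up and the conf/slaves file above is correct, you can start everything from the master in one go:

$HADOOP_HOME/bin/start-all.sh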

Now we can check the services at the URLs below:

Namenode:- http://master:50070/
Jobtracker:- http://master:50030/
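
As a quick smoke test, you can also run one of the bundled example jobs; the examples jar should sit at the root of the Hadoop 1.0.4 distribution:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar pi 2 10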


This is the simplest and easiest tarball installation of Hadoop. Please comment if you face any issues during installation.

Hadoop Cluster Designing

Hi Folks ,

I remember when I was trying to design my first cluster with several nodes, I did not have much idea about what things to take care of, or what the disk size and RAM size should be; there were many questions in my mind.

I tried to find the basic configuration, as well as specific configurations for IO-intensive and memory-intensive clusters. I read many blogs and books to get an idea about cluster designing and the kinds of loads on clusters. After searching a lot, I came across a few guidelines for cluster designing.

Today I would like to share some of the guidelines I have found and created for cluster designing.

Things to Remembers
  •  Cluster Sizing and Hardware 
    • A large number of nodes rather than a large number of disks per node
    • Multiple racks give multiple failure domains
    • Good commodity hardware
    • Always have a pilot cluster before implementing in production
    • Always look at the load type, e.g. memory- or CPU-intensive
    • Start from basic requirements like 2-4 TB per node (1U with 6 disks or 2U with 12 disks)
  • Networking
    • Always have proper networking between Nodes
    • 1GbE  between the nodes in the Rack
    • 10GbE between the Racks in the cluster
    • Keep clusters isolated from each other for security.
  • Monitoring
    • Always have something for monitoring, like Ganglia, for different metrics
    • Use an alerting system such as Nagios to keep yourself updated when anything goes wrong
    • We can also use Ambari or Cloudera Manager from the different vendors.


Hope you got some idea about Hadoop cluster designing. Now let's move forward to the types of Hadoop installation.

  • Standalone Installation
    • one node cluster running everything on one machine.
    • No daemon process is running.
  • Pseudo Installation
    • one node cluster running everything on one machine
    • NN, DN, JT, TT all run in different JVMs
    • There is only a slight difference between pseudo and standalone installation.
  • Distributed Installation
    • As its says a cluster with multiple nodes.
    • Every daemon process runs on different nodes: DN & TT run on the slave nodes, while NN & JT run on the same or possibly different nodes.
    • We generally use this kind of cluster for POCs and similar work.

Sunday, March 2, 2014

Hadoop Resources - Books

Hello Guys,

I have been thinking about how I can share Hadoop material like books, white papers, and PDFs. A few days back I was looking for a Hadoop book online and was not able to find it; I invested 2-3 hours finding that book.

After wasting my time, I thought why not put all the material I have here so that others can easily get it. So here I am listing the books which you can easily get.