Wednesday, March 26, 2014

Common Errors in Hadoop - Part 1


Error:
10/01/18 10:52:48 INFO mapred.JobClient: Task Id : attempt_201001181020_0002_m_000014_0, Status : FAILED
  java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)


Reason:
1. The log directory might be full; check the number of userlog directories
2. Check the size of the log directories

Solution:
1. Increase the open-file ulimit for the user running Hadoop by adding
* hard nofile 10000 to /etc/security/limits.conf
2. Free up some space by deleting old userlog directories
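A quick way to check both suspects (the log path below is the usual default for a tarball install; adjust it to your own hadoop.log.dir):

    ulimit -n                                  # current open-file limit for this user
    ls $HADOOP_HOME/logs/userlogs | wc -l      # number of userlog directories
    du -sh $HADOOP_HOME/logs                   # total size of the log directory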

Error:
The reducer does not start after the map phase completes: the job shows map 100% and then hangs (in pseudo-distributed mode).

Reason:
Problem with the /etc/hosts file.

Solution:
1. Check /etc/hosts and see whether the hostname is mapped to the machine's external IP;
if it is, remove that entry and map the hostname to the loopback address, 127.0.0.1, instead.
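For a pseudo-distributed setup, /etc/hosts can be as simple as this (the hostname hadoop-node is just an example):

    127.0.0.1   localhost
    127.0.0.1   hadoop-node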

Error:
FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /home/hadoop/mydata/hdfs/
namenode is in an inconsistent state: storage directory does not exist or is not accessible.


Reason:
1. The HDFS namenode directory does not exist, or does not have the correct ownership or permissions.

Solution:
Create the directory if it does not exist, and set its ownership and permissions for the user that runs HDFS.
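A minimal sketch, assuming HDFS runs as the user hadoop and using the directory named in the error message above:

    mkdir -p /home/hadoop/mydata/hdfs/namenode
    chown -R hadoop:hadoop /home/hadoop/mydata/hdfs
    chmod -R 755 /home/hadoop/mydata/hdfs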

Error: 
Job initialization failed: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device at

Reason:
1. The JobTracker's log directory ran out of space.

Solution:
Free up some space in the log directory.

Error:  
Incompatible namespaceIDS in ...: namenode namespaceID = ..., datanode namespaceID = ...

Reason:
Reformatting the namenode creates a new namespaceID, so the namenode and the datanodes become inconsistent.

Solution:
1. Delete the data files under the datanode's dfs.data.dir directory (the default is tmp/dfs/data), or
2. Edit the namespaceID in dfs.data.dir/current/VERSION so that it matches the namenode's (the error message in the log shows both values), or
3. Reassign a new, empty dfs.data.dir directory.
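A sketch of option 2, assuming dfs.data.dir is /home/hadoop/mydata/hdfs/data (substitute your own path):

    cat /home/hadoop/mydata/hdfs/data/current/VERSION   # namespaceID currently stored on the datanode
    vi /home/hadoop/mydata/hdfs/data/current/VERSION    # set namespaceID to the namenode's value from the error, then restart the datanode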

Error:
The Hadoop cluster is started with start-all.sh, but the datanodes on the slaves keep failing to start, and you get the error:
Could only be replicated to 0 nodes, instead of 1 


Reason:
Possibly a duplicate node identifier (personally I am not sure this is the real cause). There may also be other reasons, so try the solutions below one by one.

Solution:
1. If it is a port access problem, make sure the ports are open; for example, for hdfs://machine1:9000, ports 9000, 50030 and 50070 must be reachable. Run: iptables -I INPUT -p tcp --dport 9000 -j ACCEPT. If you then get the error hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection refused, the datanode port cannot be reached; adjust iptables on the datanode: iptables -I INPUT -s machine1 -p tcp -j ACCEPT
2. There may be a firewall restricting communication between the cluster nodes. Try turning it off: /etc/init.d/iptables stop
3. Finally, there may not be enough disk space; check with df -al

Error:
The program fails at run time with:
Error: java.lang.NullPointerException


Reason:
A null pointer exception usually means a bug in the Java program itself: a variable used before it is instantiated, an array index out of bounds, and so on. Review the program. If the job still fails when it runs, make sure of the following:

Solution:
1. The program compiles and runs correctly on its own.
2. In cluster mode, the data to be processed has been written to HDFS and the HDFS path is correct.
3. The entry class name is specified when executing the jar (I do not know why it sometimes also runs without it).
The correct invocation looks like this:
$ hadoop jar myCount.jar myCount input output
4. The Hadoop datanodes are started.

Error:
Unrecognized option: -jvm. Could not create the Java virtual machine.

Reason:
Hadoop installation directory / bin / hadoop following piece of shell:

Solution:   
  CLASS = 'org.apache.hadoop.hdfs.server.datanode.DataNode'
   if [[$ EUID-eq 0]]; then
     HADOOP_OPTS = "$ HADOOP_OPTS-jvm server $ HADOOP_DATANODE_OPTS"
   else
     HADOOP_OPTS = "$ HADOOP_OPTS-server $ HADOOP_DATANODE_OPTS"
   fi
$ EUID user ID, if it is the root of this identification will be 0, so try not to use the root user to operate hadoop .
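A minimal sketch of switching to a dedicated non-root user (the user name and install path below are only examples):

    sudo useradd hadoop
    sudo chown -R hadoop:hadoop /usr/local/hadoop
    sudo -u hadoop /usr/local/hadoop/bin/hadoop-daemon.sh start datanode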

Error:
Terminal error message:
ERROR hdfs.DFSClient: Exception closing file /user/hadoop/musicdata.txt: java.io.IOException: All datanodes 10.210.70.82:50010 are bad. Aborting ...

The jobtracker logs show this error:

Error register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName: getProtocolVersion

Along with possible warnings:

WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Broken pipe
WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_3136320110992216802_1063java.io.IOException: Connection reset by peer
WARN hdfs.DFSClient: Error Recovery for block blk_3136320110992216802_1063 bad datanode [0] 10.210.70.82:50010 put: All datanodes 10.210.70.82:50010 are bad. Aborting ...


Solution:
1. Check whether the disk under the dfs.data.dir property is full; if it is, free some space and retry the hadoop fs -put.
2. If the disk is not full, check whether it has bad sectors; the disk needs to be tested.
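Two quick checks (the mount point and device below are only examples; use the ones behind your dfs.data.dir):

    df -h /data1                      # is the partition holding dfs.data.dir full?
    sudo badblocks -sv /dev/sdb1      # read-only scan for bad sectors (can take a while)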

Error:
Running a Hadoop jar gives the error message:
java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.LongWritable

Or something like this:

Status: FAILED java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Solution:
You need to learn the basics of Hadoop and the MapReduce model: see the Hadoop I/O chapter and the MapReduce Types and Formats chapter in "Hadoop: The Definitive Guide". If you are eager to solve this problem right now, I can also give you a quick fix, but skipping the background is bound to affect your later development.
Ensure the declared types are consistent:

    ... extends Mapper<K1, V1, K2, V2> ...
    public void map(K1 key, V1 value, Context context) ...
    ...
    ... extends Reducer<K2, V2, K3, V3> ...
    public void reduce(K2 key, Iterable<V2> values, Context context) ...
    ...
    job.setMapOutputKeyClass(K2.class);
    job.setMapOutputValueClass(V2.class);
    job.setOutputKeyClass(K3.class);
    job.setOutputValueClass(V3.class);
    ...

Note that each K* and V* must correspond. I still recommend the two chapters mentioned above so that you know the details of how it works.

Error:
If a datanode reports an error like the following:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /data1/hadoop_data. The directory is already locked.

Reason:
According to the error message, the storage directory is locked and cannot be used. Check whether a related Hadoop process is still running on this slave machine, using the Linux commands:

    netstat -nap
    ps aux | grep <related PID>

Solution:
If a Hadoop process is still running, kill it with the kill command, and then rerun start-all.sh.

Error:
If you encounter the following jobtracker error:
ERROR: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

Solution:
Modify the /etc/hosts file on the datanode nodes.
The hosts file format in brief:
each line has three parts: the first part is the IP address, the second the hostname or domain name, the third the host alias. The detailed steps are as follows:

1. First check the hostname:

$ echo -e "`hostname -i`\t`hostname -f`\t$stn"

where $stn is the short name or alias of the host.

The output will be something like this:

10.200.187.77             hadoop-datanode          DN

If the IP address and hostname shown are correct, the hosts file is fine; if not, keep fixing the hosts file.
If the shuffle error still appears after that, some users suggest modifying the hdfs-site.xml configuration file and adding the following property (the port 50070 does not change; put the node's IP address in place of the asterisks, since Hadoop transfers this information over HTTP on that port):

     <property>
         <name>dfs.http.address</name>
         <value>*.*.*.*:50070</value>
     </property>

Error:
If you encounter the following jobtracker error:
ERROR: java.lang.RuntimeException: PipeMapRed.waitOutputThreads (): subprocess failed with code *

Reason:
This is the exit code that the subprocess (the streaming mapper or reducer script) returned to Java; the value of the code tells you what went wrong inside the script.
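The quickest way to see the real failure is to run the streaming script by hand against a sample of the input (mapper.py and sample.txt below are only placeholders for your own script and data):

    cat sample.txt | ./mapper.py > /dev/null
    echo "mapper exit code: $?"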

Sunday, March 23, 2014

Hadoop Installation (type RPM )

Hi Folks,

Today we are going for an RPM installation of Hadoop. It is also as easy as my last Hadoop installation was, so let's try it out.

Requirement
  • Java JDK (download from here)
  • hadoop-0.20.204.0-1.i386.rpm  (Download from here)
Installation

1. Install Java and set JAVA_HOME in /etc/profile with export JAVA_HOME=/usr
sudo ./jdk-6u26-linux-x64-rpm.bin.sh
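If you prefer pointing JAVA_HOME at the actual JDK directory instead of /usr, the lines in /etc/profile would look something like this (the JDK path is an example; use wherever the RPM installed it):

export JAVA_HOME=/usr/java/jdk1.6.0_26
export PATH=$JAVA_HOME/bin:$PATH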
2. Hadoop RPM installation
sudo rpm -i hadoop-0.20.204.0-1.i386.rpm
3. Setting up Single Node cluster
sudo /usr/sbin/hadoop-setup-single-node.sh 
You will be asked many questions while setting up Hadoop, about creating directories and some configuration details; you need to answer them with y.

For a multi-node setup you need to run the commands below.

3. Setting up a Multi-node Cluster
sudo /usr/sbin/hadoop-setup-conf.sh \
  --namenode-host=hdfs://${namenode}:9000/ \
  --jobtracker-host=${jobtracker}:9001 \
  --conf-dir=/etc/hadoop \
  --hdfs-dir=/var/lib/hadoop/hdfs \
  --namenode-dir=/var/lib/hadoop/hdfs/namenode \
  --mapred-dir=/var/lib/hadoop/mapred \
 --mapreduce-user=mapred \
  --datanode-dir=/var/lib/hadoop/hdfs/data \
  --log-dir=/var/log/hadoop \
  --auto
Where $namenode and $jobtracker are the hostnames of the nodes where you want to run the respective services; you have to run this command on every node.

4. Now after installation you have to format the namenode
sudo /usr/sbin/hadoop-setup-hdfs.sh
5. For starting the services, do as below
  • For single Node
for service in /etc/init.d/hadoop-* ;do sudo  $service  start ; done
  •  For Multinode
    • on Master Node
    sudo  /etc/init.d/hadoop-namenode start
    sudo  /etc/init.d/hadoop-jobtracker start 
    sudo  /etc/init.d/hadoop-secondarynamenode start 
    • on Slave Node
sudo  /etc/init.d/hadoop-datanode start
sudo  /etc/init.d/hadoop-tasktracker start 
6. You can create a user account for yourself on HDFS with the command below
sudo /usr/sbin/hadoop-create-user.sh -u $USER

Now you can run the word count program as given in the previous post, for example as sketched below. Please try it out and let me know if you face any issues.
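A quick word count smoke test would look roughly like this (the examples jar path depends on where the RPM placed it, and input.txt is any local text file of yours):

hadoop fs -put input.txt input
hadoop jar /usr/share/hadoop/hadoop-examples-*.jar wordcount input output
hadoop fs -cat output/part-* | head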

Thanks

Thursday, March 20, 2014

Hadoop Installation (CDH4 - Yum installation)


Hi Folks,

Today we are going for a yum installation of CDH4. It's a pretty easy one.

Requirement
  •  Oracle JDK 1.6
  •  CentOS 6.4
Installation

1. Downloading the CDH4 Repo file
sudo wget -O /etc/yum.repos.d/cloudera-cdh4.repo http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
2. Install the Cloudera CDH4 pseudo-distributed configuration package
sudo yum install hadoop-0.20-conf-pseudo
3. Formatting the namenode
sudo -u hdfs hdfs namenode -format
4. Starting HDFS services on the respective nodes

  • Namenode Services on Master Node
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-secondarynamenode start
  • Datanode service on the Master Node (because it is pseudo-distributed mode)
sudo service hadoop-hdfs-datanode start
5. Creating Hdfs Directories on Master
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user
6. Creating Map-reduce Directories on  Master node
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chown hdfs:hadoop /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs chown -R mapred /var/lib/hadoop-hdfs/cache/mapred 
 7. Starting MapReduce services on the master and on the slaves
  • JobTracker Services on Master Node
     sudo service hadoop-0.20-mapreduce-jobtracker start
  •  TaskTracker Service on master Node
    sudo service hadoop-0.20-mapreduce-tasktracker start
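To confirm that the HDFS and MapReduce daemons actually came up, a quick check (jps ships with the JDK; the exact list depends on which services you started):

sudo jps
# expect NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker in the output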
8. Creating home directories for users such as hdfs and mapred; replace $USER below with each username (hdfs and mapred)
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
 9. Add the export below to your .profile
export HADOOP_HOME=/usr/lib/hadoop
 10. You can check the HDFS root directory with
sudo -u hdfs hadoop fs -ls  /
Try running a sample job with the command below.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5  10

 NOTE: Please comment if you have any problems with it.

Tuesday, March 18, 2014

Hadoop Installations (Tarball)

Hi Folks,

We have seen that Hadoop can be installed in many ways: RPM, automated, tarball, yum, etc. In this and the following blogs we will do all the installation types one by one.

Let's try the tarball installation first.

Requirement 

  • We only require Java installed on the node
  • JAVA_HOME should be set.
  • Check iptables (should be off)
  • SELinux should be disabled
  • Ports should be open (9000; 9001; 50010; 50020; 50030; 50060; 50070; 50075; 50090)
Installation

Download the tarball from the Apache official website 

wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz

Untar the tarball

tar -xzvf hadoop-1.0.4.tar.gz

Setting up the variables in .profile of the user

export JAVA_HOME=<path to your JDK installation>
export HADOOP_HOME=/home/hadoop/project/hadoop-1.0.4
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

Update JAVA_HOME inside hadoop-env.sh at $HADOOP_HOME/conf/hadoop-env.sh
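The relevant line inside hadoop-env.sh looks like this (the JDK path is only an example):

export JAVA_HOME=/usr/java/jdk1.6.0_26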

Configuration

Edit the following files to set the different parameters; these are the minimal configurations for these files.

$HADOOP_HOME/conf/core-site.xml
  • <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://master:9000</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/hdfs-site.xml
  • <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/mapred-site.xml
  • <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>master:9001</value>
         </property>
    </configuration>

Update the slaves file at $HADOOP_HOME/conf/slaves and add an entry for every slave node in it, as in the example below.
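A sample slaves file (the hostnames are placeholders for your actual slave nodes):

slave1
slave2
slave3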

We have to do this on all the nodes to set up the Hadoop cluster. After configuring all the nodes, we can format the namenode and then start the services.

Suppose we have 'master' as the main node, which will act as the Hadoop namenode; below are the steps we will perform on that node.

$HADOOP_HOME/bin/hadoop namenode -format

This will format HDFS, and now we are ready to run the services on all the nodes.

For Master node 

$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode

For Slave Nodes

$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

Now we can check the services at the URLs below:

Namenode:- http://master:50070/
Jobtracker:- http://master:50030/


The above is the simplest and easiest tarball installation of Hadoop. Please comment if you have any issues during installation.

Hadoop Cluster Designing

Hi Folks ,

I remember when I was trying to design my first cluster with several nodes, I did not have much of an idea about what things we need to take care of or what the disk size and RAM size should be; there were many questions in my mind.

I tried to find the basic configuration, and specific configurations for I/O-intensive and memory-intensive clusters. I read many blogs and books to get an idea about cluster designing and the kinds of loads on clusters. After searching a lot, I came across a few assumptions for cluster designing.

Today I would like to share with you some assumptions I have found and created for cluster designing.

Things to Remember
  •  Cluster Sizing and Hardware
    • Prefer a larger number of nodes over a larger number of disks per node
    • Multiple racks give multiple failure domains
    • Good commodity hardware
    • Always have a pilot cluster before implementing in production
    • Always look at the type of load, e.g. memory- or CPU-intensive
    • Start from basic requirements like 2-4 TB per node (1U with 6 disks or 2U with 12 disks)
  • Networking
    • Always have proper networking between the nodes
    • 1GbE between the nodes in a rack
    • 10GbE between the racks in the cluster
    • Keep the cluster isolated from other clusters for security.
  • Monitoring
    • Always have something like Ganglia for monitoring different metrics
    • Use an alerting system such as Nagios to keep yourself updated on any mishap
    • We can also use Ambari or Cloudera Manager from the different vendors.


Hope you got some idea about Hadoop cluster designing. Now we will move forward to the types of Hadoop installation.

  • Standalone Installation
    • A one-node cluster running everything on one machine.
    • No daemon processes are running.
  • Pseudo-distributed Installation
    • A one-node cluster running everything on one machine.
    • NN, DN, JT, TT all run in different JVMs.
    • There is only a slight difference between the pseudo-distributed and standalone installations.
  • Distributed Installation
    • As the name says, a cluster with multiple nodes.
    • Every daemon runs on a different node: DN & TT run on the slave nodes, while NN & JT run on the same or possibly different nodes.
    • We generally use this kind of cluster for POC kind of stuff.

Sunday, March 2, 2014

Hadoop Resources - Books

Hello Guys,

I have been thinking about how I can share Hadoop material like books, white papers and PDFs. A few days back I was looking for a Hadoop book online and was not able to find it; I spent 2-3 hours trying to find that book.

After wasting my time, I thought why not put everything I have here so that others can get it easily. So here I am listing the books which you can easily get.

Saturday, March 1, 2014

Hadoop - A Solution to Bigdata


Hadoop, by definition, is an application written in Java which enables distributed computing on very large volumes of data sets running across commodity hardware.

The father of Hadoop is Doug Cutting, who got the idea from the Google File System. Doug started the project a few years back. Now Hadoop is very popular and hot in the market: people are learning it, and companies are taking it up and starting to use it in many areas, as we saw in the last blog.

The first Hadoop was Apache Hadoop; as we are all aware of the ASF (Apache Software Foundation), after Doug it was Apache that did most of the patching and made it mature. Now there are many vendors providing their own Hadoop versions, but the base is always Apache Hadoop.

There are many different Hadoop vendors in the market nowadays, and many new ones keep coming in, which shows what the future of Hadoop will look like. These vendors each provide something unique; nowadays I would say each one has its own dependencies and usability.


 
Let's talk about the Hadoop ecosystem. The Hadoop ecosystem consists of different components, all of which are top-level projects in the ASF. I am listing them below with a brief intro about each.



HDFS :- The Hadoop Distributed File System: very reliable, fault tolerant, high performance and scalable, designed to store data across different commodity hardware. It has a larger block size than normal filesystems. It is written in Java.

Hive :- Basically an interface on top of Hadoop to work with data files in tabular form. Hive is a SQL-based data warehouse (DWH) system that facilitates data summarization, data queries and analysis of large datasets stored in HDFS.

Pig :- Pig is a platform for constructing data flows for extract, transform and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations such as grouping, joining and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows.

MapReduce :- MapReduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. It is now the most widely used general-purpose computing model and runtime system for distributed data analytics.

HBase :- A columnar database able to store millions of columns and billions of rows. It is installed on top of Hadoop and stores structured and unstructured data in key-value form. A major feature is that it allows values to be updated, with high performance at very low latency.
It stores everything as bytes.

Sqoop :- Sqoop is designed for users to import and export data between relational databases and their Hadoop clusters. It generates MapReduce jobs in the background to do the import and export. It can also import from many databases at once.

Flume :- Flume is a very reliable, efficient, distributed system for collecting logs from different sources into a Hadoop cluster in real time. It has a simple and flexible architecture based on streaming data flows.

ZooKeeper :- As the name suggests, it controls all the animals of the Hadoop ecosystem in the zoo. It is basically a coordination system for distributed applications: a centralized service for maintaining configuration information, distributed synchronization, naming, and group services.

Oozie :- Oozie is a server-based workflow engine specialized in running workflow jobs whose actions execute MapReduce and Pig jobs. Oozie provides an abstraction that batches a set of coordinator applications, and gives the user the power to start/stop/pause/resume a set of jobs.

Mahout :- Mahout is a tool for running different kinds of analytics on Hadoop data. It can run machine learning algorithms on data in the Hadoop filesystem. Through Mahout you can do recommendation, data mining, clustering, etc.

Avro :- Avro is a serialization system which provides dynamic integration with many scripting languages like Python, Ruby, etc. It supports different file formats and many text encodings.

Chukwa :- A data collection system for managing large distributed systems; it facilitates displaying, monitoring and analyzing log files.

There are many more, which I will describe in my later posts.

Thanks All


  

Bigdata - World of Digital Data

Definition: A massive amount of data which could be structured, unstructured or semi-structured.


As we all know, data is increasing exponentially day by day, and Doug Cutting found a solution for that which we call HADOOP. As they say, Bigdata has 3 V's, which are Velocity, Variety and Volume; there is one more aspect, Veracity, which makes it 4 V's.

Let's have a quick look at them.

4V's

Let's talk about the other aspects of Bigdata; let us focus more on the things we can do with Bigdata.

  • Hardware and software technologies to solve large-volume data problems
  • Database scaling-out
  • Distributed file systems
  • Different analytic systems, like relational and real-time analytics
 Application Domains of BigData
  • Digital Market Optimizations 
  • Data Exploration 
  • Fraud Detection
  • Social Networking Analysis
  • Log Analytics
Use Cases for Bigdata
  • Banking (Ex. Amex)
  • Telecom (Ex. China Mobile)
  • Medical (Ex. NextBio)
  • Social Networking (Ex. Facebook)
  • Life Science (Ex. Eli Lilly)
  • Retail (Ex. Sears)
  • Energy (Ex. Opower)
  • Travel (Ex. Expedia)
You can find various use cases of Hadoop here.

 Thanks All.