Showing posts with label Structured. Show all posts

Thursday, March 20, 2014

Hadoop Installation (CDH4 - Yum installation)


Hi Folks,

Today we will install CDH4 using yum. It is pretty easy.

Requirements
  •  Oracle JDK 1.6
  •  CentOS 6.4
Installation

1. Downloading the CDH4 Repo file
sudo wget -O /etc/yum.repos.d/cloudera-cdh4.repo http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/cloudera-cdh4.repo
2. Installing the CDH4 pseudo-distributed configuration package
sudo yum install hadoop-0.20-conf-pseudo
3. Formatting the namenode
sudo -u hdfs hdfs namenode -format
4. Starting HDFS services on the respective nodes

  • Namenode Services on Master Node
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-secondarynamenode start
  • DataNode service on the master node (because this is pseudo-distributed mode)
    sudo service hadoop-hdfs-datanode start
5. Creating HDFS directories on the master node
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir /user
6. Creating MapReduce directories on the master node
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs chown hdfs:hadoop /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs chown -R mapred /var/lib/hadoop-hdfs/cache/mapred 
7. Starting MapReduce services (in pseudo mode the master and the slave are the same node)
  • JobTracker Services on Master Node
     sudo service hadoop-0.20-mapreduce-jobtracker start
  •  TaskTracker Service on master Node
    sudo service hadoop-0.20-mapreduce-tasktracker start
8. Creating home directories for users such as hdfs and mapred (replace $USER with each user name in turn)
sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER
9. Add the following export to your .profile
export HADOOP_HOME=/usr/lib/hadoop
10. Checking the HDFS root directory
sudo -u hdfs hadoop fs -ls /
Then try running a sample job with the command below.
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 5  10
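
Step 8 has to be repeated once per user, so here is a small sketch that loops over user names and issues the same two commands from that step. DRY_RUN and the helper names run and create_hdfs_home are made up for this example; by default the script only prints the commands, so nothing touches the cluster until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run and create_hdfs_home are hypothetical names
# introduced for this example, they are not part of CDH4.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Create /user/<name> in HDFS and hand it over to <name>,
# using the same two commands as step 8.
create_hdfs_home() {
    for user in "$@"; do
        run sudo -u hdfs hadoop fs -mkdir "/user/$user"
        run sudo -u hdfs hadoop fs -chown "$user" "/user/$user"
    done
}

create_hdfs_home hdfs mapred
```

Run it once as is to review the printed commands, then again with DRY_RUN=0 to apply them.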

 NOTE: Please leave a comment if you run into any problems.

Tuesday, March 18, 2014

Hadoop Cluster Designing

Hi Folks,

I remember when I was trying to design my first cluster with several nodes. I did not have much idea about what we needed to take care of, or what the disk size and RAM size should be. There were many questions in my mind.

I tried to find the basic configuration and the specific configurations for I/O-intensive and memory-intensive clusters. I read many blogs and books to get an idea about cluster design and the kinds of load a cluster has to handle. After a lot of searching, I came across a few assumptions for cluster design.

Today I would like to share some of the assumptions I have found and worked out for cluster design.

Things to Remember
  • Cluster Sizing and Hardware
    • Prefer a large number of nodes over a large number of disks per node
    • Multiple racks give multiple failure domains
    • Use good commodity hardware
    • Always run a pilot cluster before implementing anything in production
    • Always look at the load type, for example memory-intensive or CPU-intensive
    • Start from basic requirements such as 2-4 TB (1U with 6 disks or 2U with 12 disks)
  • Networking
    • Always have proper networking between the nodes
    • 1GbE between the nodes in a rack
    • 10GbE between the racks in the cluster
    • Keep the cluster isolated from other clusters for security
  • Monitoring
    • Always have something for monitoring, such as Ganglia, to track different metrics
    • Use an alerting system such as Nagios to keep yourself informed when something goes wrong
    • We can also use Ambari or Cloudera Manager, which come from different vendors
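
To make the "more nodes, fewer disks per node" point concrete, here is a rough arithmetic sketch. The function names and all the numbers (10 or 20 nodes, 2 TB disks) are assumptions picked for illustration, not recommendations. The replication factor of 3 is the HDFS default. The estimate ignores space needed for the OS and for intermediate MapReduce data.

```shell
#!/bin/sh
# Sketch only: usable_tb and node_raw_tb are hypothetical helper names.

# Usable HDFS capacity in TB: nodes * disks per node * TB per disk / replication.
usable_tb() {
    echo $(( $1 * $2 * $3 / $4 ))
}

# Raw capacity of a single node in TB, i.e. roughly how much data
# may need to be re-replicated if that node is lost.
node_raw_tb() {
    echo $(( $1 * $2 ))
}

# Layout A: 10 nodes, 2U with 12 disks of 2 TB each.
echo "A usable: $(usable_tb 10 12 2 3) TB, per node: $(node_raw_tb 12 2) TB"
# Layout B: 20 nodes, 1U with 6 disks of 2 TB each.
echo "B usable: $(usable_tb 20 6 2 3) TB, per node: $(node_raw_tb 6 2) TB"
```

Both layouts give the same usable capacity, but in layout B a single node failure takes out half as much data, which is one reason to prefer more nodes.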


I hope you now have some idea about Hadoop cluster design. Next, let us move on to the types of Hadoop installation.

  • Standalone Installation
    • A one-node cluster running everything on one machine.
    • No daemon process is running.
  • Pseudo Installation
    • A one-node cluster running everything on one machine.
    • NN, DN, JT and TT each run in a separate JVM.
    • There is only a slight difference between pseudo and standalone installation.
  • Distributed Installation
    • As the name says, a cluster with multiple nodes.
    • The daemon processes run on different nodes: DN and TT run on the slave nodes, while NN and JT run on the same node or on different nodes.
    • We generally use this kind of cluster for POC (proof of concept) work.
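
A quick way to see which mode a machine is in is to look at which daemons are running as JVMs. The sketch below filters jps-style output (one "PID ClassName" pair per line) for the MRv1 daemon names. The function name list_hadoop_daemons and the sample PIDs are made up for this example. Note that jps normally lists only the JVMs of the calling user, so on a packaged install you will probably need to run it with sudo.

```shell
#!/bin/sh
# Sketch only: list_hadoop_daemons is a hypothetical helper name.
# Reads jps-style lines ("PID ClassName") from standard input and
# prints the Hadoop daemon names it finds, sorted.
list_hadoop_daemons() {
    awk '$2 ~ /^(NameNode|SecondaryNameNode|DataNode|JobTracker|TaskTracker)$/ { print $2 }' | sort
}

# Sample input with made-up PIDs, shaped like pseudo-mode output.
printf '2101 NameNode\n2230 DataNode\n2311 JobTracker\n2402 TaskTracker\n2519 Jps\n' \
    | list_hadoop_daemons
```

On a real machine you would use something like: sudo jps | list_hadoop_daemons. An empty result suggests standalone mode, since no daemon is running there.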

Saturday, March 1, 2014

Hadoop - A Solution to Bigdata


By definition, Hadoop is a software framework written in Java that enables distributed computing on very large data sets across clusters of commodity hardware.

The father of Hadoop is Doug Cutting, who got the idea from the Google File System. Doug started the project a few years back. Now Hadoop is very popular and hot in the market. People are learning it, and companies are taking it up and using it in many areas, as we saw in the last post.

The first Hadoop was Apache Hadoop. As we are all aware, the ASF (Apache Software Foundation) stands behind it; after Doug, the Apache community did most of the patching and made it mature. Now there are many vendors providing their own Hadoop versions. The base is always Apache Hadoop.

 Different Hadoop vendors in the market nowadays.
There are many newcomers entering the market, which shows what the future of Hadoop could be. These vendors each provide something unique. Nowadays I would say each one has its own dependencies and usability.


 
 Let's talk about the Hadoop ecosystem. It consists of different components, all of them top-notch projects in the ASF. I list them below with a brief introduction to each.



HDFS :- The Hadoop Distributed File System is a very reliable, fault-tolerant, high-performance and scalable file system that stores data across commodity hardware. It has a larger block size than normal file systems. It is written in Java.

Hive :- It is basically an interface on top of Hadoop for working with data files in tabular form. Hive is an SQL-based data warehouse (DWH) system that facilitates data summarization, data querying and analysis of large data sets stored in HDFS.

Pig :- Pig is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations such as grouping, joining, and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows.

MapReduce :- MapReduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. It is now the most widely used general-purpose computing model and runtime system for distributed data analytics.
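
To get a feel for the paradigm, here is word count written as a local shell pipeline. This is only an analogy on one machine, not how Hadoop actually executes a job, and the function name word_count is made up for this example. Each stage stands in for one phase: splitting into words is the map, sorting plays the role of the shuffle that brings equal keys together, and counting each group is the reduce.

```shell
#!/bin/sh
# Sketch only: word_count is a hypothetical helper name.
# Reads text from standard input and prints "word<TAB>count" lines.
word_count() {
    tr -s '[:space:]' '\n' |   # "map": emit one word per line
        sort |                 # "shuffle": group identical words together
        uniq -c |              # "reduce": count each group
        awk '{ print $2 "\t" $1 }'
}

printf 'big data big cluster\n' | word_count
```

In a real job the map and reduce stages run in parallel on many nodes and the framework handles the sort and data movement between them.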

HBase :- It is a column-oriented database, able to store millions of columns and billions of rows. It is installed on top of Hadoop and stores structured and unstructured data in key-value format. Its major feature is that it allows values to be updated, with high performance and very low latency. It stores everything as bytes.

Sqoop :- Sqoop is designed for importing data from relational databases into Hadoop clusters and exporting it back. It generates MapReduce jobs in the background to do the import and export. It can also import many tables from a database at once.

Flume :- Flume is a reliable, efficient, distributed system for collecting logs from different sources and storing them in a Hadoop cluster in real time. It has a simple and flexible architecture based on streaming data flows.

Zookeeper :- As the name suggests, it keeps all the animals of the Hadoop ecosystem under control in the zoo. It is basically a coordination system for distributed applications: a centralized service for maintaining configuration information, naming, distributed synchronization and group services.

Oozie :- Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute MapReduce and Pig jobs. Oozie provides an abstraction that batches a set of coordinator applications. It gives the user the power to start, stop, pause and resume a set of jobs.

Mahout :- Mahout is a tool for running different kinds of analytics on Hadoop data. It can run machine learning algorithms on data in the Hadoop file system. With Mahout you can do recommendation, data mining, clustering and so on.

Avro :- Avro is a serialization system that provides dynamic integration with many scripting languages such as Python and Ruby. It supports different file formats and many text encodings.

Chukwa :- It is a data collection system for managing large distributed systems. It helps with displaying, monitoring and analyzing log files.

There are many more, which I will describe in later posts.

Thanks All


  

Bigdata - World of Digital Data

 Definition: A massive amount of data, which can be structured, unstructured or semi-structured.


As we all know, data is growing exponentially day by day, and Doug Cutting found a solution for that, which we call Hadoop. As they say, big data includes 3 V's: velocity, variety and volume. There is one more aspect to look at, veracity, which makes it 4 V's.

Let's have a quick look at them.

4V's

Let's talk about the other aspects of big data, focusing on the things we can do with it.

  • Hardware and software technologies to solve large-volume data problems
  • Scaling out databases
  • Distributed file systems
  • Different analytic systems, such as relational and real-time analytics
 Application Domains of Bigdata
  • Digital marketing optimization
  • Data exploration
  • Fraud detection
  • Social network analysis
  • Log analytics
Use Cases for Bigdata
  • Banking (e.g. Amex)
  • Telecom (e.g. China Mobile)
  • Medical (e.g. NextBio)
  • Social networking (e.g. Facebook)
  • Life science (e.g. Eli Lilly)
  • Retail (e.g. Sears)
  • Energy (e.g. Opower)
  • Travel (e.g. Expedia)
You can find various use cases of Hadoop here.

 Thanks All.