
Monday, July 6, 2015

Kafka Implementation on CentOS


Hi Folks, here I am going to show how to set up Kafka on a Linux host and make use of it. Kafka installation is pretty simple and can be done in a matter of a few minutes.

Let's first understand what Kafka is and why we use it, along with some basics about Kafka.

Kafka:- It is a distributed messaging system, where we have different components to produce/publish and consume message streams. It is fault tolerant and consistent with high throughput, and it is very helpful even if you have to process a live stream of TBs of data. So let's see its architecture.






Zookeeper:- As we all know, it is used for maintaining the state of processes and jobs; in this case it is used for maintaining and updating the consumed message offsets, storing the broker addresses, etc. ZooKeeper is required to run Kafka on the machine.

Producer:- Producers create topics and send the messages for those topics to the broker for further processing.

Broker:- The broker stores the data written by producers; it can serve multiple reads and writes at a time.

Consumer:- Consumers poll messages from the broker and use them.

Installation of Kafka

1. Download Kafka from the official site or click here.

2. Untar it somewhere, like I did in my user's home directory:

$  tar -xvzf kafka_2.10-0.8.2.0.tgz
$  mv kafka_2.10-0.8.2.0   kafka-0.8.2

3. Now you have to start ZooKeeper, which is required for running Kafka, so just run the commands below:

$ cd kafka-0.8.2
$ bin/zookeeper-server-start.sh /home/kafka/kafka-0.8.2/config/zookeeper.properties

Note:- you need to edit your zookeeper.properties to define the data directory.
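For reference, a minimal zookeeper.properties looks like the snippet below; the dataDir path here is only an example, point it at any directory the kafka user can write to.

dataDir=/home/kafka/zookeeper-data
clientPort=2181
maxClientCnxns=0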

4. Now that ZooKeeper is started, you can start your Kafka server, which is also pretty simple:

$ bin/kafka-server-start.sh /home/kafka/kafka-0.8.2/config/server.properties
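If you need to change the broker settings, the important entries live in config/server.properties. The values below are the stock defaults shipped with Kafka 0.8.2, except zookeeper.connect, which I point at the same host used in the next step; adjust hostnames and paths for your box.

broker.id=0
port=9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=kafka:2181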

5. Now that the server is started, we can create a sample topic and publish to the broker. So let's create a sample topic:

$ bin/kafka-topics.sh --create --zookeeper kafka:2181 --replication-factor 1 --partitions 1 --topic test

where kafka:2181 is the ZooKeeper host and port, the replication factor and partition count are 1, and the topic name is test.

6. Now you can check whether the topic was created using the command below:

$ bin/kafka-topics.sh --list --zookeeper kafka:2181

7. There are many ways to feed data into the topic, such as from the command line or from an automated live feed.

// You can feed data from the command line with the command below.

$  bin/kafka-console-producer.sh --broker-list kafka:9092 --topic test
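For example, once the console producer is running, every line you type is sent to the broker as a separate message for the test topic (the lines below are just sample input; press Ctrl+C to exit):

hello kafka
this is my first message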

8. Now you can read back whatever data you entered in the command above:

$ bin/kafka-console-consumer.sh --zookeeper kafka:2181 --topic test --from-beginning

The output will be the same words or data entered in step 7.

9. You can see the details of the topic you have created with the command below:

$  bin/kafka-topics.sh --describe --zookeeper kafka:2181 --topic test


That is all for now; please comment if you have any queries or doubts.


Saturday, September 27, 2014

Spark 1.0.x on YARN

Hi Folks, I have tried to set up Spark on Hadoop 2.x with YARN. It was great, because Spark supports not only the MapReduce paradigm but also YARN and other paradigms.

I will write more about Spark in my next blog; here let me show you how we can set up Spark on a YARN cluster.

1. Create a four-node cluster with HDP 2.x and MRv2 (YARN).

Here is the link to install it on Mac; you can follow the same steps to install it on any Linux system.

Ex.
IP                        Role
192.168.1.101 (Node1)     ActiveNameNode, RM
192.168.1.102 (Node2)     StandbyNameNode, Master, Worker
192.168.1.103 (Node3)     DataNode, Worker
192.168.1.104 (Node4)     DataNode, Worker

2. Spark's common installation and deployment models are Spark on YARN and Standalone; both can be used simultaneously.

In Spark we have a master and workers which run on the cluster to perform the tasks on YARN. Here we run the master on Node2 and workers on Node2, 3 and 4.

You need to download the correct versions of MRv2 and Spark; they should be compatible with each other, or else you will run into compatibility issues.

So download Spark and place it inside any directory; in this case I put it inside the home directory of the mapred user.

$    cd /home/mapred/  ;  tar -xvf spark-1.0.2-bin-hadoop2.tar

3. To deploy this model, you need to modify the spark-env.sh file in the conf directory.
Add the following configuration options:
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoop/lib/native/
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=400M
export SPARK_DRIVER_MEMORY=400M
export SPARK_YARN_APP_NAME="Spark 1.0.2"
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hadoop/lib/hadoop-lzo-0.5.0.jar:/usr/lib/hadoop/lib/
4. Copy the same configuration to all worker nodes and start the services as below.

On the Spark Master node (Node2)

 $  $SPARK_HOME/sbin/start-master.sh 

On the DataNode/worker nodes

$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://node2:7077 &
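Alternatively (a sketch, assuming password-less SSH from the master to the worker nodes), you can list the worker hostnames node2, node3 and node4, one per line, in $SPARK_HOME/conf/slaves on the master and let Spark start all the workers in one go:

$ $SPARK_HOME/sbin/start-slaves.sh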


5. Now you can run the sample Pi program:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1  /home/mapred/spark-1.0.2-bin-hadoop2/lib/spark-examples-1.0.2-hadoop2.2.0.jar 10

You can check the result in the application logs via the ResourceManager UI.
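If you want to see the Pi estimate printed directly in your terminal instead of digging through the YARN application logs, the same job can be submitted in yarn-client mode (a sketch using the same jar and memory settings as above):

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1  /home/mapred/spark-1.0.2-bin-hadoop2/lib/spark-examples-1.0.2-hadoop2.2.0.jar 10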



Tuesday, March 18, 2014

Hadoop Installations (Tarball)

Hi Folks,

We have seen that Hadoop can be installed in many ways: RPM, automated installers, tarball, Yum, etc. In this blog series we will do all the types of installation one by one.

Let's try the tarball installation first.

Requirements

  • We only require Java installed on the node
  • JAVA_HOME should be set
  • iptables should be off
  • SELinux should be disabled
  • Ports should be open (9000, 9001, 50010, 50020, 50030, 50060, 50070, 50075, 50090)
Installation

Download the tarball from the Apache official website 

wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz

Untar the installation

tar -xzvf hadoop-1.0.4.tar.gz

Set up the variables in the user's .profile

export JAVA_HOME=<path to JDK installation>
export HADOOP_HOME=/home/hadoop/project/hadoop-1.0.4
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

Update JAVA_HOME inside hadoop-env.sh at $HADOOP_HOME/conf/hadoop-env.sh.
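For example (the JDK path below is only an illustration; use the location where your JDK is actually installed):

export JAVA_HOME=/usr/lib/jvm/java-1.7.0    # example path, adjust for your node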

Configuration

Edit the following files to set the different parameters; these are the minimal configurations for these files.

$HADOOP_HOME/conf/core-site.xml
  • <configuration>
         <property>
             <name>fs.default.name</name>
             <value>hdfs://master:9000</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/hdfs-site.xml
  • <configuration>
         <property>
             <name>dfs.replication</name>
             <value>1</value>
         </property>
    </configuration>
$HADOOP_HOME/conf/mapred-site.xml
  • <configuration>
         <property>
             <name>mapred.job.tracker</name>
             <value>master:9001</value>
         </property>
    </configuration>

Update the slaves file at $HADOOP_HOME/conf/slaves; add an entry for every slave node in this file (see the example below).
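For example, if the cluster has two slave nodes with hostnames slave1 and slave2 (hypothetical names), the slaves file is simply one hostname per line:

slave1
slave2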

We have to do this on all the nodes to set up the Hadoop cluster. After doing it for all the nodes, we can start the services, after formatting the NameNode first.

Suppose master is the main node which will act as the Hadoop NameNode; below are the steps we will perform on that node.

$HADOOP_HOME/bin/hadoop namenode -format

This will format HDFS, and now we are ready to run the services on all the nodes.

For Master node 

$HADOOP_HOME/bin/hadoop-daemon.sh start namenode
$HADOOP_HOME/bin/hadoop-daemon.sh start jobtracker
$HADOOP_HOME/bin/hadoop-daemon.sh start secondarynamenode

For SLAVE NODES

$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
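Once the daemons are up, a quick sanity check on any node is the jps command, which lists the running Java processes; on the master you should see NameNode, JobTracker and SecondaryNameNode, and on the slaves DataNode and TaskTracker (the PIDs below are just illustrative):

$ jps
2481 NameNode
2635 JobTracker
2702 SecondaryNameNode
2811 Jps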

Now we can check the services at the URLs below:

Namenode:- http://master:50070/
Jobtracker:- http://master:50030/


Above is the simplest and easiest tarball installation of Hadoop. Please comment if you face any issue during installation.

Saturday, March 1, 2014

Hadoop - A Solution to Bigdata


Hadoop, by definition, is a framework written in Java which enables distributed computing on very large volumes of data across commodity hardware.

The father of Hadoop is Doug Cutting, who got the idea from the Google File System. Doug started the project a few years back. Now Hadoop is very popular and hot in the market: people are learning it, and companies are taking it up and starting to use it in many areas, as we have seen in the last blog.

The first Hadoop was Apache Hadoop. As we are all aware of the ASF (Apache Software Foundation), after Doug, Apache did most of the patching and matured it. Now there are many vendors who provide their own Hadoop versions, but the base is always Apache Hadoop.

Different Hadoop vendors in the market nowadays.
There are many newcomers arriving in the market, which shows what the future of Hadoop will be. Each of these vendors provides something unique; nowadays I would say each has its own dependencies and usability.


 
Let's talk about the Hadoop ecosystem. The Hadoop ecosystem consists of different components, all of them top-level projects in the ASF. I am listing them below with a brief intro about each.



Hdfs :- It is the Hadoop Distributed File System: very reliable, fault tolerant, high performance and scalable, facilitating data storage across different commodity hardware. It has a larger block size than normal filesystems. It is written in Java.
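To get a feel for it, HDFS is used through the hadoop fs shell; a few illustrative commands (paths are examples):

$ hadoop fs -mkdir /user/hadoop/input
$ hadoop fs -put localfile.txt /user/hadoop/input/
$ hadoop fs -ls /user/hadoop/input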

Hive :- It is basically an interface on top of Hadoop to work with data files in tabular form. Hive is a SQL-based DWH (data warehouse) system that facilitates data summarization, data querying and analysis of large data sets stored in HDFS.
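A quick illustration of the SQL-style access Hive gives you (the employees table and its columns are hypothetical):

$ hive -e "SELECT dept, COUNT(*) FROM employees GROUP BY dept;"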

Pig :- Pig is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. Pig Latin, the programming language for Pig, provides common data manipulation operations such as grouping, joining, and filtering. Pig generates Hadoop MapReduce jobs to perform the data flows.
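A tiny Pig Latin sketch of such a data flow (the file path and field name are hypothetical):

$ pig -e "logs = LOAD '/user/hadoop/logs' AS (line:chararray); errors = FILTER logs BY line MATCHES '.*ERROR.*'; DUMP errors;"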

MapReduce :- MapReduce is a data processing paradigm for condensing large volumes of data into useful aggregated results. It is now the most widely used, general-purpose computing model and runtime system for distributed data analytics.
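The classic way to try it is the wordcount example that ships with Hadoop (the jar name matches the 1.0.4 tarball used earlier in this blog; input and output are HDFS paths of your choice):

$ hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar wordcount /user/hadoop/input /user/hadoop/output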

Hbase :- It is a columnar database, able to store millions of columns and billions of rows. It is installed on top of Hadoop and stores structured and unstructured data in key/value format. A major feature is that it allows updating values, with high performance and very low latency. It stores everything as bytes.
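A short HBase shell session shows the key/value style of access (the table, column family and values are just examples):

$ hbase shell
create 't1', 'cf'
put 't1', 'row1', 'cf:c1', 'value1'
get 't1', 'row1'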

Sqoop :- Sqoop is basically designed to let users import and export data between relational databases and their Hadoop clusters. It generates MapReduce jobs in the background to do the import and export. It can also import from many databases at once.
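A typical Sqoop import looks like the command below (the JDBC URL, table name and credentials are placeholders); Sqoop turns it into a MapReduce job that writes the rows into HDFS:

$ sqoop import --connect jdbc:mysql://dbhost/sales --table orders --username dbuser -P --target-dir /user/hadoop/orders -m 4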

Flume :- Flume is a very reliable, efficient, distributed system to collect logs from different sources and store them into a Hadoop cluster in real time. It has a simple and flexible architecture based on streaming data flows.

Zookeeper :- As the name suggests, it controls all the animals of the Hadoop ecosystem in the zoo. It is basically a coordination system for distributed applications: a centralized service for maintaining configuration information, distributed synchronization, naming, and group services.

Oozie :- Oozie is a server-based workflow engine specialized in running workflow jobs with actions that execute MapReduce and Pig jobs. Oozie provides an abstraction that can batch a set of coordinator applications, and it gives the user the power to start/stop/pause/resume a set of jobs.

Mahout :- Mahout is a tool to run different kinds of analytics on Hadoop data. It is able to run machine learning algorithms on data in the Hadoop filesystem. Through Mahout you can do recommendation, data mining, clustering, etc.

Avro :- Avro is a serialization system which provides dynamic integration with many scripting languages like Python, Ruby, etc. It supports different file formats and many text encodings.

Chukwa :- It is a data collection system for managing large distributed systems; it facilitates displaying, monitoring and analyzing the collected log files.

There are many more, which I will describe in my later posts.

Thanks All


  

Bigdata - World of Digital Data

Definition: A massive amount of data which could be structured, unstructured or semi-structured.


As we all know, data is increasing exponentially day by day, and Doug Cutting found a solution for that which we call HADOOP. As they say, Bigdata includes the 3 V's, which are Velocity, Variety and Volume; there is one more aspect, Veracity, which makes it the 4 V's.

Let's have a quick look at it.

4V's

Let's talk about the other aspects of Bigdata, and focus on the things we can do with it.

  • Hardware and software technologies to solve large-volume data problems
  • Database scaling-out
  • Distributed file systems
  • Different analytic systems, like relational and real-time analytics
 Application Domains of BigData
  • Digital Market Optimizations 
  • Data Exploration 
  • Fraud Detection
  • Social Networking Analysis
  • Log Analytics
Use Cases for Bigdata
  • Banking   (Ex Amex )
  • Telecom  (Ex china mobile)
  • Medical   (Ex. NextBio)
  • Social Networking    (Ex Facebook)
  • Life Science  (Ex Eli-lilly)
  • Retail   (Ex Sears)
  • Energy    (Ex Opower)
  • Travel   (Ex Expedia)
You can find various use cases of Hadoop here.

 Thanks All.