Big Data Board: 2015

Saturday, July 11, 2015

Storm Installation on CentOS

Hi folks. Storm is one of the leading real-time processing systems these days, and companies have started using it at large scale. It is a distributed real-time computation system in the same space as Spark.

You may have heard of Storm being integrated into many data pipelines for data processing. One feature sets it apart from the others: once it starts, it keeps running until you kill it.

The basic difference from Hadoop is that Storm is built for real-time processing, whereas Hadoop is built for batch processing.

So let's see how we can get it up and running on our system.

Step 1:- Download Storm and unzip it. This 0.8.1 release predates the Apache project, so it is fetched from GitHub. You will find a few folders and the Storm jar.

$ wget https://github.com/downloads/nathanmarz/storm/storm-0.8.1.zip
$ unzip storm-0.8.1.zip

[storm@kafka ~]$ ll storm-0.8.1
total 4780
drwxr-xr-x. 2 storm storm    4096 Sep 6 2012 bin
-rw-r--r--. 1 storm storm   19981 Sep 6 2012 CHANGELOG.md
drwxr-xr-x. 2 storm storm    4096 Jun 25 08:19 conf
drwxrwxr-x. 4 storm storm    4096 Jun 25 05:24 data
drwxr-xr-x. 2 storm storm    4096 Sep 6 2012 lib
-rw-r--r--. 1 storm storm   12710 Sep 6 2012 LICENSE.html
drwxr-xr-x. 2 storm storm    4096 Sep 6 2012 log4j
drwxr-xr-x. 2 storm storm    4096 Jun 25 08:21 logs
-rw-------. 1 storm storm   25640 Jun 25 07:32 nohup.out
drwxr-xr-x. 4 storm storm    4096 Sep 6 2012 public
-rw-r--r--. 1 storm storm    3730 Sep 6 2012 README.markdown
-rw-r--r--. 1 storm storm       6 Sep 6 2012 RELEASE
-rw-r--r--. 1 storm storm 4789764 Sep 6 2012 storm-0.8.1.jar

Step 2:- Download ZeroMQ from its official site

$ wget http://download.zeromq.org/zeromq-2.1.7.zip

$ unzip zeromq-2.1.7.zip

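You will see a bunch of files, including the configure script. Now build and install it with configure and make. The install step puts libzmq where the JZMQ build in the next step can find it.

$ cd zeromq-2.1.7

$ ./configure && make && sudo make install

If the build fails, run the commands below to install the required libraries, then build again.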

sudo yum install libuuid*
sudo yum install uuid-*
sudo yum install gcc-*
sudo yum install git
sudo yum install libtool*

Step 3:- With ZeroMQ in place, we need JZMQ, the Java binding for ZeroMQ, from GitHub.

$ git clone https://github.com/nathanmarz/jzmq.git
$ cd jzmq
$ sed -i 's/classdist_noinst.stamp/classnoinst.stamp/g' src/Makefile.am
$ ./autogen.sh
$ ./configure && make && sudo make install

Step 4:- Download ZooKeeper from an Apache mirror

$ wget http://www.webhostingreviewjam.com/mirror/apache/zookeeper/stable/zookeeper-3.4.6.tar.gz
$ tar -xzf zookeeper-3.4.6.tar.gz
$ mkdir zookeeper-3.4.6/data

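Now update conf/zoo.cfg with the data folder and port number. Use an absolute path for dataDir, because ZooKeeper does not expand ~.

dataDir=/home/zookeeper/zookeeper-3.4.6/data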
# the port at which the clients will connect
clientPort=2181
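
zoo.cfg is read as a Java properties file, so shell shortcuts such as ~ are not expanded in dataDir and an absolute path is the safer choice. As a minimal sketch, the helper below writes a standalone configuration. The function name write_zoo_cfg is hypothetical, and the tickTime value is an assumption taken from ZooKeeper's sample configuration rather than from the steps above.

```shell
# Hypothetical helper: write a minimal standalone conf/zoo.cfg.
# $1 = ZooKeeper install directory, e.g. /home/zookeeper/zookeeper-3.4.6
write_zoo_cfg() {
    mkdir -p "$1/conf" "$1/data"
    cat > "$1/conf/zoo.cfg" <<EOF
# length of one tick in milliseconds (assumed, from zoo_sample.cfg)
tickTime=2000
# absolute path, because ZooKeeper does not expand ~
dataDir=$1/data
# the port at which the clients will connect
clientPort=2181
EOF
}
```

For example, write_zoo_cfg /home/zookeeper/zookeeper-3.4.6 would produce a file whose dataDir is /home/zookeeper/zookeeper-3.4.6/data.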

Step 5:- Update the Storm configuration file, conf/storm.yaml, with the entries below. Comments in YAML start with #.

$ vi storm-0.8.1/conf/storm.yaml

########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
  - "192.168.99.141" # your IP address
storm.zookeeper.port: 2181
nimbus.host: "192.168.99.141" # your IP address
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
nimbus.thrift.port: 8627
ui.port: 8772
storm.local.dir: "/home/storm/storm-0.8.1/data" # your data dir path
java.library.path: "/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703

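Step 6:- Start the services: nimbus, supervisor and ui

Before that, add the lines below to your ~/.bash_profile (on CentOS, bash ignores ~/.profile when ~/.bash_profile exists), then log in again or source the file: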

export STORM_HOME="/home/storm/storm-0.8.1"
export JAVA_HOME="/usr"
export PATH=$STORM_HOME/bin:$JAVA_HOME/bin:$PATH
export ZOOKEEPER_HOME="/home/zookeeper/zookeeper-3.4.6"
export PATH=$ZOOKEEPER_HOME/bin:$PATH
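
After reloading the profile, it helps to confirm that the commands used in the next step actually resolve on PATH. The snippet below is a small sketch of such a check; check_on_path is a hypothetical helper name, not something shipped with Storm or ZooKeeper.

```shell
# Hypothetical helper: report any of the named commands missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_on_path() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not on PATH: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_on_path java storm zkServer.sh
```

If anything is reported missing, recheck the STORM_HOME and ZOOKEEPER_HOME values before starting the services.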

Now start the services in the background.

$ zkServer.sh start
$ nohup storm nimbus &
$ nohup storm supervisor &
$ nohup storm ui &

You can now check that the services are running with jps (core is the Storm UI process and QuorumPeerMain is ZooKeeper):

[storm@kafka ~]$ jps
3354 core
3247 nimbus
3440 Jps
3332 supervisor
3083 QuorumPeerMain

You can view the web UI at http://localhost:8772, the ui.port set in storm.yaml.
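
Nimbus and the supervisor both need ZooKeeper to be reachable, so starting everything back to back can race on a slow machine. One way to handle that, sketched below, is a small retry helper. The name wait_until is hypothetical, and the usage shown in the comment assumes nc is installed and relies on ZooKeeper's ruok four-letter command, which answers imok when the server is healthy.

```shell
# Hypothetical helper: retry a command until it succeeds.
# $1 = maximum attempts, $2 = seconds to sleep between attempts,
# remaining arguments = the command to run
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Example (assumes nc is installed):
#   zkServer.sh start
#   wait_until 30 1 sh -c 'echo ruok | nc localhost 2181 | grep -q imok' || exit 1
#   nohup storm nimbus &
```

The helper only decides when to proceed; the Storm commands themselves stay exactly as in step 6.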

Monday, July 6, 2015

Kafka Implementation on CentOS

Hi folks, here I am going to show how to set up Kafka on a Linux host and start using it. Kafka installation is pretty simple and can be done in a matter of minutes.

Let's first understand what Kafka is and why we use it, starting with some basics.

Kafka:- It is a distributed messaging system with separate components to produce (publish) and consume message streams. It is fault tolerant and delivers consistently high throughput, which helps even when you have to process live streams of terabytes of data. So let's look at its architecture.

ZooKeeper:- ZooKeeper is used to maintain the state of processes and jobs. In this case it stores and updates the consumed message offsets, the broker addresses and so on. ZooKeeper is required to run Kafka on a machine.

Producer:- A producer publishes messages to a topic and sends them to the broker for further processing.

Broker:- The broker stores the data written by producers and can serve multiple reads and writes at the same time.

Consumer:- A consumer polls the broker for messages and processes them.
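
To make these roles concrete, here is a toy model in plain shell. It is not Kafka and not how Kafka is implemented: a single file stands in for one partition of a topic, produce appends a message to it, and consume reads from a given offset. The function names are hypothetical. The point it illustrates is that reading does not remove messages, so each consumer can read from its own offset.

```shell
# Toy model only: one file stands in for one partition of a topic.
# produce appends a message; its offset is its zero-based line position.
# $1 = log file, $2 = message
produce() {
    printf '%s\n' "$2" >> "$1"
}

# consume prints every message at or after the given offset.
# Nothing is deleted, so another consumer can still read from offset 0.
# $1 = log file, $2 = zero-based offset to start reading from
consume() {
    tail -n "+$(( $2 + 1 ))" "$1"
}

# Example:
#   produce /tmp/test.log "first"
#   produce /tmp/test.log "second"
#   consume /tmp/test.log 1    # prints only "second"
```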

Installation of Kafka

1. Download Kafka from its official site.

2. Untar it somewhere convenient; I used my user's home directory.

$ tar -xvzf kafka_2.10-0.8.2.0.tgz

$ mv kafka_2.10-0.8.2.0 kafka-0.8.2

3. Now you have to start ZooKeeper, which Kafka needs in order to run, so just run the commands below

$ cd kafka-0.8.2

$ bin/zookeeper-server-start.sh /home/kafka/kafka-0.8.2/config/zookeeper.properties

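Note:- Edit the dataDir entry in config/zookeeper.properties to define the data directory before starting ZooKeeper.

4. Now that ZooKeeper is started, you can start your Kafka server, which is just as simple. The ZooKeeper command above stays in the foreground, so run this from a second terminal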

$ bin/kafka-server-start.sh /home/kafka/kafka-0.8.2/config/server.properties

5. Now that the server is started, let's create a sample topic on the broker

$ bin/kafka-topics.sh --create --zookeeper kafka:2181 --replication-factor 1 --partitions 1 --topic test

Here kafka:2181 is the ZooKeeper host and port, the replication factor and partition count are both 1, and the topic name is test.

6. Now you can check whether the topic was created with the command below

$ bin/kafka-topics.sh --list --zookeeper kafka:2181

7. There are many ways to feed in data, such as the command line or an automated live feed

// You can feed data from the command line as shown below. Each line you type is sent as one message.

$ bin/kafka-console-producer.sh --broker-list kafka:9092 --topic test

8. Now you can read back the data you entered with the command above

$ bin/kafka-console-consumer.sh --zookeeper kafka:2181 --topic test --from-beginning

The output will be the same words or data you entered in step 7.

9. You can see the details of the topic you created with the command below

$ bin/kafka-topics.sh --describe --zookeeper kafka:2181 --topic test
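
Steps 7 and 8 can also be run without typing into the console, which is handy for a quick smoke test. The sketch below wraps the same two console scripts. The helper names send_message and read_first are hypothetical, KAFKA_HOME defaults to the install path used above, and the --max-messages option is assumed to be available in this Kafka version's console consumer so that it exits instead of waiting for more data.

```shell
# Assumed install location from step 2; override KAFKA_HOME if yours differs.
KAFKA_HOME="${KAFKA_HOME:-/home/kafka/kafka-0.8.2}"

# Hypothetical helper: publish one message without typing it interactively.
# $1 = topic, $2 = message
send_message() {
    printf '%s\n' "$2" |
        "$KAFKA_HOME/bin/kafka-console-producer.sh" \
            --broker-list kafka:9092 --topic "$1"
}

# Hypothetical helper: print the first N messages of a topic, then exit.
# $1 = topic, $2 = number of messages to read
read_first() {
    "$KAFKA_HOME/bin/kafka-console-consumer.sh" \
        --zookeeper kafka:2181 --topic "$1" \
        --from-beginning --max-messages "$2"
}

# Example:
#   send_message test "hello kafka"
#   read_first test 1
```

Because read_first starts from the beginning of the topic, it shows the oldest messages first; on a freshly created topic that is the message just sent.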

That is all for now. Please comment if you have any queries or doubts.