Saturday, September 27, 2014

Spark 1.0.x on YARN

Hi folks, I have tried setting up Spark on Hadoop 2.x, which comes with YARN. It was great because Spark is not limited to MapReduce; it also runs on YARN and other cluster managers.

I will write more about Spark itself in my next blog post; here, let me show you how to set up Spark on a YARN cluster.

1. Create a four-node cluster with HDP 2.x and MRv2 (YARN)

Here is the link to install it on a Mac; you can follow the same steps to install it on any Linux system.

Ex.

IP                         Role
192.168.1.101 (Node1)      Active NameNode, RM
192.168.1.102 (Node2)      Standby NameNode, Spark Master, Worker
192.168.1.103 (Node3)      DataNode, Worker
192.168.1.104 (Node4)      DataNode, Worker
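Since later commands refer to the machines by hostname (for example spark://node2:7077), every node should be able to resolve those names. A minimal /etc/hosts sketch, assuming the hostnames node1 through node4 (adjust to whatever your hosts are actually called):

192.168.1.101  node1
192.168.1.102  node2
192.168.1.103  node3
192.168.1.104  node4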

2. Spark's common deployment models are Spark on YARN and Standalone, and both can be used on the same cluster at the same time.

In Spark we have a master and workers that run on the cluster to perform tasks. Here we run the master on Node2 and workers on Node2, Node3 and Node4.
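Either deployment model can be targeted from the same installation just by changing the master URL. A rough sketch with spark-shell, assuming the Spark 1.0 shell accepts the same --master flag as spark-submit and that the standalone master from step 4 is running on Node2:

$ ./bin/spark-shell --master spark://node2:7077    # standalone cluster
$ ./bin/spark-shell --master yarn-client           # Spark on YARN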

You need to download matching versions of MRv2 and Spark; they should be compatible with each other, otherwise you will run into compatibility issues.

So download Spark and place it in any directory; in this case I put it inside the home directory of the mapred user.

$ ls /home/mapred/spark-1.0.2-bin-hadoop2.tar
$ cd /home/mapred/ ; tar -xvf spark-1.0.2-bin-hadoop2.tar
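If you still need to fetch the tarball, the download would look roughly like this; the mirror URL and exact archive name are assumptions, so adjust them to the release you actually want:

# Assumed URL/file name for the prebuilt Spark 1.0.2 package for Hadoop 2;
# verify against the Apache archive before using.
$ wget https://archive.apache.org/dist/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz -P /home/mapred/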

3. To deploy this model, you need to modify the spark-env.sh file in the conf directory.
Add the following configuration options to it:

export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/lib/hadoop/lib/native/
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_EXECUTOR_INSTANCES=2
export SPARK_EXECUTOR_CORES=1
export SPARK_EXECUTOR_MEMORY=400M
export SPARK_DRIVER_MEMORY=400M
export SPARK_YARN_APP_NAME="Spark 1.0.2"
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/lib/hadoop/lib/hadoop-lzo-0.5.0.jar:/usr/lib/hadoop/lib/
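
Note: if conf/spark-env.sh does not exist yet, create it first from the template that ships with the Spark distribution and then add the lines above (the extracted directory name assumed here follows the tarball name):

$ cd /home/mapred/spark-1.0.2-bin-hadoop2/conf
$ cp spark-env.sh.template spark-env.sh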

4. Copy the same configuration to all worker nodes and start the services as shown below.

On the Spark master node (Node2, the standby NameNode)

$ $SPARK_HOME/sbin/start-master.sh

On the DataNode/slave nodes

$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://node2:7077 &
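
Alternatively, if password-less SSH from the master to every worker is set up, you can list the workers in conf/slaves and start them all from the master node in one command (a sketch assuming the hostnames node2, node3 and node4 resolve on every machine):

$ cat $SPARK_HOME/conf/slaves
node2
node3
node4
$ $SPARK_HOME/sbin/start-slaves.sh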


5. Now you can run the sample Pi program:

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
      --master yarn-cluster \
      --num-executors 3 \
      --driver-memory 512m \
      --executor-memory 512m \
      --executor-cores 1 \
      /home/mapred/spark1.0.2/lib/spark-examples-1.0.2-hadoop2.2.0.jar 10
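
Because the job runs in yarn-cluster mode, the "Pi is roughly ..." line ends up in the application's YARN logs rather than on your terminal. Assuming log aggregation is enabled, you can pull it back with the yarn CLI (the application ID below is a placeholder; use the one printed by spark-submit or shown in the ResourceManager UI):

# Placeholder application ID -- substitute the real one for your job.
$ yarn logs -applicationId application_1411800000000_0001 | grep "Pi is roughly"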

You can also check the application's status and result on the ResourceManager web UI (by default at http://<RM-host>:8088, i.e. the RM running on Node1 in this cluster).