This is a step-by-step guide to installing and configuring an Apache Spark cluster on Linux.
Prerequisites
Hardware requirements
- 8+ GB RAM.
- 4-8 disks per node, configured without RAID.
- 8+ cores per node.
Software requirements
- CentOS 7/RHEL, 64-bit operating system.
- Java SE Development Kit 8 or greater.
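- To verify that a suitable JDK is already installed, you can run, for example:
$ java -version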
Create a User for Spark
- As root, create a user called spark.
$ useradd spark
- Set its password.
$ passwd spark
- The spark user is now ready. Switch to it.
$ su - spark
Download Apache Spark
- Download the 2.0.0 release into the spark user's home directory.
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
- Unpack the downloaded archive.
$ tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
$ cd spark-2.0.0-bin-hadoop2.7
Configure the Spark Server
- Place a compiled version of Spark on each node of the cluster.
- A simple cluster consists of one master and one or more workers connected to the master node.
- The following configuration options can be passed to the master and worker as arguments when starting the services:
Argument | Meaning
-h HOST, --host HOST | Hostname to listen on
-i HOST, --ip HOST | Hostname to listen on (deprecated; use -h or --host)
-p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker)
--webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker)
-c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
-d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
--properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
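For example, a worker could be started with explicit resource limits; the master hostname and the values below are purely illustrative:
$ ./sbin/start-slave.sh spark://<master-host>:7077 -c 4 -m 8G --work-dir /tmp/spark-work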
High Availability
High Availability can be provided by standby masters with ZooKeeper.
Each master needs to connect to the same ZooKeeper instance.
One will be elected “leader” and the others will remain in standby mode.
If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling.
The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. (Note that this delay only affects scheduling new applications – applications that were already running during Master fail-over are unaffected.)
- Start the ZooKeeper.
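With a standalone ZooKeeper installation this is typically done as follows (the installation path is an assumption; adjust it to your setup):
$ <zookeeper_home>/bin/zkServer.sh start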
- Create a configuration file ha.conf with the content as follows:
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=<zookeeper_host>:2181
spark.deploy.zookeeper.dir=/spark
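If ZooKeeper runs as an ensemble, spark.deploy.zookeeper.url can list several hosts separated by commas, for example (hostnames are illustrative):
spark.deploy.zookeeper.url=<zookeeper_host1>:2181,<zookeeper_host2>:2181,<zookeeper_host3>:2181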
- Start the first Master server:
$ ./sbin/start-master.sh -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf
- Start the second Master server (on different ports, since both masters run on localhost in this example):
$ ./sbin/start-master.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf
- Start a standalone Worker:
$ ./sbin/start-slave.sh spark://localhost:7077,localhost:17077
- Start Spark shell.
$ ./bin/spark-shell --master spark://localhost:7077,localhost:17077
- Wait until the Spark shell connects to the active standalone Master.
- Find out which standalone Master is active (there can only be one) and kill it; one way to do this is shown after this list.
- Observe how the other standalone Master takes over and lets the Spark shell register with itself.
- Check out the master’s UI.
- Optionally, kill the worker, make sure it goes away instantly in the active master’s logs.
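One way to find and stop the active Master process is with jps from the JDK (the process id below is a placeholder):
$ jps | grep Master
$ kill <master-pid>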
Starting the Server
- To start a standalone master server execute:
$ ./sbin/start-master.sh
- To start one or more workers and connect them to the master, execute:
$ ./sbin/start-slave.sh <master-spark-URL>
Checking the Status
- To check the status of the Spark cluster, open the web UI (default port 8080 for the master and 8081 for a worker):
http://<serverHost>:8080/
http://<serverHost>:8081/
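For a scriptable status check, the standalone master also serves a JSON view of the same information (assuming the default port):
$ curl http://<serverHost>:8080/json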