This is a step-by-step guide to installing and configuring an Apache Spark cluster on Linux.
Prerequisites
Hardware requirements
- 8+ GB RAM.
- 4-8 disks per node, configured without RAID.
- 8+ cores per node.
Software requirements
- CentOS 7/RHEL, 64-bit operating system.
- Java SE Development Kit 8 or greater.
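- To verify that a suitable JDK is already installed, you can run, for example:
$ java -version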
Create a User for Spark
- As root, create a user called spark.
$ useradd spark
- Set its password.
$ passwd spark
- The spark user is now ready. Switch to it.
$ su - spark
Download Apache Spark
- Download the 2.0.0 release into the spark user's home directory.
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
- Unpack the downloaded archive.
$ tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
$ cd spark-2.0.0-bin-hadoop2.7
Configure the Spark Server
- Place a compiled version of Spark on each node of the cluster.
- A simple cluster consists of one master and one or more workers connected to the master node.
- The following configuration options can be passed to the master and worker as arguments when starting the services:
Argument | Meaning
-h HOST, --host HOST | Hostname to listen on
-i HOST, --ip HOST | Hostname to listen on (deprecated; use -h or --host)
-p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker)
--webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker)
-c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
-d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
--properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
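For example, a worker could be started with explicit resource limits; the master hostname and the values below are purely illustrative:
$ ./sbin/start-slave.sh spark://<master-host>:7077 -c 4 -m 8G --work-dir /tmp/spark-work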
High Availability
High Availability can be provided by standby masters with ZooKeeper.
Each master needs to connect to the same ZooKeeper instance.
One will be elected “leader” and the others will remain in standby mode.
If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling.
The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. (Note that this delay only affects scheduling new applications – applications that were already running during Master fail-over are unaffected.)
- Start the ZooKeeper.
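With a standalone ZooKeeper installation this is typically done as follows (the installation path is an assumption; adjust it to your setup):
$ <zookeeper_home>/bin/zkServer.sh start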
- Create a configuration file ha.conf with the content as follows:
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=<zookeeper_host>:2181
spark.deploy.zookeeper.dir=/spark
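If ZooKeeper runs as an ensemble, spark.deploy.zookeeper.url can list several hosts separated by commas, for example (hostnames are illustrative):
spark.deploy.zookeeper.url=<zookeeper_host1>:2181,<zookeeper_host2>:2181,<zookeeper_host3>:2181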
- Start the first Master server:
$ ./sbin/start-master.sh -h localhost -p 7077 --webui-port 8080 --properties-file ha.conf
- Start the second Master server (on different ports, since both masters run on localhost in this example):
$ ./sbin/start-master.sh -h localhost -p 17077 --webui-port 18080 --properties-file ha.conf
- Start a standalone Worker:
$ ./sbin/start-slave.sh spark://localhost:7077,localhost:17077
- Start Spark shell.
$ ./bin/spark-shell --master spark://localhost:7077,localhost:17077
- Wait until the Spark shell connects to the active standalone Master.
- Find out which standalone Master is active (there can only be one) and kill it; one way to do this is shown after this list.
- Observe how the other standalone Master takes over and lets the Spark shell register with itself.
- Check out the master’s UI.
- Optionally, kill the worker, make sure it goes away instantly in the active master’s logs.
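One way to find and stop the active Master process is with jps from the JDK (the process id below is a placeholder):
$ jps | grep Master
$ kill <master-pid>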
Starting the Server
- To start a standalone master server execute:
$ ./sbin/start-master.sh
- To start one or more workers and connect them to the master, execute:
$ ./sbin/start-slave.sh <master-spark-URL>
Checking the Status
- To check the status of the Spark cluster, open the web UI (default port 8080 for the master and 8081 for a worker):
http://<serverHost>:8080/
http://<serverHost>:8081/
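For a scriptable status check, the standalone master also serves a JSON view of the same information (assuming the default port):
$ curl http://<serverHost>:8080/json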