15 Apr 2018 - Tutorial
Setting up Spark for use in Java in Windows is fairly easy if you know what to do. I will take you through the steps needed here.
We will use the following technologies, which you should already have installed and set up:
- Java 8
- Apache Maven
- IntelliJ IDEA (or another IDE set up to work with Maven)
You should know how to work with Maven.
My set up uses the D: volume, but you should be able to substitute it for C: if you prefer.
Installing Spark
First, you need to download and install Apache Spark. Go to this page and download the archive named spark-2.0.0-bin-hadoop2.7.tgz. Extract the archive to D:\spark such that you now have the folders D:\spark\bin etcetera.
Now download Hadoop. Copy bin\winutils.exe to the D:\spark\bin folder.
Environment variables
Go to your system’s environment variables by typing “environment variables” in the Start menu and selecting “Edit the system environment variables”. Add two new variables under the “user variables” section:
- HADOOP_HOME with the value D:\spark
- SPARK_HOME with the value D:\spark\bin
Now edit the PATH variable and add two new entries:
- %HADOOP_HOME%
- %SPARK_HOME%
Close all windows by clicking “OK”.
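If you prefer the command line, the same two variables can be set from a command prompt. This is a sketch of the equivalent setx commands; note that setx writes user-level variables and only takes effect in newly opened terminals, and the PATH entries are still easiest to add through the dialog above:

```bat
rem Set the two user variables used by Spark and winutils
setx HADOOP_HOME "D:\spark"
setx SPARK_HOME "D:\spark\bin"
```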
Testing the installation
Open a command prompt (Windows+R, enter cmd and press the Return key) and execute spark-shell.cmd. This should launch the Spark shell and among other things print the Spark logo as ASCII art.
Setting up a Maven project
Now we will create a Maven project so that we can use Spark from Java. Create a new Maven project with the quickstart archetype maven-archetype-quickstart. In IntelliJ you can do this through File > New > Project… and selecting Maven in the list, then checking “Create from archetype” and selecting the quickstart archetype. Open pom.xml and add the following repository. Also add these two dependencies:
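The XML snippets from the original post did not survive extraction. Spark 2.0.0 artifacts are published on Maven Central, so an extra repository entry is usually not needed; a plausible pair of dependencies for a Java word count would be the following (spark-core is required, and spark-sql is my assumption for the second dependency):

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
  </dependency>
</dependencies>
```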
Now edit the App.java file that was created by the Maven archetype and enter this code below the package statement:
Now we just need an input text file. Andrej Karpathy has an example of character-level recurrent neural networks on GitHub, and on the repository there is an input file available with some Shakespeare plays. Download the text file from the repository and save it in src/main/resources. You should now be able to run the main function in App.java and obtain a list of word counts after a lot of Spark output.
Congratulations, you have set up Apache Spark for use with Java!
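The App.java snippet referenced above was also lost in extraction. A minimal word count along these lines should work with the Spark 2.0 Java API; the input file name input.txt, the package name, and printing only the top ten words are assumptions of mine, and local[*] runs Spark in-process so no cluster is needed:

```java
package com.example;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class App {
    public static void main(String[] args) {
        // Run Spark locally on all cores; no cluster required.
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read the Shakespeare text saved under src/main/resources.
            JavaRDD<String> lines = sc.textFile("src/main/resources/input.txt");

            // Split each line into words and count occurrences per word.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // Sort by count (descending) and print the ten most frequent words.
            List<Tuple2<String, Integer>> top = counts
                    .mapToPair(Tuple2::swap)   // (count, word) so we can sort by key
                    .sortByKey(false)
                    .mapToPair(Tuple2::swap)   // back to (word, count)
                    .take(10);
            top.forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```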
Related Posts
Apache Spark is a lightning-fast cluster computing engine conducive to big data processing. To learn how to work with it, there is currently a MOOC conducted by UC Berkeley here. However, it uses a pre-configured VM set up specifically for the MOOC and its lab exercises, and I wanted to get a taste of this technology on my personal computer. I invested two days searching the internet trying to find out how to install and configure it in a Windows environment, and finally came up with the following brief steps that led me to a working installation of Apache Spark.
To install Spark in a Windows environment, the following prerequisites should be fulfilled first.
Requirements:
- If you are a Python user, install Python 2.6 or above; otherwise this step is not required, and you also do not need to set up the Python path as an environment variable.
- Download a pre-built Spark binary for Hadoop. I chose Spark release 1.2.1, package type “Pre-built for Hadoop 2.3 or later”, from here.
- Once downloaded, I unzipped the *.tar file using WinRAR to the D drive. (You can unzip it to any drive on your computer.)
- The benefit of using a pre-built binary is that you will not have to go through the trouble of building the Spark binaries from scratch.
- Download and install Scala version 2.10.4 from here, only if you are a Scala user; otherwise this step is not required, and you also do not need to set up the Scala path as an environment variable.
- Download winutils.exe and place it in any location on the D drive. The official release of Hadoop 2.6 does not include the required binaries (like winutils.exe) that are needed to run Hadoop on Windows. Remember, Spark is an engine built over Hadoop.
Setting up the PATH variable in the Windows environment:
This is the most important step. If the Path variable is not set up properly, you will not be able to start the Spark shell. Now, how do you access the Path variable?
- Right click on Computer
- Left click on Properties
- Click on Advanced System Settings
- Under Startup & Recovery, click on the button labelled “Environment Variables”
- You will see the window divided into two parts: the upper part will read “User variables for <username>” and the lower part “System variables”. We will create new system variables, so click on the “New” button under System variables
- Set the variable name as JAVA_HOME (in case Java is not installed on your computer, follow these steps first). Next, set the variable value as the path to your JDK installation
- Similarly, create a new system variable and name it PYTHON_PATH. Set the variable value as the Python path on your computer
- Create a new system variable and name it HADOOP_HOME. Set the variable value as the folder containing winutils.exe. (Note: there is no need to install Hadoop. The Spark shell only requires the Hadoop path, which in this case points to winutils and will let us run Spark programs in a Windows environment.)
- Create a new system variable and name it SPARK_HOME. Assign the variable value as the path to your Spark binary location, i.e. the folder where you unzipped Spark
NOTE: Apache Maven installation is an optional step. I am mentioning it here because I want to install SparkR, an R version of Spark.
- Download Apache Maven 3.1.1 from here. Choose Maven 3.1.1 (binary zip) and unpack it using WinZip or WinRAR. Create two new system variables (M2_HOME and MAVEN_HOME are the conventional names) and assign both of them the path to your Maven binary location
Now, all you have to do is append these four system variables, namely JAVA_HOME, PYTHON_PATH, HADOOP_HOME and SPARK_HOME, to your Path variable.
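For example, the entries appended to Path might look like this; the \bin suffixes on the Java, Hadoop and Spark entries are my assumption, since that is where each tool keeps its executables, so adjust them to match where the binaries actually live on your machine:

```
;%JAVA_HOME%\bin;%PYTHON_PATH%;%HADOOP_HOME%\bin;%SPARK_HOME%\bin
```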
Click on Ok to close the Environment variable window and then similarly on System properties window.
How to start Spark on Windows
To run Spark in a Windows environment:
- Open up the command prompt terminal
- Change directory to the location of the Spark directory. For example, in my case it is present in the D drive
- Navigate into the bin directory with cd bin
- Run the command spark-shell and you should see the Spark logo with the Scala prompt
- Open up a web browser and type localhost:4040 in the address bar and you will see the Spark shell application UI
- To quit Spark, type :quit at the Scala prompt (or press Ctrl+D)
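Put together, the session might look roughly like this; the drive letter and folder name are from my setup, so substitute your own:

```
C:\> d:
D:\> cd spark\bin
D:\spark\bin> spark-shell
...
scala>
```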
That is all it takes to install and run a standalone Spark cluster in a Windows environment. Hope this helps.