Whether it’s for social science, marketing, business intelligence or something else, data analysis benefits from heavy-duty parallelization more and more often.
Apache Spark is an awesome platform for big data analysis, so getting to know how it works and how to use it is probably a good idea. Setting up and administering your own cluster is a bit of a hassle if you just want to learn the basics, though (although Amazon EMR or Databricks make that quite easy, and you can even build your own Raspberry Pi cluster if you want…), so getting Spark and Pyspark running on your local machine seems like a better idea. You can also use Spark with R and Scala, among others, but I have no experience with how to set that up, so we’ll stick to Pyspark in this guide.
While dipping my toes into the water I noticed that none of the guides I could find online were entirely transparent, so I’ve compiled the steps I actually took to get this up and running. The original guides I’m working from are here, here and here.
Before we can actually install Spark and Pyspark, a few things need to be present on your machine:
- The XCode Developer Tools
- Homebrew
- pipenv
- Java (v8)
Installing the XCode Developer Tools
- Open Terminal
- Run xcode-select --install
- Confirm, and proceed with the install
Installing Homebrew
- Open Terminal
- Paste the command listed on the brew homepage: brew.sh
At the time of this writing, that’s
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
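Once the script finishes, you can confirm that Homebrew is available:
$ brew --version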
Installing pipenv
- Open Terminal
brew install pipenv
If that doesn’t work for some reason, you can do the following:
- Open Terminal
pip install --user pipenv
This does a pip user install, which puts pipenv in your home directory. If pipenv isn’t available in your shell after installation, you need to add its install location to your PATH. Here’s pipenv’s guide on how to do that.
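For reference, a pip user install puts executables under your Python user base, and a sketch of the fix on a Mac looks like the following. The exact path depends on your Python version, so check what the first command prints before copying anything:
$ python -m site --user-base
(prints something like /Users/you/Library/Python/3.7)
Then add the matching bin directory to your ~/.bashrc or ~/.zshrc:
export PATH="$HOME/Library/Python/3.7/bin:$PATH"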
Installing Java
To run, Spark needs Java installed on your system. It’s important that you do not install Java with brew, for uninteresting reasons. Just go here to download Java (v8) for your Mac and follow the instructions.
You can confirm Java is installed by typing $ java -version in Terminal.
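If it’s installed, you’ll see output along these lines (the exact build will differ; the important bit is that the version starts with 1.8):
java version "1.8.0_212"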
Installing Apache Spark
With the prerequisites in place, you can now install Apache Spark on your Mac.
- Go to the Apache Spark Download Page
- Download the newest version, a file ending in .tgz
- Unzip this file in Terminal:
$ tar -xzf spark-2.4.3-bin-hadoop2.7.tgz
- Move the unpacked folder to /opt:
$ sudo mv spark-2.4.3-bin-hadoop2.7 /opt/spark-2.4.3
- Create a symbolic link (symlink) to your Spark version:
$ sudo ln -s /opt/spark-2.4.3 /opt/spark
What’s happening here? By creating a symbolic link to our specific version (2.4.3), we can have multiple Spark versions installed in parallel and only need to adjust the symlink to switch between them.
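For example, if a (hypothetical) Spark 2.4.4 came out and you unpacked it to /opt/spark-2.4.4, switching over would be a single command, with nothing else in your setup having to change:
$ sudo ln -sfn /opt/spark-2.4.4 /opt/spark
(-f replaces the existing link, -n stops ln from following it into the old directory.)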
Tell your shell where to find Spark
Until macOS 10.14 the default shell used in the Terminal app was bash, but from 10.15 on it is the Z shell (zsh). So depending on your version of macOS, you need to do one of the following:
$ nano ~/.bashrc (10.14 and earlier)
$ nano ~/.zshrc (10.15 and later)
Set Spark variables in your ~/.bashrc or ~/.zshrc:
# Spark
export SPARK_HOME="/opt/spark"
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_LOCAL_IP='127.0.0.1'
I recommend that you install Pyspark in its own virtual environment using pipenv, to keep things clean and separated.
- Make yourself a new folder somewhere, like ~/coding/pyspark-project, and move into it:
$ cd ~/coding/pyspark-project
- Create a new environment:
$ pipenv --three if you want to use Python 3
$ pipenv --two if you want to use Python 2
$ pipenv install pyspark
$ pipenv install jupyter
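The test snippet at the end of this guide uses numpy, so you might as well install it into the environment now:
$ pipenv install numpy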
Now tell Pyspark to use Jupyter: in your ~/.bashrc or ~/.zshrc, add
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
If you want to use Python 3 with Pyspark (see the environment creation step above), you also need to add:
export PYSPARK_PYTHON=python3
Your ~/.bashrc or ~/.zshrc should now have a section that looks kinda like this:
# Spark
export SPARK_HOME="/opt/spark"
export PATH=$SPARK_HOME/bin:$PATH
export SPARK_LOCAL_IP='127.0.0.1'

# Pyspark
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3 # only if you're using Python 3
Now save the file, and source it so the changes take effect:
$ source ~/.bashrc or
$ source ~/.zshrc
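To check that your shell now finds Spark, you can ask for the version banner:
$ spark-submit --version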
To start Pyspark and open up Jupyter, you can simply run $ pyspark. You only need to make sure you’re inside your pipenv environment. That means:
- Go to your pyspark folder ($ cd ~/coding/pyspark-project)
- Activate the environment: $ pipenv shell
Using Pyspark inside your Jupyter Notebooks
To test whether Pyspark is running as it is supposed to, put the following code into a new notebook and run it:
import numpy as np

TOTAL = 10000
# Create an RDD of TOTAL random points in the square [-1, 1] x [-1, 1]
dots = sc.parallelize([2.0 * np.random.random(2) - 1.0 for i in range(TOTAL)]).cache()
print("Number of random points:", dots.count())

# Element-wise statistics over the point coordinates
stats = dots.stats()
print('Mean:', stats.mean())
print('stdev:', stats.stdev())
(You might need to install numpy inside your pipenv environment if you haven’t already done so 😉)
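As a little bonus check, you can reuse those same random points for the classic Monte Carlo estimate of π: roughly a quarter of them should land inside the unit circle. A minimal sketch, building on the variables from the snippet above:

# The fraction of points inside the unit circle approximates pi/4
inside = dots.filter(lambda p: np.linalg.norm(p) <= 1.0).count()
print('Pi estimate:', 4.0 * inside / TOTAL)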
If you get an error along the lines of sc is not defined, you need to create the SparkContext yourself at the top of the cell; see the snippet below. If things are still not working, make sure you followed the installation instructions closely.
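That fix would look like this:

from pyspark import SparkContext

# Reuses the existing SparkContext if the pyspark driver already made one,
# otherwise creates a new one
sc = SparkContext.getOrCreate()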
Still no luck? Send me an email if you want (I most definitely can’t guarantee that I know how to fix your problem), particularly if you find a bug and figure out how to make it work!