Spark is a staple of the data engineering toolset for distributed compute transformations. It does, however, come with a level of overhead and a set of dependencies that need to be sorted out during setup. These are usually handled natively by cloud services (e.g. GCP Dataproc).
But if you are just getting started with learning PySpark, or just want a lightweight and fast way to test some syntax locally, spinning up a cluster is additional complexity you might not want to deal with (not to mention it costs money).
A lot of the tutorials out there are not very intuitive, or are missing some key aspects that make the development experience easier. So I decided to compile a list of steps to get you started with PySpark locally, quickly. Hope you find it useful!
Prereqs
- Python (I will be using version 3.10)
Check if you have Python installed by running:
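# prints the installed Python version (on some systems the command is python3 --version)
python --version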
Install Java
Check if you already have it installed by running:
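# prints the installed Java version
java -version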
If not, go to the Java download page for your platform and follow the installation instructions.
Install Spark
The following commands are for Mac (Bash). If you are running a Bash shell on Windows they should work as well; otherwise, they might need to be slightly altered to run in PowerShell.
- Go to the Spark Download page and get the latest version of Spark
- Unzip the downloaded .tgz file
- Move the extracted folder to your /opt/ directory on Mac:
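A sketch of what this might look like, assuming the extracted folder is named spark-3.5.1-bin-hadoop3 (adjust the name to whichever version you downloaded):
# move the extracted Spark folder into /opt/
sudo mv spark-3.5.1-bin-hadoop3 /opt/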
In Windows you might not have this folder. The name or location is not really important, as long as it is somewhere it can't be accidentally moved or deleted, since it will be referenced every time you spin up a notebook.
- Create a symbolic link to make it easier to switch Spark versions. If you want to install a different version of Spark in the future, you can just point the symlink at the new folder and everything else will still work.
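A minimal sketch, assuming the example spark-3.5.1-bin-hadoop3 folder from the previous step:
# point a stable /opt/spark path at the versioned install
sudo ln -s /opt/spark-3.5.1-bin-hadoop3 /opt/spark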
Install PySpark and Jupyter
To keep things clean, it is better to always use virtual environments when installing Python modules. This can be done in many ways: conda envs, pipenv, venv, …
Here I will be demonstrating with venv.
- Create a new directory to work from:
mkdir jupyter-spark
cd jupyter-spark
- Create a virtual environment:
python -m venv .pyspark-env
- Activate the virtual environment:
source .pyspark-env/bin/activate
- Install pyspark and jupyterlab:
pip install pyspark jupyterlab
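Optionally, you can sanity-check the install from inside the activated environment (this one-liner just prints the installed PySpark version):
python -c "import pyspark; print(pyspark.__version__)"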
Updating your shell rc file
Now you need to tell your shell where to find Spark. This is done by setting some environment variables.
This can be done in a variety of ways, but to make it as seamless as possible we will handle it in the ~/.bashrc file.
Note: depending on your shell, this step will be different. For example, if you use zsh as your shell, you will need to modify your ~/.zshrc instead.
Add the following environment variables:
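This is a sketch of what that block typically looks like. It assumes the /opt/spark symlink created earlier, and uses PySpark's standard PYSPARK_DRIVER_PYTHON variables, which are what make the pyspark command launch JupyterLab:
# point SPARK_HOME at the symlink created earlier
export SPARK_HOME=/opt/spark
# put the Spark binaries (including pyspark) on your PATH
export PATH=$SPARK_HOME/bin:$PATH
# have pyspark start JupyterLab as its driver front end
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab'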
Save, exit, and restart your shell (or run source ~/.bashrc) so the changes take effect.
Create a PySpark Notebook
The setup is done. Now, any time you want to start your PySpark instance with JupyterLab locally, you just need to:
- cd into the directory where you installed pyspark
- activate the virtual environment
- run the command pyspark in your shell
This will open a JupyterLab instance in your default browser, and you are good to go!
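Putting it all together, with the jupyter-spark directory and .pyspark-env environment from earlier, the whole start-up looks something like this:
cd jupyter-spark
source .pyspark-env/bin/activate
pyspark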