How to Install Apache Spark on Ubuntu Linux
Installing Apache Spark on Ubuntu Linux lets you process and analyze large datasets efficiently across distributed systems.
Apache Spark is an open-source, distributed computing system designed for big data workloads. It provides powerful APIs for batch and stream processing, machine learning, and graph computation.
This guide walks you through installing Spark 2.4.6, the release downloaded in the steps below, on Ubuntu 20.04 or 18.04. You will set up a standalone Spark master and worker on a single machine, which is enough to start experimenting with data processing tasks on your PC.
Install Java JDK and Scala using `sudo apt install default-jdk scala`. Download and extract Spark from its official site, then move it to `/opt/spark`. Set `SPARK_HOME` and update your PATH in `~/.bashrc`. Finally, start Spark with `start-master.sh` and `start-slave.sh`.
Install Java JDK
Apache Spark requires a Java JDK. On Ubuntu, the commands below install the distribution's default OpenJDK package.
sudo apt update
sudo apt install default-jdk
After installing, run the command below to verify the version of Java installed.
java --version
This should show you lines like the ones below:
openjdk 11.0.10 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
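If you want to check the Java version from a script rather than by eye, a minimal sketch along these lines may help. The helper name `java_major` is made up for this example. It assumes Java reports its version either in the older `1.8.0_292` style or the newer `11.0.10` style.

```shell
#!/usr/bin/env bash
# java_major is a hypothetical helper, not part of Java or Spark.
# It prints the major version from a Java version string and handles
# both the legacy "1.8.0_292" scheme and the modern "11.0.10" scheme.
java_major() {
  local v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme: strip the leading "1."
  esac
  echo "${v%%[!0-9]*}"    # keep only the leading run of digits
}

if command -v java >/dev/null 2>&1; then
  # "java -version" writes to stderr; the version is the first quoted string.
  raw="$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')"
  echo "Detected Java major version: $(java_major "$raw")"
else
  echo "java not found on PATH; install default-jdk first"
fi
```

The script only reports what it finds, so it is safe to run before or after installing the JDK.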
Install Scala
Spark is written in Scala, so install the Scala package as well. On Ubuntu, run the command below:
sudo apt install scala
To verify the version of Scala installed, run the command below:
scala -version
This should display a line like this:
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Install Apache Spark
Now that you have installed the required packages to run Apache Spark, continue below to install it.
Download Spark from the Apache archive. The commands below fetch release 2.4.6, prebuilt for Hadoop 2.7.
cd /tmp
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
Then, extract the downloaded file and move it to /opt/spark.
tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz
sudo mv spark-2.4.6-bin-hadoop2.7 /opt/spark
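The file name and URL in the download step follow a regular pattern, so you can build them from a version number instead of typing them out. The sketch below is one way to do that. The variable names `SPARK_VERSION` and `HADOOP_PROFILE` and the two helper functions are illustrative choices for this example, and the pattern is simply inferred from the URL used in this guide. It only prints the values and does not download anything.

```shell
#!/usr/bin/env bash
# SPARK_VERSION and HADOOP_PROFILE are illustrative names for this sketch.
# The defaults match the release used in this guide.
SPARK_VERSION="${SPARK_VERSION:-2.4.6}"
HADOOP_PROFILE="${HADOOP_PROFILE:-hadoop2.7}"

# Print the tarball file name for a given version and Hadoop profile.
spark_tarball() {
  echo "spark-$1-bin-$2.tgz"
}

# Print the Apache archive URL for that tarball.
spark_url() {
  echo "https://archive.apache.org/dist/spark/spark-$1/$(spark_tarball "$1" "$2")"
}

echo "Tarball: $(spark_tarball "$SPARK_VERSION" "$HADOOP_PROFILE")"
echo "URL:     $(spark_url "$SPARK_VERSION" "$HADOOP_PROFILE")"
```

To use it for the actual download, you could pass the result to wget, for example `wget "$(spark_url "$SPARK_VERSION" "$HADOOP_PROFILE")"`. Check that the release you ask for exists in the archive first.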
After that, create the necessary environment variables to run Spark.
nano ~/.bashrc
Add the following lines to the bottom of the file and save.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Finally, apply your environment changes by running this command.
source ~/.bashrc
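To confirm the environment changes took effect, you can check that both Spark directories appear on your PATH. The following is a minimal sketch. The helper `path_contains` is a made-up name for this example, and the script assumes Spark lives in `/opt/spark` when `SPARK_HOME` is not set.

```shell
#!/usr/bin/env bash
# path_contains is a hypothetical helper for this sketch.
# It returns 0 if directory $1 is an entry in the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fall back to the install location used in this guide if SPARK_HOME is unset.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

for dir in "$SPARK_HOME/bin" "$SPARK_HOME/sbin"; do
  if path_contains "$dir" "$PATH"; then
    echo "ok: $dir is on PATH"
  else
    echo "missing: $dir is not on PATH"
  fi
done
```

If either directory is reported as missing, recheck the lines you added to `~/.bashrc` and open a new terminal or run `source ~/.bashrc` again.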
Start Apache Spark
At this point, Apache Spark is installed and ready to use. Run the command below to start the standalone master.
start-master.sh
Next, start the Spark worker process by running the command below.
start-slave.sh spark://localhost:7077
You can replace `localhost` with your server's hostname or IP address. Once the worker is running, open your browser and go to port 8080 on the server's hostname or IP address to view the Spark master web UI.
http://localhost:8080
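If you start the worker on different machines, it can be handy to build the master URL from a hostname instead of editing the command each time. Here is a small sketch of that idea. `MASTER_HOST` and `master_url` are illustrative names, and the ports are the ones used in this guide (7077 for the master, 8080 for its web UI). The script falls back to printing the command when `start-slave.sh` is not on your PATH.

```shell
#!/usr/bin/env bash
# master_url is a hypothetical helper for this sketch.
# It prints a spark:// URL for host $1 (default localhost) and port $2 (default 7077).
master_url() {
  echo "spark://${1:-localhost}:${2:-7077}"
}

# MASTER_HOST is an illustrative variable; set it to your server's hostname or IP.
host="${MASTER_HOST:-localhost}"
url="$(master_url "$host")"

if command -v start-slave.sh >/dev/null 2>&1; then
  start-slave.sh "$url"
else
  echo "start-slave.sh not on PATH; would run: start-slave.sh $url"
fi

echo "Master web UI: http://$host:8080"
```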
If you wish to connect to Spark via its command shell, run the commands below:
spark-shell
The commands above will launch Spark Shell.
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
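For a quick check outside the interactive shell, you can run one of the example programs bundled with Spark. The sketch below runs the SparkPi example through the `run-example` script in Spark's `bin` directory and pulls the estimate out of its "Pi is roughly ..." output line. The helper `extract_pi` is a made-up name for this example, and the script assumes `$SPARK_HOME/bin` is on your PATH.

```shell
#!/usr/bin/env bash
# extract_pi is a hypothetical helper for this sketch.
# It reads text on stdin and prints the number from a "Pi is roughly N" line.
extract_pi() {
  grep -o 'Pi is roughly [0-9.]*' | awk '{print $4}'
}

if command -v run-example >/dev/null 2>&1; then
  # Spark's log messages go to stderr, so they are discarded here.
  run-example SparkPi 10 2>/dev/null | extract_pi
else
  echo "run-example not on PATH; check SPARK_HOME and PATH"
fi
```

If a number close to 3.14 is printed, Spark can launch a job and return a result on this machine.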
That should do it!
Conclusion:
This post showed you how to install Apache Spark on Ubuntu 20.04 and 18.04.
About the Author
Richard
Tech Writer, IT Professional
Richard, a writer for Geek Rewind, is a tech enthusiast who loves breaking down complex IT topics into simple, easy-to-understand ideas. With years of hands-on experience in system administration and enterprise IT operations, he’s developed a knack for offering practical tips and solutions. Richard aims to make technology more accessible and actionable. He's deeply committed to the Geek Rewind community, always ready to answer questions and engage in discussions.