Apache Spark Installation on Windows: Step-by-Step Guide

Setting up Apache Spark on a Windows machine requires careful attention to system dependencies and environment configuration. This guide walks through the process step by step so that developers and data engineers can get started quickly. Unlike Linux or macOS, Windows needs a few extra adjustments, mainly to environment variables and the system PATH, before Spark runs reliably. The following instructions assume a clean installation of Windows 10 or 11 with administrative access.

Preparing the Windows Environment

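Before installing Apache Spark, prepare the operating system to avoid common pitfalls. Java must be installed and accessible from the command line, because Spark runs on the JVM and cannot start without it. A standalone Scala installation is also set up in this guide for developing Scala applications, although the pre-built Spark packages already bundle the Scala libraries that spark-shell uses. Verify that these prerequisites are in place to prevent delays later in the process.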

Installing Java Development Kit

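Download a Long-Term Support (LTS) version of Java that your Spark release supports, from Oracle or Eclipse Adoptium (the successor to AdoptOpenJDK). The newest LTS is not always supported: Spark 3.5, for example, runs on Java 8, 11, and 17. Set the JAVA_HOME environment variable to point to the JDK installation directory. Add the JDK bin folder (%JAVA_HOME%\bin) to the system PATH so that java and javac can be run from any terminal. Open a new Command Prompt afterward, since windows that are already open do not pick up the changed variables.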
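The Java check can also be scripted. The following is a minimal sketch, not part of any official tooling: the function names parse_java_major and installed_java_major are hypothetical, and it assumes the usual `java -version` banner format (for example `openjdk version "17.0.9"`, or `java version "1.8.0_381"` for Java 8).

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Return the major Java version found in `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_381").
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` and parse the result (requires java on PATH)."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr)
```

Calling installed_java_major() in a new terminal and comparing the result with the Java versions listed for your Spark release is one way to catch a mismatch before Spark is installed.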

Configuring Scala

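Spark applications are often written in Scala, so a local Scala installation is useful for compiling and running them outside the Spark shell. Choose a Scala version that matches the one your Spark release was built with; Spark 3.5 packages, for example, are built with Scala 2.12 by default, with Scala 2.13 builds offered separately. Extract the Scala archive to a dedicated folder and define the SCALA_HOME variable to point to it. Include the Scala bin directory (%SCALA_HOME%\bin) in the PATH to enable command-line access.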
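Once a Spark package has been extracted, the Scala version it was built with can be read from the package itself. This is a rough sketch that assumes the standard layout of the pre-built distributions, where a scala-library-<version>.jar file sits in the jars folder; the function name bundled_scala_version is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def bundled_scala_version(spark_home: Path) -> Optional[str]:
    """Return the version of the scala-library JAR under <spark_home>/jars."""
    jars_dir = spark_home / "jars"
    if not jars_dir.is_dir():
        return None
    for jar in sorted(jars_dir.glob("scala-library-*.jar")):
        # Accept only names of the form scala-library-<major>.<minor>.<patch>.jar
        match = re.fullmatch(r"scala-library-(\d+\.\d+\.\d+)\.jar", jar.name)
        if match:
            return match.group(1)
    return None
```

If the function returns, say, a 2.12.x version, a standalone Scala 2.12 installation is the matching choice for compiling applications against that Spark package.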

Downloading and Setting Up Apache Spark

Visit the official Apache Spark website and download the latest package that is pre-built for Apache Hadoop. Choose the binary distribution without Hadoop only if you plan to manage Hadoop separately. Avoid downloading the source package unless you intend to build Spark yourself, as this adds unnecessary complexity. Extract the archive and place the resulting folder in a location with sufficient disk space and write permissions.
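Spark packages are distributed as .tgz archives, which can be unpacked with a short script if no archive tool is at hand. The sketch below is one possible approach using Python's standard tarfile module; the function name extract_spark is hypothetical, and it assumes a trusted archive from the official download page that contains a single top-level folder.

```python
import tarfile
from pathlib import Path


def extract_spark(archive: Path, dest: Path) -> Path:
    """Extract a .tgz archive into dest and return the extracted root folder."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member in the archive.
        top_level = {
            Path(name).parts[0] for name in tar.getnames() if Path(name).parts
        }
        if len(top_level) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        # Only extract archives you trust, such as official downloads.
        tar.extractall(dest)
    return dest / top_level.pop()
```

The returned path is the folder that SPARK_HOME should later point to. Any concrete archive name or destination passed to the function is up to you; none is assumed here.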

Configuring Environment Variables

Define SPARK_HOME to point to the root directory of the Spark installation. Add %SPARK_HOME%\bin to the system PATH to run Spark commands from any directory. Verify the configuration by opening a new Command Prompt and checking the Spark version, for example with spark-submit --version. A version banner confirms that the system recognizes the installation.

| Variable | Example Value | Purpose |
| --- | --- | --- |
| JAVA_HOME | C:\Program Files\Java\jdk-17 | Points to Java installation |
| SCALA_HOME | C:\Tools\scala | Points to Scala installation |
| SPARK_HOME | C:\Tools\spark-3.5.0-bin-hadoop3 | Points to Spark installation |
| PATH | …;%SPARK_HOME%\bin | Enables command-line access |
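A quick script can confirm that the home variables are set and point to real folders. This is a minimal sketch under the assumption that the three variable names used in this guide are the ones to check; the function name check_home_vars and the status strings are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Sequence

# Variable names taken from this guide; adjust if your setup differs.
HOME_VARS = ("JAVA_HOME", "SCALA_HOME", "SPARK_HOME")


def check_home_vars(
    env: Mapping[str, str], names: Sequence[str] = HOME_VARS
) -> Dict[str, str]:
    """Report whether each variable is set and points to an existing directory."""
    report = {}
    for name in names:
        value = env.get(name, "").strip()
        if not value:
            report[name] = "not set"
        elif not Path(value).is_dir():
            report[name] = "set, but not an existing directory"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    # Run from a new terminal so that recent changes are visible.
    for name, status in check_home_vars(os.environ).items():
        print(f"{name}: {status}")
```

Passing the environment in as a mapping keeps the check easy to try against sample values without changing the real system settings.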

Running Spark in Local Mode

With the environment variables set, launch the Spark shell to test the installation. Run spark-shell in a new Command Prompt to start an interactive Scala session. Watch the console output for warnings or errors related to Hadoop libraries; on Windows these often concern a missing winutils.exe or an unset HADOOP_HOME. A successful startup prints the Web UI URL and reports that the Spark context is available as sc.
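If the spark-shell command is not recognized, the launcher is probably not reachable through PATH. The following sketch uses Python's standard shutil.which to show where the command would be resolved from; the function name find_launcher is hypothetical.

```python
import shutil
from typing import Optional


def find_launcher(
    name: str = "spark-shell", search_path: Optional[str] = None
) -> Optional[str]:
    """Return the full path of the launcher, or None if it is not found.

    With search_path=None, the PATH of the current process is searched.
    On Windows, shutil.which also tries the extensions listed in PATHEXT,
    so a launcher script with a .cmd extension is matched as well.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_launcher()
    if location is None:
        print("spark-shell not found; check that %SPARK_HOME%\\bin is on PATH")
    else:
        print(f"spark-shell resolves to {location}")
```

A result of None from a new terminal usually means the PATH entry is missing or misspelled, which is worth fixing before investigating anything else.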

Run a simple command such as sc.parallelize(1 to 100).sum() to verify that computations work correctly; it should return 5050.0, the sum of the integers from 1 to 100. This test confirms that Spark can execute tasks locally without connecting to a cluster. If the shell starts and returns the expected result, the installation on Windows is complete and functional.