Setting up Apache Spark on a Windows machine requires careful attention to system dependencies and environment configuration. This guide walks through the process step by step so that developers and data engineers end up with a working local installation. Unlike on Linux or macOS, running Spark on Windows involves a few platform-specific adjustments, mainly to environment variables and paths. The following instructions assume a clean installation of Windows 10 or 11 with administrative access.
Preparing the Windows Environment
Before installing Apache Spark, prepare the operating system to avoid common pitfalls. Java must be installed and accessible from the command line, because Spark runs on the JVM and cannot start without it. This guide also sets up Scala for application development. Verifying these prerequisites first prevents harder-to-diagnose failures later in the process.
Installing Java Development Kit
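Download a Long-Term Support (LTS) release of the JDK from Oracle or Eclipse Adoptium (the successor to AdoptOpenJDK). The newest LTS is not always supported by a given Spark release, so check the Spark documentation for the Java versions it supports before choosing. Set the JAVA_HOME environment variable to point to the JDK installation directory. Add %JAVA_HOME%\bin to the system PATH so that java and javac can be run from any terminal. Open a new command prompt afterwards, since windows that are already open keep the old environment, and run java -version to confirm the change.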
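The checks described above can also be scripted. The following is a minimal sketch, assuming a Python 3 interpreter is available on the machine (Python is not otherwise required by this guide); the function name check_java is hypothetical and chosen only for illustration.

```python
import os
import shutil
from pathlib import Path


def check_java(env=None):
    """Return a list of problems found with the Java setup (empty if none)."""
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    else:
        bin_dir = Path(java_home) / "bin"
        # On Windows the launcher is java.exe; elsewhere it is plain "java".
        if not any((bin_dir / name).is_file() for name in ("java.exe", "java")):
            problems.append(f"no java launcher found in {bin_dir}")

    # shutil.which searches the given PATH string for an executable.
    if shutil.which("java", path=env.get("PATH", "")) is None:
        problems.append("java is not on PATH")

    return problems


if __name__ == "__main__":
    issues = check_java()
    print("Java setup looks fine" if not issues else "\n".join(issues))
```

Run the script from a newly opened command prompt so that it sees the updated environment. An empty result only means the variables are consistent; it does not confirm that the JDK version suits your Spark release.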
Configuring Scala
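Spark applications are often written in Scala. The pre-built Spark distribution bundles the Scala libraries that spark-shell needs, so a separate Scala installation is mainly useful for compiling and running code outside the shell. Choose a Scala version from the same binary line (for example 2.12 or 2.13) as the Spark release you plan to use. Extract the Scala archive to a dedicated folder, set the SCALA_HOME variable to that folder, and add %SCALA_HOME%\bin to the PATH to enable command-line access.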
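Version matching comes down to comparing binary versions: within Scala 2, releases that share the same major.minor prefix (such as 2.12.x) are binary compatible with each other, while Scala 3 releases share the single binary version 3. The sketch below illustrates that comparison; the helper names scala_binary_version and is_compatible are hypothetical, and the version strings in the usage lines are examples only, not a statement about any particular Spark release.

```python
def scala_binary_version(full_version):
    """Reduce a full Scala version such as '2.12.18' to its binary version '2.12'."""
    parts = full_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(f"unrecognized Scala version: {full_version!r}")
    # Scala 3 releases share one binary version; Scala 2 uses major.minor.
    if parts[0] == "3":
        return "3"
    return ".".join(parts[:2])


def is_compatible(installed_scala, spark_scala):
    """True if the installed Scala is on the binary line a Spark build targets."""
    return scala_binary_version(installed_scala) == scala_binary_version(spark_scala)


if __name__ == "__main__":
    # Example values only; read the real ones from scala -version and
    # from the "Using Scala version" line that spark-shell prints on startup.
    print(is_compatible("2.12.18", "2.12.17"))
    print(is_compatible("2.13.12", "2.12.17"))
```

The Scala version that a Spark build targets appears in the spark-shell startup banner and in the suffix of its artifact names (for example _2.12).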
Downloading and Setting Up Apache Spark
Visit the official Apache Spark website and download the latest release as a package pre-built for Apache Hadoop. Choose the package pre-built with user-provided Hadoop only if you plan to manage Hadoop separately. Avoid the source download unless you intend to build Spark yourself, as this adds unnecessary complexity. Extract the archive and place the resulting folder in a location with sufficient disk space and write permissions.
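A quick way to tell a binary distribution from a source download, or from an incomplete extraction, is to look for a few entries that a pre-built package is expected to contain. The sketch below assumes the usual layout of a pre-built Spark package (a bin folder with the Windows .cmd launchers and a jars folder); the list EXPECTED_ENTRIES and the function missing_entries are hypothetical names, and the path in the usage line is a placeholder.

```python
from pathlib import Path

# Entries assumed to exist in a pre-built Spark binary distribution.
EXPECTED_ENTRIES = ("bin/spark-shell.cmd", "bin/spark-submit.cmd", "jars")


def missing_entries(spark_root):
    """Return the expected entries that are absent under spark_root."""
    root = Path(spark_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


if __name__ == "__main__":
    # Placeholder path; replace it with the folder you extracted.
    missing = missing_entries(r"C:\spark")
    print("layout looks complete" if not missing else f"missing: {missing}")
```

If jars is reported missing while bin is present, the archive is probably a source download rather than a pre-built package.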
Configuring Environment Variables
Define SPARK_HOME to point to the root directory of the Spark installation. Add %SPARK_HOME%\bin to the system PATH to run Spark commands from any directory. Verify the configuration by opening a new command prompt and running spark-submit --version, which prints the Spark version banner. A successful result confirms that the system finds the installation.
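If a Spark command is not found, the usual cause is that the bin folder under SPARK_HOME is missing from PATH or is spelled differently there. The following sketch compares the two using Windows path rules (case-insensitive, trailing backslashes ignored). The function name spark_bin_on_path is hypothetical, and the sketch assumes that PATH entries are separated by semicolons and already have their variables expanded.

```python
import ntpath
import os


def _normalize(path):
    """Normalize a path using Windows rules so that equal folders compare equal."""
    return ntpath.normcase(ntpath.normpath(path.strip()))


def spark_bin_on_path(env=None):
    """True if the bin folder under SPARK_HOME is one of the PATH entries."""
    env = os.environ if env is None else env
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return False
    wanted = _normalize(ntpath.join(spark_home, "bin"))
    entries = env.get("PATH", "").split(";")
    return any(_normalize(entry) == wanted for entry in entries if entry.strip())


if __name__ == "__main__":
    print("SPARK_HOME bin folder on PATH:", spark_bin_on_path())
```

As with the other checks, run it from a newly opened command prompt so that recent changes to the environment variables are visible.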
Running Spark in Local Mode
With the environment variables set, launch the Spark shell to test the installation. Run spark-shell to start an interactive Scala session. Watch the console output for warnings or errors related to Hadoop libraries. A successful startup prints the address of the web UI (by default http://localhost:4040), reports that the Spark context is available as sc, and ends at a scala> prompt.
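On Windows, Hadoop-related messages at startup frequently concern the native helper winutils.exe, which the Hadoop libraries look for in the bin folder under HADOOP_HOME. The sketch below only checks whether that file is present; the function name winutils_status is hypothetical, and whether you need the helper at all depends on what you do with Spark, so treat the result as a diagnostic hint rather than a requirement.

```python
import os
from pathlib import Path


def winutils_status(env=None):
    """Describe whether winutils.exe is present under the HADOOP_HOME bin folder."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        return "HADOOP_HOME is not set"
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    if not winutils.is_file():
        return f"winutils.exe not found at {winutils}"
    return "ok"


if __name__ == "__main__":
    print(winutils_status())
```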
Run a simple command such as sc.parallelize(1 to 100).sum() to verify that computations work correctly; it should return 5050.0. This test confirms that Spark can execute tasks locally without connecting to a cluster. If the shell starts and returns the expected result, the installation on Windows is complete and functional.
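To vary the test with a different range, the value to expect follows from the closed form n(n + 1) / 2 for the sum of the integers from 1 to n. The short sketch below computes it; the function name expected_sum is hypothetical. Note that the Spark shell reports the result of sum() as a Double, so the integer computed here appears there with a trailing .0.

```python
def expected_sum(n):
    """Closed-form sum of the integers 1..n."""
    return n * (n + 1) // 2


if __name__ == "__main__":
    # Cross-check the closed form against a direct summation for n = 100.
    assert expected_sum(100) == sum(range(1, 101))
    print(expected_sum(100))
```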