Setting up pyspark

Spark is a general-purpose cluster computing system.

Here are some of my notes on setting it up:

Download a Spark distribution with a package type of ‘pre-built for Hadoop 2.6 or later’, even if you don’t have Hadoop installed.
Set the environment variable SPARK_HOME to the unpacked folder, for example

E:\apps\spark\spark-1.4.1-bin-hadoop2.6

Set the environment variable PYSPARK_SUBMIT_ARGS to the following, which runs Spark locally with two worker threads:

--master local[2]

Download a Hadoop binary distribution.
Set HADOOP_HOME to the unzipped folder.
Download the contents of this repo and add them to the Hadoop distribution folder (don’t replace existing files).
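
Before moving on, it can help to confirm that Python actually sees these variables. The following is a minimal sketch and not part of the original setup. The function name check_spark_env is made up for illustration, and it only checks that the variables are set and that the two home folders exist.

```python
import os


def check_spark_env(environ=None):
    """Return a list of problems found with the variables set in the steps above."""
    environ = os.environ if environ is None else environ
    problems = []
    # SPARK_HOME and HADOOP_HOME should both point to existing folders
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = environ.get(name)
        if not value:
            problems.append(name + " is not set")
        elif not os.path.isdir(value):
            problems.append(name + " does not point to a folder: " + value)
    # PYSPARK_SUBMIT_ARGS should carry the --master option
    if "--master" not in environ.get("PYSPARK_SUBMIT_ARGS", ""):
        problems.append("PYSPARK_SUBMIT_ARGS has no --master option")
    return problems


if __name__ == "__main__":
    for problem in check_spark_env():
        print(problem)
```

If nothing is printed, the three variables look usable. On Windows, a newly set environment variable is only visible in consoles opened after the change.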

Create an IPython profile:

ipython profile create pyspark

Then create or edit the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following content:

# Configure the necessary Spark environment
import os
import sys

# Spark home; fail early with a clear message if SPARK_HOME is not set
spark_home = os.environ.get('SPARK_HOME')
if not spark_home:
    raise ValueError("The SPARK_HOME environment variable is not set")

# If Spark V1.4.x is detected, then add ' pyspark-shell' to
# the end of the 'PYSPARK_SUBMIT_ARGS' environment variable
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file):
    # Read the RELEASE file and close it again
    with open(spark_release_file) as release_file:
        release_info = release_file.read()
    if "Spark 1.4" in release_info:
        pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        if "pyspark-shell" not in pyspark_submit_args:
            pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the spark python sub-directory to the path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
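# (execfile only exists in Python 2; this form runs on Python 2 and 3)
shell_file = os.path.join(spark_home, "python/pyspark/shell.py")
with open(shell_file) as f:
    exec(compile(f.read(), shell_file, "exec"))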
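
The hard-coded py4j version is the part most likely to break with a different Spark release. As a possible alternative, the sketch below looks the zip up with a glob pattern instead. find_py4j_zip is a hypothetical helper, and it assumes the file keeps the py4j-<version>-src.zip naming under python/lib, as in Spark 1.4.1.

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    # Sort so the result is deterministic if several zips are present
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None
```

In the startup file this could stand in for the hard-coded line, for example sys.path.insert(0, find_py4j_zip(spark_home)), after checking that the result is not None.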

This link helped me