Notes on setting up pyspark and jupyter notebook.
Spark is a general-purpose cluster computing system.
In a previous post, I showed how to set up Spark with IPython.
The IPython notebook has since gone through some changes, such that the previous setup didn't work for me.
Here are my notes for getting it to work, assuming a clean install.
Download a version of Spark with a package type of "Pre-built for Hadoop 2.6 and later":
wget http://apache.arvixe.com/spark/spark-1.5.0/spark-1.5.0-bin-hadoop2.6.tgz
Untar the archive:
tar xvf spark-1.5.0-bin-hadoop2.6.tgz
Set the SPARK_HOME environment variable and update the Python path:
nano ~/.bashrc
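A sketch of the lines to add to ~/.bashrc, assuming Spark was unpacked into the home directory (adjust the path, and check the exact py4j version under $SPARK_HOME/python/lib on your machine):

```shell
# Assumption: Spark 1.5.0 was untarred into the home directory
export SPARK_HOME=~/spark-1.5.0-bin-hadoop2.6
# Make the pyspark modules importable from a plain Python/IPython session;
# the py4j zip name may differ -- check $SPARK_HOME/python/lib
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
```

Run `source ~/.bashrc` afterwards so the current shell picks up the variables.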
Test if things work.
There are two ways to start spark:
1a) Using Spark to connect to a local cluster
In the terminal, run:
IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark
The resulting IPython notebook will have the Spark context initialized to localhost.
The Spark context is available as the variable sc.
1b) Using Spark to connect to a possibly remote cluster
Start the IPython notebook:
ipython notebook
Type the code below:
#import statements
2) Sample code
Here is sample code to check if a number is prime:
def isprime(n):
    """
    check if integer n is a prime
    """
    # make sure n is a positive integer
    n = abs(int(n))
    # 0 and 1 are not primes
    if n < 2:
        return False
    # 2 is the only even prime number
    if n == 2:
        return True
    # all other even numbers are not primes
    if not n & 1:
        return False
    # range starts with 3 and only needs to go up to the square root of n
    # for all odd numbers
    for x in range(3, int(n**0.5)+1, 2):
        if n % x == 0:
            return False
    return True

# Create an RDD of numbers from 0 to 1,000,000
nums = sc.parallelize(xrange(1000000))

# Compute the number of primes in the RDD
print nums.filter(isprime).count()
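As a cross-check, the same count can be computed in plain single-machine Python. It is much slower than the Spark version, but it needs no cluster; the function is repeated here so the snippet is self-contained:

```python
def isprime(n):
    """Same trial-division primality test as above."""
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if not n & 1:
        return False
    for x in range(3, int(n ** 0.5) + 1, 2):
        if n % x == 0:
            return False
    return True

# There are 78,498 primes below 1,000,000, so the Spark job
# above should report the same count.
print(sum(1 for n in range(1000000) if isprime(n)))
```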
To reduce logging verbosity, set the log level from within the notebook:

sc.setLogLevel('WARN')

Or change the default by editing Spark's log4j configuration:

cp $SPARK_HOME/conf/log4j.properties.template $SPARK_HOME/conf/log4j.properties
nano $SPARK_HOME/conf/log4j.properties

Change the line

log4j.rootCategory=INFO, console

to

log4j.rootCategory=WARN, console