Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Python and an optimized engine that supports general computation graphs for data analysis. It also supports Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. http://spark.apache.org/
- Apache Spark
- Anaconda (iPython, Scipy, Pandas, etc)
# Check that Spark is ready (after installing the Java SDK and unpacking Spark)
>>$ ./bin/pyspark
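# Inside the PySpark shell, a quick sanity check (a minimal sketch; `sc` is the
# SparkContext that the shell creates automatically):
rdd = sc.parallelize(range(1000))          # distribute a small dataset
rdd.filter(lambda x: x % 2 == 0).count()   # expect 500
sc.version                                 # confirm the Spark version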
# Launch iPython with Spark (Python 2.7)
>>$ IPYTHON_OPTS="notebook" ./bin/pyspark
# With Python 3
>>$ IPYTHON_OPTS="notebook" PYSPARK_PYTHON=python3 ./bin/pyspark
# Run Spark Cluster - Master. Notice the Cluster URL
# (spark://UserName.local:7077) that can be found at http://localhost:8080
>>$ ./sbin/start-master.sh
# Create & Register New Worker on Master Cluster
>>$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://UserName.local:7077
# Submit a job to the Cluster (using the example Python program bundled with Spark)
>>$ bin/spark-submit --master spark://UserName.local:7077 examples/src/main/python/pi.py
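# The bundled pi.py estimates pi by Monte Carlo sampling. A minimal sketch of
# the idea (not the exact bundled script):
from __future__ import print_function
import random
from pyspark import SparkContext

sc = SparkContext(appName="PythonPi")
n = 100000

def inside(_):
    # draw a random point in the unit square; count it if it lands in the circle
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1 else 0

count = sc.parallelize(range(n)).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
sc.stop()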
# Write and deploy your own Python program to the Spark Cluster
# Spark Packages required: http://spark-packages.org
>>$ bin/spark-submit --master spark://UserName.local:7077 --packages com.databricks:spark-csv_2.10:1.3.0 mySpark_files/uberstats.py mySpark_files/Uber-Jan-Feb-FOIL.csv
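# A standalone script such as mySpark_files/uberstats.py might look roughly
# like this. It loads the CSV with the spark-csv package and aggregates it;
# the column names ("dispatching_base_number", "trips") are assumptions about
# the Uber FOIL file, not taken from it.
import sys
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="UberStats")
sqlContext = SQLContext(sc)

# spark-csv is made available by the --packages flag on spark-submit
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load(sys.argv[1])   # CSV path passed on the command line

df.printSchema()
df.groupBy("dispatching_base_number").sum("trips").show()   # assumed columns
sc.stop()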
# Create AWS user credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) at
# https://console.aws.amazon.com/:
* Create a User Profile
* Create security credentials and download a .pem file
* Assign permissions to the User Profile
# Set Amazon EC2 environment variables
>>$ export AWS_SECRET_ACCESS_KEY=<Some Keys>
>>$ export AWS_ACCESS_KEY_ID=<Some Key ID>
# Create & Launch new Python Spark Cluster on Amazon EC2.
>>$ chmod 400 downloaded.pem
>>$ ./ec2/spark-ec2 --key-pair=spark_python --identity-file=downloaded.pem --zone=us-west-2a launch spark-cluster-name
# Run a Spark shell against the new Cluster, where
# ec2-54-198-139-10.compute-1.amazonaws.com is the public DNS of the MASTER
>>$ bin/spark-shell --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077
# Destroy the Cluster
>>$ ./ec2/spark-ec2 destroy spark-cluster-name
# Spark SQL with iPython on a CSV file. We pass in the "spark-csv" data source
# package, found and documented at http://spark-packages.org
>>$ IPYTHON_OPTS="notebook" ./bin/pyspark --packages com.databricks:spark-csv_2.11:1.4.0
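# Inside the notebook, the spark-csv data source can then be used roughly like
# this (a sketch; the path is the Uber file from above and "uber" is just a
# temp table name):
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("mySpark_files/Uber-Jan-Feb-FOIL.csv")
df.printSchema()
df.registerTempTable("uber")
sqlContext.sql("SELECT COUNT(*) FROM uber").show()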
# Launch PySpark with Python 3.4 in iPython and the CSV package
>>$ IPYTHON_OPTS="notebook" PYSPARK_PYTHON=python3 ./bin/pyspark --packages com.databricks:spark-csv_2.11:1.2.0
# Spark SQL on a structured RDBMS database. For this, we need a Java Database
# Connectivity (JDBC) driver on the path. This example uses MySQL; download the
# Connector/J driver from https://dev.mysql.com/downloads/connector/j/ and pass
# the jar to pyspark with --jars.
>>$ IPYTHON_OPTS="notebook" ./bin/pyspark --jars mysql-connector-java-5.1.38-bin.jar
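# In the notebook, a JDBC-backed DataFrame can then be created roughly like
# this (a sketch; the host, database, table, user, and password are placeholders):
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/mydb",
    driver="com.mysql.jdbc.Driver",
    dbtable="mytable",
    user="myuser",
    password="mypassword").load()
df.printSchema()
df.registerTempTable("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()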