Introduction - Apache Spark with Python Tutorial

Apache Spark is an open-source cluster computing framework. Spark provides programmers with an application programming interface (API) centered on a data structure called the resilient distributed dataset (RDD): a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way.

Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionality, exposed through an API (for Java, Python, Scala, and R) centered on the RDD abstraction. We will focus on Python.
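
As a quick taste of the RDD API, here is a minimal PySpark sketch (assuming a local Spark installation; the app name and sample data are arbitrary):

```python
from pyspark import SparkContext

# Start a local Spark context; "local[*]" uses all available cores.
sc = SparkContext("local[*]", "rdd-demo")

# Distribute a small Python list across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations (filter, map) are lazy; the action (collect) triggers computation.
squares_of_evens = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(squares_of_evens.collect())  # [4, 16]

sc.stop()
```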

 

Spark SQL is a component on top of Spark Core that introduces a data abstraction called DataFrames, which provide support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate DataFrames in Python. It also provides SQL language support, with command-line interfaces and an ODBC/JDBC server.
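
To give a feel for both styles, here is a minimal sketch (the column names and sample rows are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a DataFrame from in-memory rows with columns "name" and "age".
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)], ["name", "age"])

# Manipulate it through the DataFrame DSL...
people.filter(people.age > 30).select("name").show()

# ...or through plain SQL against a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```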

 

Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches. This design enables the same application code written for batch analytics to be reused for streaming analytics on a single engine.
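
As an illustration, here is a minimal word-count sketch using the classic DStream API (the host, port, and 1-second batch interval are placeholder choices; this assumes a text source such as `nc -lk 9999` on localhost):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Streaming needs at least two threads: one to receive data, one to process it.
sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second mini-batches

# Each mini-batch of lines becomes an RDD; the usual transformations apply.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```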

 

Spark MLlib is a distributed machine learning framework. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine learning pipelines.
Included are the following (a short example appears after the list):

* Summary statistics, correlations, stratified sampling, hypothesis testing, and random data generation

* Classification and regression: support vector machines, logistic regression, linear regression, decision trees, and naive Bayes classification

* Collaborative filtering techniques, including alternating least squares (ALS)

* Cluster analysis methods, including k-means and Latent Dirichlet Allocation (LDA)

* Dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)

* Feature extraction and transformation functions

* Optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
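
As one example, here is a minimal sketch of k-means clustering with the DataFrame-based `pyspark.ml` API (the toy points and the choice of k=2 are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Two obvious clusters of 2-D points, in a "features" column as MLlib expects.
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

# Fit k-means and inspect the learned cluster centers.
model = KMeans(k=2, seed=1).fit(data)
print(model.clusterCenters())
```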

 

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model the Pregel abstraction (a sequence of supersteps), along with an optimized runtime for this abstraction.
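
Note that GraphX itself exposes Scala and Java APIs; from Python, graph computation on Spark is commonly done through the separate GraphFrames package. A minimal sketch, assuming GraphFrames is installed and using toy vertices and edges:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # separate package, not bundled with Spark

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()                                   # in-degree of each vertex
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()
```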

 

If you have a strong Mathematics/Statistics background, you will probably be a great fit for Apache Spark!

Alan Levin

Senior Technical Project Manager