Apache Spark

Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. It originated at Netflix.

Packages

python-deequ

Python API for Deequ.

Packages

quinn

A native PySpark implementation of spark-daria.

Packages

Spark Cassandra Connector

Cassandra support, including a data source, an API, and support for arbitrary queries.

Packages

Spark XML

XML parser and writer.

Packages

spark-connect-csharp

C# bindings.

Packages

spark-connect-go

Golang bindings.

Packages

spark-connect-rs

Rust bindings.

Packages

spark-daria

A Scala library with essential Spark functions and extensions to make you more productive.

Packages

spark-fast-tests

A lightweight and fast testing framework.

Packages

spark-jobserver

Simple Spark-as-a-Service that supports object sharing through so-called named objects. JVM only.

Packages

spark-nlp

Natural language processing library built on top of Apache Spark ML.

Packages

spark-testing-base

Collection of base test classes.

Packages

sparkle

Haskell on Apache Spark.

Packages

Sparkling Water

H2O interoperability layer.

Packages

sparkly

Helpers & syntactic sugar for PySpark.

Packages

sparklyr

An alternative R backend, using `dplyr`.

Packages

sparkmagic

Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.

Packages

Resources (21 items)

Advanced Analytics with Spark

Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.

Resources

AMP Camp

Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Resources

Apache Spark User List

The Apache Spark User List and the Apache Spark Developers List are mailing lists dedicated to usage questions and development topics, respectively.

Resources

apache/spark

Official Apache Spark Docker images.

Resources

Big Data Analysis with Scala and Spark (Coursera)

Scala-oriented introductory course. Part of the Functional Programming in Scala Specialization.

Resources

Crossdata

Data integration platform with extended DataSource API and multi-user environment.

Resources

Data Science and Engineering with Apache Spark ...

Series of five courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science. Python-oriented.

Resources

datamechanics/spark

An easy-to-set-up Docker image for Apache Spark from Data Mechanics.

Resources

jupyter/docker-stacks/pyspark-notebook

PySpark with Jupyter Notebook and Mesos client.

Resources

Large-Scale Intelligent Microservices

Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.

Resources

Learning Spark, 2nd Edition

Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts.

Resources

Mastering Apache Spark

Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.

Resources

Oryx 2

Lambda architecture platform built on Apache Spark and Apache Kafka, specialized for real-time, large-scale machine learning.

Resources

Photon ML

A machine learning library supporting classical Generalized Mixed Models and Generalized Additive Mixed Effect Models.

Resources

PredictionIO

Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

Resources

Resilient Distributed Datasets: A Fault-Toleran...

Paper introducing a core distributed memory abstraction.

Resources

sequenceiq/docker-spark

YARN images from SequenceIQ.

Resources

Spark in Action

New book in Manning's "in Action" family, with more than 400 pages. It starts gently, proceeds step by step, and covers a large number of topics. A free excerpt explains how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. An accompanying GitHub repository is available.

Resources

Spark SQL: Relational Data Processing in Spark

Paper introducing the relational underpinnings, code generation, and the Catalyst optimizer.

Resources

Spark with Scala Gitter channel

"A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.

Resources

Structured Streaming: A Declarative API for Rea...

Structured Streaming is a new high-level streaming API: a declarative API based on automatically incrementalizing a static relational query.

Resources
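The idea of "incrementalizing a static relational query" in the Structured Streaming entry can be illustrated with a small plain-Python sketch. This is not Spark code and makes no claim about how Spark implements it; the names `static_query` and `IncrementalQuery`, and the `(key, value)` record shape, are hypothetical and chosen only for illustration. The sketch shows one aggregation written twice: once over the complete input, and once as state that is updated per micro-batch and arrives at the same answer.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical record shape for this sketch: (key, value).
Event = Tuple[str, int]


def static_query(events: Iterable[Event]) -> Dict[str, int]:
    """Batch version: sum of values per key over the complete input."""
    totals: Counter = Counter()
    for key, value in events:
        totals[key] += value
    return dict(totals)


class IncrementalQuery:
    """The same aggregation, kept as running state and updated per micro-batch."""

    def __init__(self) -> None:
        self._state: Counter = Counter()

    def process_batch(self, batch: Iterable[Event]) -> Dict[str, int]:
        # Only the new records are touched; earlier input is never re-read.
        for key, value in batch:
            self._state[key] += value
        return dict(self._state)


if __name__ == "__main__":
    batches = [
        [("a", 1), ("b", 2)],
        [("a", 3)],
        [("b", 4), ("c", 5)],
    ]

    query = IncrementalQuery()
    result: Dict[str, int] = {}
    for batch in batches:
        result = query.process_batch(batch)

    # The incremental result matches the static query over all data seen so far.
    all_events = [event for batch in batches for event in batch]
    assert result == static_query(all_events)
    print(result)
```

In a declarative streaming API the user would write only the static form and leave the derivation of the incremental form to the engine; here both are written by hand to make the equivalence visible.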