Apache Spark

Unified engine for large-scale data processing.

78 resources, 2 categories

Packages (57 items)

.NET for Apache Spark

.NET bindings.

ADAM

Set of tools designed to analyse genomics data.

almond

A Scala kernel for Jupyter.

Apache Bahir

Collection of the streaming connectors excluded from Spark 2.0 (Akka, MQTT, Twitter, ZeroMQ).

Apache Beam

Unified data processing engine supporting both batch and streaming applications. Apache Spark is one of the supported execution environments.

Apache DataFu

A library of general-purpose functions and UDFs.

Apache Hudi

Upserts, deletes, and incremental processing on big data.

Apache Iceberg

Open table format for huge analytic datasets.

Apache Kyuubi

A distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.

Apache Sedona

Cluster computing system for processing large-scale spatial data.

Apache SystemML

Declarative machine learning framework on top of Spark.

Apache Toree

IPython protocol-based middleware for interactive applications.

Apache Zeppelin

Web-based notebook that enables interactive data analytics with pluggable backends, integrated plotting, and extensive Spark support out of the box.

Archives Unleashed Toolkit

Open-source toolkit for analyzing web archives.

BigDL

Distributed Deep Learning library.

chispa

PySpark test helpers with beautiful error messages.

Cromwell

Workflow management system with Spark backend.

Data Mechanics Delight

Cross-platform monitoring tool (Spark UI / Spark History Server replacement).

deequ

A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Delta Lake

Storage layer with ACID transactions.

Flintrock

A command-line tool for launching Spark clusters on EC2.

GraphFrames

DataFrame-based graph API.

Hail

Genetic analysis framework.

itachi

A library that brings useful functions from modern database management systems to Apache Spark.

Joblib Apache Spark Backend

`joblib` backend for running tasks on Spark clusters.

JPMML-Spark

PMML transformer library for Spark ML.

KeystoneML

Type-safe machine learning pipelines with RDDs.

Koalas

Pandas DataFrame API on top of Apache Spark.

Kotlin for Apache Spark

Kotlin API bindings and extensions.

lakeFS

Integration with the lakeFS atomic versioned storage layer.

Livy

REST server with extensive language support (Python, R, Scala), interactive sessions, and object sharing.

Mahout Spark Bindings

[status unknown] Linear algebra DSL and optimizer with R-like syntax.

Microsoft ML for Apache Spark

A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, Deep Learning, Cognitive Services, and Model Deployment.

MLeap

Execution engine and serialization format which supports deployment of `o.a.s.ml` models without dependency on `SparkSession`.

MLflow

Machine learning orchestration platform.

ModelDB

A system to manage machine learning models for `spark.ml` and `scikit-learn`.

Mongo-Spark

Official MongoDB connector.

neo4j-spark-connector

Bolt protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support.

Optimus

Data cleansing and exploration utilities that aim to simplify data cleaning.

Polynote

An IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. Originated at Netflix.

python-deequ

Python API for Deequ.

quinn

A native PySpark implementation of spark-daria.

Spark Cassandra Connector

Cassandra support, including a data source, an API, and arbitrary queries.

Spark XML

XML parser and writer.

spark-connect-csharp

C# bindings.

spark-connect-go

Golang bindings.

spark-connect-rs

Rust bindings.

spark-daria

A Scala library with essential Spark functions and extensions to make you more productive.

spark-fast-tests

A lightweight and fast testing framework.

spark-jobserver

Simple Spark-as-a-Service that supports object sharing using so-called named objects. JVM only.

spark-nlp

Natural language processing library built on top of Apache Spark ML.

spark-testing-base

Collection of base test classes.

sparkle

Haskell on Apache Spark.

Sparkling Water

H2O interoperability layer.

sparkly

Helpers & syntactic sugar for PySpark.

sparklyr

An alternative R backend, using `dplyr`.

sparkmagic

Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.

Resources (21 items)

Advanced Analytics with Spark

Useful collection of Spark processing patterns. Accompanying GitHub repository: sryza/aas.

AMP Camp

Periodic training event organized by the UC Berkeley AMPLab. A source of useful exercises and recorded workshops covering different tools from the Berkeley Data Analytics Stack.

Apache Spark User List

Together with the Apache Spark Developers List, mailing lists dedicated to usage questions and development topics, respectively.

apache/spark

Official Apache Spark Docker images.

Big Data Analysis with Scala and Spark (Coursera)

Scala-oriented introductory course. Part of the Functional Programming in Scala Specialization.

Crossdata

Data integration platform with extended DataSource API and multi-user environment.

Data Science and Engineering with Apache Spark ...

Series of five Python-oriented courses (Introduction to Apache Spark, Distributed Machine Learning with Apache Spark, Big Data Analysis with Apache Spark, Advanced Apache Spark for Data Science and Data Engineering, Advanced Distributed Machine Learning with Apache Spark) covering different aspects of software engineering and data science.

datamechanics/spark

An easy-to-set-up Docker image for Apache Spark from Data Mechanics.

jupyter/docker-stacks/pyspark-notebook

PySpark with Jupyter Notebook and Mesos client.

Large-Scale Intelligent Microservices

Microsoft paper that presents an Apache Spark-based micro-service orchestration framework that extends database operations to include web service primitives.

Learning Spark, 2nd Edition

Introduction to the Spark API, covering Spark 3.0. A good source of knowledge about basic concepts.

Mastering Apache Spark

Interesting compilation of notes by Jacek Laskowski. Focused on different aspects of Spark internals.

Oryx 2

Lambda architecture platform built on Apache Spark and Apache Kafka, specialized for real-time, large-scale machine learning.

Photon ML

A machine learning library supporting classical Generalized Mixed Model and Generalized Additive Mixed Effect Model.

PredictionIO

Machine Learning server for developers and data scientists to build and deploy predictive applications in a fraction of the time.

Resilient Distributed Datasets: A Fault-Toleran...

Paper introducing a core distributed memory abstraction.

sequenceiq/docker-spark

YARN images from SequenceIQ.

Spark in Action

New book in Manning's "in Action" family with 400+ pages. Starts gently, step by step, and covers a large number of topics. A free excerpt explains how to set up Eclipse for Spark application development and how to bootstrap a new application using the provided Maven Archetype. An accompanying GitHub repository is available.

Spark SQL: Relational Data Processing in Spark

Paper introducing the relational underpinnings, code generation, and the Catalyst optimizer.

Spark with Scala Gitter channel

"A place to discuss and ask questions about using Scala for Spark programming" started by @deanwampler.

Structured Streaming: A Declarative API for Rea...

Structured Streaming is a new high-level streaming API; it is declarative and based on automatically incrementalizing a static relational query.
