A web service that makes it easy to quickly and cost-effectively process vast amounts of data.
A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes data via column operations, as opposed to having only set operations as in conventional approaches like MapReduce or SQL.
A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.
Connecting Apache Spark with different data stores. Deprecated.
A free & cross platform monitoring tool (Spark UI / Spark History Server alternative).
Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
An iterative graph processing system built for high scalability.
A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.
Fast scalable machine learning API for smarter applications.
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Scalable machine learning library for Hive/Hadoop.
The REST Spark Server.
An environment for quickly creating scalable performant machine learning applications.
A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.
Python interface to Hive and Presto.
A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Apache Spark's API for graphs and graph-parallel computation.
Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.
A community index of packages for Apache Spark.
Examples by Zhen He.
Substation is a cloud native data pipeline and transformation toolkit written in Go.
An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.
Apache Superset (incubating) - A modern, enterprise-ready business intelligence web application.
D3-based reusable chart library.
A JavaScript library for manipulating documents based on data.
D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.
A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.
Metabase is the easy, open source way for everyone in your company to ask questions and learn from data.
Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
PyQtGraph is a pure-Python graphics and GUI library built on PyQt4 / PySide and NumPy. It is intended for use in mathematics, scientific, and engineering applications.
Python helpers for building dashboards using Flask and React.
Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.
A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
A JavaScript Charting Library for Streaming Data.
Fast JavaScript charts for any data set.
News, tips, and background on Data Engineering.
Subreddit focused on ETL.
This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.
Data Council is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.
The show about modern data infrastructure.
A practical introduction to data engineering on the Snowflake cloud data platform.
A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
Open-source data integration for modern data teams.
Apache Pulsar is an open-source distributed pub-sub messaging system.
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Real-time data ingestion tool leveraging change data capture.
Utility belt to handle data on AWS.
A fully managed, cloud-based service for real-time data processing over large, distributed data streams.
Change data capture from PostgreSQL into Kafka. Deprecated.
A delimited data preboarding framework that fills the gap between MFT and the data lake.
A fast and simple pipeline-building library for Python data developers that runs in notebooks, cloud functions, Airflow, etc.
An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
No/low-code data pipeline platform that handles both batch and real-time data ingestion.
An open source data collector for unified logging layer.
Universal data ingestion framework for Hadoop from LinkedIn.
Live import all your Google Sheets to your data warehouse.
Data Acquisition and Processing Made Easy. Deprecated.
Publish-subscribe messaging rethought as a distributed commit log.
Kafka in Docker.
Kafka-winston logger for Node.js from Uber.
A tool for managing Apache Kafka.
Node.js client for Apache Kafka 0.8.
Generic command line non-JVM Apache Kafka producer and consumer.
Simplified command-line administration for Kafka brokers.
The Apache Kafka C/C++ library.
CLI & code-first ELT.
Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.
A PostgreSQL extension to produce messages to Apache Kafka.
Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.
Robust messaging for applications.
Pinterest's Kafka to S3 distributed consumer.
The fastest way to build custom data extractors and loaders compliant with the Singer Spec.
Sling is a CLI data integration tool specialized in moving data between databases, as well as storage systems.
Gravitino is an open-source system for unified metadata management across data lakes, data warehouses, and external catalogs.
Ilum is a modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.
lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes.
Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.
An open source, distributed, in-memory database for scale-out applications.
A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.
A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.
A distributed system designed to ingest and process time series data.
The right choice when you need scalability and high availability without compromising performance.
This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.
An open-source graph database. Google.
A script to easily create and destroy an Apache Cassandra cluster on localhost.
Distributed columnar DBMS for OLAP. SQL.
The highest performing NoSQL distributed database.
Scalable SQL database with the NoSQL goodies.
Fast distributed metrics database.
The fully transactional, cloud-ready, distributed database.
Column oriented distributed data store ideal for powering interactive applications.
DuckDB is a fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible.
Search & Analyze Data in Real Time.
Distributed. Columnar. Versioned. Streaming. SQL.
A distributed, fault-tolerant graph database by Twitter. Deprecated.
A large-scale graph database.
The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
The Hadoop database, a distributed, scalable, big data store.
A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
HyperDex is a scalable, searchable key-value store. Deprecated.
Scalable datastore for metrics, events, and real-time analytics.
A key-value store for microcontroller and IoT applications.
Fast scalable time series database.
Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency.
An enhanced, drop-in replacement for MySQL.
Distributed Transactional In-Memory Database (based on MongoDB).
An open-source, document database designed for ease of development and scaling.
The world's most popular open source database.
Pinterest MySQL Management Tools.
The world's leading graph database.
A scalable, distributed Time Series Database.
2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.
Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.
Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.
The world's most advanced open source database.
A relational column-oriented database designed for real-time analytics on time series and event data.
Fully Transactional NoSQL Document Database.
An open source, BSD licensed, advanced key-value cache and store.
The open-source database for the realtime web.
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
A distributed database designed to deliver maximum data availability by distributing data across multiple servers.
Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
Replicated SQLite using the Raft consensus protocol.
NoSQL data store using the Seastar framework, compatible with Apache Cassandra.
SnappyData: OLTP + OLAP Database built on Apache Spark.
A high performance NoSQL database supporting many data structures, an alternative to Redis.
Tarantool is an in-memory database and application server.
TiDB is a distributed NewSQL database compatible with MySQL protocol.
Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics and scalability, with automated data management on a proven storage engine.
A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.
Distributed, MPP columnar database with extensive analytics SQL.
Open source repository of web crawl data.
Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.
GitHub's public timeline since 2011, updated every hour.
Real-time data is available, including comments, submissions and links posted to Reddit.
The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.
Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.
Analyzes resource usage and performance characteristics of running containers.
Easily manage Docker containers & their data.
Package Go services into minimal Docker containers.
Visualize Docker images and the layers that compose them.
Application Containers for Masses.
Docker microservice for saving/restoring volume data to S3.
Nomad is a cluster manager, designed for both long-lived services and short-lived batch processing workloads.
RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.
Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
Weaving Docker containers into applications.
A lightweight tool for easy deployment and rollback of dockerized applications.
A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).
JDBC importer for Elasticsearch.
Postgres Extension that allows creating an index backed by Elasticsearch.
Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
Object storage built to retrieve any amount of data from anywhere.
Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.
Gluster Filesystem.
A distributed file system designed to run on commodity hardware.
JuiceFS is a high-performance Cloud-Native file system driven by object storage for large-scale data storage.
LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.
Orange File System is a branch of the Parallel Virtual File System.
S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
Seaweed-FS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS implements only a key-to-file mapping. By analogy with "NoSQL", you can call it "NoFS".
Utils for streaming large files (S3, HDFS, gzip, bz2).
SnackFS is our bite-sized, lightweight HDFS compatible file system built over Cassandra.
A pure Python HDFS client.
Fault-tolerant distributed file system for all storage needs.
Apache Avro™ is a data serialization system.
The smallest, fastest columnar storage for Hadoop workloads.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
The Apache Thrift software framework, for scalable cross-language services development.
Kryo is a fast and efficient object graph serialization framework for Java.
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Protocol Buffers - Google's data interchange format.
SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
A fast compressor/decompressor. Used with Parquet.
Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.
Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
An open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.
An easy to use, powerful, and reliable system to process and distribute data.
Apache Samza is a distributed stream processing framework.
Apache Storm is a free and open source distributed realtime computation system.
Bonobo is a data-processing toolkit for Python 3.5+.
An open source ETL framework to build fresh index for AI.
The streaming database built for IoT data storage and real-time processing.
Lightweight IoT data analytics and streaming software for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
The Streaming SQL Database.
Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
Streaming and tasks execution between Spring Boot apps.
A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
VoltDB is an ACID-compliant RDBMS which uses a shared-nothing architecture.
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.
An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.
Airflow is a system to programmatically author, schedule, and monitor data pipelines.
Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
Java based application development platform.
A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.
A cron-like system for applications. Used with Luigi. Deprecated.
Dagster is an open-source Python library for building data applications.
An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.
A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.
Hamilton is a lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.
Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
A versatile open source orchestrator and scheduler built on Java, designed to handle a broad range of workflows with a language-agnostic, API-first architecture.
Luigi is a Python module that helps you build complex pipelines of batch jobs.
Open-source data pipeline tool for transforming and integrating data.
The open-source reverse ETL, data activation platform for modern data teams.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
Prefect is an orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.
A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox, with workflows that trigger notifications directly from your data warehouse.