Data Engineering

225 resources, 17 categories

Batch Processing (23 items)

AWS EMR

A web service that makes it easy to quickly and cost-effectively process vast amounts of data.

Batch Processing

Bistro

A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes data via column operations, as opposed to relying only on set operations as conventional approaches like MapReduce or SQL do.

Batch Processing

Data Mechanics

A cloud-based platform deployed on Kubernetes making Apache Spark more developer-friendly and cost-effective.

Batch Processing

Deep Spark

Connecting Apache Spark with different data stores. Deprecated.

Batch Processing

Delight

A free, cross-platform monitoring tool for Spark (an alternative to the Spark UI and Spark History Server).

Batch Processing

Drill

Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

Batch Processing

Giraph

An iterative graph processing system built for high scalability.

Batch Processing

GraphLab Create

A machine learning platform that enables data scientists and app developers to easily create intelligent apps at scale.

Batch Processing

H2O

Fast scalable machine learning API for smarter applications.

Batch Processing

Hadoop MapReduce

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Batch Processing

Hive

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Batch Processing

Hivemall

Scalable machine learning library for Hive/Hadoop.

Batch Processing

Livy

The REST Spark Server.

Batch Processing

Mahout

An environment for quickly creating scalable, performant machine learning applications.

Batch Processing

Presto

A distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources.

Batch Processing

PyHive

Python interface to Hive and Presto.

Batch Processing

Spark

A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Batch Processing

Spark GraphX

Apache Spark's API for graphs and graph-parallel computation.

Batch Processing

Spark MLlib

Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as underlying optimization primitives.

Batch Processing

Spark Packages

A community index of packages for Apache Spark.

Batch Processing

Spark RDD API Examples

Examples by Zhen He.

Batch Processing

Substation

Substation is a cloud native data pipeline and transformation toolkit written in Go.

Batch Processing

Tez

An application framework which allows for a complex directed-acyclic-graph of tasks for processing data.

Batch Processing

Charts and Dashboards (13 items)

Apache Superset

A modern, enterprise-ready business intelligence web application.

Charts and Dashboards

C3.js

D3-based reusable chart library.

Charts and Dashboards

D3.js

A JavaScript library for manipulating documents based on data.

Charts and Dashboards

D3Plus

D3's simpler, easier-to-use cousin. Mostly predefined templates that you can just plug data into.

Charts and Dashboards

Highcharts

A charting library written in pure JavaScript, offering an easy way of adding interactive charts to your web site or web application.

Charts and Dashboards

Metabase

Metabase is the easy, open source way for everyone in your company to ask questions and learn from data.

Charts and Dashboards

Plotly

Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.

Charts and Dashboards

PyQtGraph

PyQtGraph is a pure-python graphics and GUI library built on PyQt4 / PySide and numpy. It is intended for use in mathematics / scientific / engineering applications.

Charts and Dashboards

PyXley

Python helpers for building dashboards using Flask and React.

Charts and Dashboards

Redash

Make Your Company Data Driven. Connect to any data source, easily visualize and share your data.

Charts and Dashboards

Seaborn

A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Charts and Dashboards

SmoothieCharts

A JavaScript Charting Library for Streaming Data.

Charts and Dashboards

ZingChart

Fast JavaScript charts for any data set.

Charts and Dashboards

Community (7 items)

/r/dataengineering

News, tips, and background on Data Engineering.

Community

/r/etl

Subreddit focused on ETL.

Community

Best Data Science Books

This blog offers a curated list of top data science books, categorized by topics and learning stages, to aid readers in building foundational knowledge and staying updated with industry trends.

Community

Data Council

Data Council is the first technical conference that bridges the gap between data scientists, data engineers and data analysts.

Community

Data Engineering Podcast

The show about modern data infrastructure.

Community

Snowflake Data Engineering

A practical introduction to data engineering on the Snowflake cloud data platform.

Community

The Data Stack Show

A show where they talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

Community

Data Comparison (1 item)

datacompy

DataComPy is a Python library that facilitates the comparison of two DataFrames in pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.

Data Comparison
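
To make the idea of row-level and column-level discrepancy reporting concrete, here is a minimal sketch in plain Python. It does not use DataComPy's API: the `compare_rows` helper, the rows-as-dicts representation, and the sample data are all hypothetical, chosen only for illustration.

```python
from typing import Any

Row = dict[str, Any]


def compare_rows(left: list[Row], right: list[Row], key: str) -> dict[str, Any]:
    """Compare two tables (lists of row dicts) joined on `key`.

    Hypothetical helper for illustration; not part of DataComPy.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Column level: which columns exist on only one side?
    left_cols = set().union(*(row.keys() for row in left))
    right_cols = set().union(*(row.keys() for row in right))
    shared_cols = (left_cols & right_cols) - {key}

    # Row level: which join keys appear on both sides?
    shared_keys = left_by_key.keys() & right_by_key.keys()

    # Cell level: for rows present on both sides, which shared columns differ?
    mismatches = [
        (k, col, left_by_key[k].get(col), right_by_key[k].get(col))
        for k in sorted(shared_keys)
        for col in sorted(shared_cols)
        if left_by_key[k].get(col) != right_by_key[k].get(col)
    ]

    return {
        "columns_only_in_left": sorted(left_cols - right_cols),
        "columns_only_in_right": sorted(right_cols - left_cols),
        "rows_only_in_left": sorted(left_by_key.keys() - right_by_key.keys()),
        "rows_only_in_right": sorted(right_by_key.keys() - left_by_key.keys()),
        "mismatches": mismatches,
    }


# Hypothetical sample data: two versions of the same small table.
old = [
    {"id": 1, "amount": 10, "region": "EU"},
    {"id": 2, "amount": 20, "region": "US"},
]
new = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
report = compare_rows(old, new, key="id")
print(report)
```

A plain equality check would only say that `old` and `new` differ. The report separates the reasons: rows missing on either side, columns missing on either side, and individual cell values that changed.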

Data Ingestion (31 items)

Airbyte

Open-source data integration for modern data teams.

Data Ingestion

Apache Pulsar

Apache Pulsar is an open-source distributed pub-sub messaging system.

Data Ingestion

Apache Sqoop

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Data Ingestion

Artie

Real-time data ingestion tool leveraging change data capture.

Data Ingestion

AWS Data Wrangler

Utility belt to handle data on AWS.

Data Ingestion

AWS Kinesis

A fully managed, cloud-based service for real-time data processing over large, distributed data streams.

Data Ingestion

BottledWater

Change data capture from PostgreSQL into Kafka. Deprecated.

Data Ingestion

CsvPath Framework

A delimited data preboarding framework that fills the gap between MFT and the data lake.

Data Ingestion

dlt

A fast and simple pipeline-building library for Python data developers; runs in notebooks, cloud functions, Airflow, etc.

Data Ingestion

Embulk

An open source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

Data Ingestion

Estuary Flow

No/low-code data pipeline platform that handles both batch and real-time data ingestion.

Data Ingestion

FluentD

An open source data collector for a unified logging layer.

Data Ingestion

Gobblin

Universal data ingestion framework for Hadoop from LinkedIn.

Data Ingestion

Google Sheets ETL

Live import all your Google Sheets to your data warehouse.

Data Ingestion

Heka

Data Acquisition and Processing Made Easy. Deprecated.

Data Ingestion

Kafka

Publish-subscribe messaging rethought as a distributed commit log.

Data Ingestion

kafka-docker

Kafka in Docker.

Data Ingestion

Kafka-logger

Kafka-winston logger for Node.js from Uber.

Data Ingestion

kafka-manager

A tool for managing Apache Kafka.

Data Ingestion

kafka-node

Node.js client for Apache Kafka 0.8.

Data Ingestion

kafkacat

Generic command line non-JVM Apache Kafka producer and consumer.

Data Ingestion

kafkat

Simplified command-line administration for Kafka brokers.

Data Ingestion

librdkafka

The Apache Kafka C/C++ library.

Data Ingestion

Meltano

CLI & code-first ELT.

Data Ingestion

Nakadi

Nakadi is an open source event messaging platform that provides a REST API on top of Kafka-like queues.

Data Ingestion

pg-kafka

A PostgreSQL extension to produce messages to Apache Kafka.

Data Ingestion

Pravega

Pravega provides a new storage abstraction - a stream - for continuous and unbounded data.

Data Ingestion

RabbitMQ

Robust messaging for applications.

Data Ingestion

Secor

Pinterest's Kafka to S3 distributed consumer.

Data Ingestion

Singer SDK

The fastest way to build custom data extractors and loaders compliant with the Singer Spec.

Data Ingestion

Sling

Sling is a CLI data integration tool specialized in moving data between databases and storage systems.

Data Ingestion

Data Lake Management (4 items)

Gravitino

Gravitino provides open-source, unified metadata management for data lakes, data warehouses, and external catalogs.

Data Lake Management

Ilum

Ilum is a modular Data Lakehouse platform that simplifies the management and monitoring of Apache Spark clusters across Kubernetes and Hadoop environments.

Data Lake Management

lakeFS

lakeFS is an open source platform that delivers resilience and manageability to object-storage based data lakes.

Data Lake Management

Project Nessie

Project Nessie is a Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.

Data Lake Management

Databases (58 items)

Akumuli

Akumuli is a numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".

Databases

Amazon RDS

Amazon RDS makes it easy to set up, operate, and scale a relational database in the cloud.

Databases

Apache Geode

An open source, distributed, in-memory database for scale-out applications.

Databases

ArangoDB

A distributed free and open-source database with a flexible data model for documents, graphs, and key-values.

Databases

AWS DynamoDB

A fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.

Databases

AWS Redshift

A fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.

Databases

Blueflood

A distributed system designed to ingest and process time series data.

Databases

Cassandra

The right choice when you need scalability and high availability without compromising performance.

Databases

Cassandra Calculator

This simple form allows you to try out different values for your Apache Cassandra cluster and see what the impact is for your application.

Databases

cayley

An open-source graph database. Google.

Databases

CCM

A script to easily create and destroy an Apache Cassandra cluster on localhost.

Databases

ClickHouse

Distributed columnar DBMS for OLAP. SQL.

Databases

Couchbase

The highest performing NoSQL distributed database.

Databases

Crate.IO

Scalable SQL database with the NoSQL goodies.

Databases

Dalmatiner DB

Fast distributed metrics database.

Databases

Datomic

The fully transactional, cloud-ready, distributed database.

Databases

Druid

Column oriented distributed data store ideal for powering interactive applications.

Databases

DuckDB

DuckDB is a fast in-process analytical database that has zero external dependencies, runs on Linux/macOS/Windows, offers a rich SQL dialect, and is free and extensible.

Databases

Elasticsearch

Search & Analyze Data in Real Time.

Databases

FiloDB

Distributed. Columnar. Versioned. Streaming. SQL.

Databases

FlockDB

A distributed, fault-tolerant graph database by Twitter. Deprecated.

Databases

Gaffer

A large-scale graph database.

Databases

Greenplum

The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.

Databases

HBase

The Hadoop database, a distributed, scalable, big data store.

Databases

Heroic

A scalable time series database based on Cassandra and Elasticsearch, by Spotify.

Databases

HyperDex

HyperDex is a scalable, searchable key-value store. Deprecated.

Databases

InfluxDB

Scalable datastore for metrics, events, and real-time analytics.

Databases

IonDB

A key-value store for microcontroller and IoT applications.

Databases

kairosdb

Fast scalable time series database.

Databases

Kyoto Tycoon

Kyoto Tycoon is a lightweight network server on top of the Kyoto Cabinet key-value database, built for high-performance and concurrency.

Databases

MariaDB

An enhanced, drop-in replacement for MySQL.

Databases

MemDB

Distributed Transactional In-Memory Database (based on MongoDB).

Databases

MongoDB

An open-source, document database designed for ease of development and scaling.

Databases

MySQL

The world's most popular open source database.

Databases

mysql_utils

Pinterest MySQL Management Tools.

Databases

Neo4j

The world's leading graph database.

Databases

OpenTSDB

A scalable, distributed Time Series Database.

Databases

OrientDB

2nd Generation Distributed Graph Database with the flexibility of Documents in one product with an Open Source commercial friendly license.

Databases

Percona Server for MongoDB

Percona Server for MongoDB® is a free, enhanced, fully compatible, open source, drop-in replacement for the MongoDB® Community Edition that includes enterprise-grade features and functionality.

Databases

Percona XtraBackup

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®.

Databases

PostgreSQL

The world's most advanced open source database.

Databases

QuestDB

A relational column-oriented database designed for real-time analytics on time series and event data.

Databases

RavenDB

Fully Transactional NoSQL Document Database.

Databases

Redis

An open source, BSD licensed, advanced key-value cache and store.

Databases

RethinkDB

The open-source database for the realtime web.

Databases

Rhombus

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

Databases

Riak

A distributed database designed to deliver maximum data availability by distributing data across multiple servers.

Databases

Riak-TS

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

Databases

RQLite

Replicated SQLite using the Raft consensus protocol.

Databases

ScyllaDB

NoSQL data store using the Seastar framework, compatible with Apache Cassandra.

Databases

SnappyData

SnappyData: OLTP + OLAP Database built on Apache Spark.

Databases

SSDB

A high performance NoSQL database supporting many data structures, an alternative to Redis.

Databases

Tarantool

Tarantool is an in-memory database and application server.

Databases

TiDB

TiDB is a distributed NewSQL database compatible with MySQL protocol.

Databases

Timely

Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

Databases

TimescaleDB

Built as an extension on top of PostgreSQL, TimescaleDB is a time-series SQL database providing fast analytics, scalability, with automated data management on a proven storage engine.

Databases

Titan

A scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster.

Databases

Vertica

Distributed, MPP columnar database with extensive analytics SQL.

Databases

Datasets (6 items)

Common Crawl

Open source repository of web crawl data.

Datasets

Eventsim

Event data simulator. Generates a stream of pseudo-random events from a set of users, designed to simulate web traffic.

Datasets

GitHub Archive

GitHub's public timeline since 2011, updated every hour.

Datasets

Reddit

Real-time data, including comments, submissions, and links posted to Reddit.

Datasets

Twitter Realtime

The Streaming APIs give developers low latency access to Twitter's global stream of Tweet data.

Datasets

Wikipedia

Wikipedia's complete copy of all wikis, in the form of Wikitext source and metadata embedded in XML. A number of raw database tables in SQL form are also available.

Datasets

Docker (11 items)

cAdvisor

Analyzes resource usage and performance characteristics of running containers.

Docker

Flocker

Easily manage Docker containers & their data.

Docker

Gockerize

Package Go services into minimal Docker containers.

Docker

ImageLayers

Visualize Docker images and the layers that compose them.

Docker

Kontena

Application Containers for Masses.

Docker

Micro S3 persistence

Docker microservice for saving/restoring volume data to S3.

Docker

Nomad

Nomad is a cluster manager, designed for both long-lived services and short-lived batch processing workloads.

Docker

Rancher

RancherOS is a 20 MB Linux distro that runs the entire OS as Docker containers.

Docker

Rocker-compose

Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.

Docker

Weave

Weaving Docker containers into applications.

Docker

Zodiac

A lightweight tool for easy deployment and rollback of dockerized applications.

Docker

ELK Elastic Logstash Kibana (3 items)

docker-logstash

A highly configurable Logstash (1.4.4) Docker image running Elasticsearch (1.7.0) and Kibana (3.1.2).

ELK Elastic Logstash Kibana

elasticsearch-jdbc

JDBC importer for Elasticsearch.

ELK Elastic Logstash Kibana

ZomboDB

Postgres Extension that allows creating an index backed by Elasticsearch.

ELK Elastic Logstash Kibana

File System (14 items)

Alluxio

Alluxio is a memory-centric distributed storage system enabling reliable data sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.

File System

AWS S3

Object storage built to retrieve any amount of data from anywhere.

File System

CEPH

Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability.

File System

GlusterFS

Gluster Filesystem.

File System

HDFS

A distributed file system designed to run on commodity hardware.

File System

JuiceFS

JuiceFS is a high-performance Cloud-Native file system driven by object storage for large-scale data storage.

File System

LizardFS

LizardFS Software Defined Storage is a distributed, parallel, scalable, fault-tolerant, Geo-Redundant and highly available file system.

File System

OrangeFS

Orange File System is a branch of the Parallel Virtual File System.

File System

S3QL

S3QL is a file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.

File System

SeaweedFS

SeaweedFS is a simple and highly scalable distributed file system. It has two objectives: to store billions of files and to serve them fast. Instead of supporting full POSIX file system semantics, SeaweedFS chooses to implement only a key-to-file mapping. By analogy with "NoSQL", you can call it "NoFS".

File System

smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2).

File System

SnackFS

SnackFS is our bite-sized, lightweight HDFS compatible file system built over Cassandra.

File System

Snakebite

A pure python HDFS client.

File System

XtreemFS

Fault-tolerant distributed file system for all storage needs.

File System

Monitoring (2 items)

HAProxy Exporter

Simple server that scrapes HAProxy stats and exports them via HTTP for Prometheus consumption.

Monitoring

Prometheus.io

An open-source service monitoring system and time series database.

Monitoring

Profiling (1 item)

Data Profiler

The DataProfiler is a Python library designed to make data analysis, monitoring, and sensitive data detection easy.

Profiling

Serialization format (9 items)

Apache Avro

Apache Avro™ is a data serialization system.

Serialization format

Apache ORC

The smallest, fastest columnar storage for Hadoop workloads.

Serialization format

Apache Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Serialization format

Apache Thrift

The Apache Thrift software framework, for scalable cross-language services development.

Serialization format

Kryo

Kryo is a fast and efficient object graph serialization framework for Java.

Serialization format

PigZ

A parallel implementation of gzip for modern multi-processor, multi-core machines.

Serialization format

ProtoBuf

Protocol Buffers - Google's data interchange format.

Serialization format

SequenceFile

SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.

Serialization format

Snappy

A fast compressor/decompressor. Used with Parquet.

Serialization format

Stream Processing (17 items)

Apache Beam

Apache Beam is a unified programming model that implements both batch and streaming data processing jobs that run on many execution engines.

Stream Processing

Apache Flink

Apache Flink is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.

Stream Processing

Apache Hudi

An open source framework for managing storage for real-time processing; one of its most interesting features is upsert support.

Stream Processing

Apache NiFi

An easy to use, powerful, and reliable system to process and distribute data.

Stream Processing

Apache Samza

Apache Samza is a distributed stream processing framework.

Stream Processing

Apache Storm

Apache Storm is a free and open source distributed realtime computation system.

Stream Processing

Bonobo

Bonobo is a data-processing toolkit for Python 3.5+.

Stream Processing

CocoIndex

An open source ETL framework to build fresh indexes for AI.

Stream Processing

HStreamDB

The streaming database built for IoT data storage and real-time processing.

Stream Processing

Kuiper

Lightweight IoT data analytics and streaming software for the edge, implemented in Go; it can run on all kinds of resource-constrained edge devices.

Stream Processing

PipelineDB

The Streaming SQL Database.

Stream Processing

Robinhood's Faust

Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.

Stream Processing

Spark Streaming

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

Stream Processing

Spring Cloud Dataflow

Streaming and tasks execution between Spring Boot apps.

Stream Processing

SwimOS

A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.

Stream Processing

VoltDB

VoltDB is an ACID-compliant RDBMS that uses a shared-nothing architecture.

Stream Processing

Zilla

An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.

Stream Processing

Testing (4 items)

DataKitchen

Open Source Data Observability for end-to-end Data Journey Observability, data profiling, anomaly detection, and auto-created data quality validation tests.

Testing

DQOps

An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.

Testing

Grai

A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.

Testing

RunSQL

Free online SQL playground for MySQL, PostgreSQL, and SQL Server. Create database structures, run queries, and share results instantly.

Testing

Workflow (21 items)

Airflow

Airflow is a system to programmatically author, schedule, and monitor data pipelines.

Workflow

Azkaban

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

Workflow

Cascading

Java based application development platform.

Workflow

Census

A reverse-ETL tool that lets you sync data from your cloud data warehouse to SaaS applications like Salesforce, Marketo, HubSpot, Zendesk, etc. No engineering favors required, just SQL.

Workflow

CronQ

A cron-like system for applications. Used with Luigi. Deprecated.

Workflow

Dagster

Dagster is an open-source Python library for building data applications.

Workflow

Dataform

An open-source framework and web based IDE to manage datasets and their dependencies. SQLX extends your existing SQL warehouse dialect to add features that support dependency management, testing, documentation and more.

Workflow

dbt

A command line tool that enables data analysts and engineers to transform data in their warehouses more effectively.

Workflow

Hamilton

Hamilton is a lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.

Workflow

Kedro

Kedro is a framework that makes it easy to build robust and scalable data pipelines by providing uniform project templates, data abstraction, configuration and pipeline assembly.

Workflow

Kestra

Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.

Workflow

Kestra

A versatile open source orchestrator and scheduler built on Java, designed to handle a broad range of workflows with a language-agnostic, API-first architecture.

Workflow

Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs.

Workflow

Mage

Open-source data pipeline tool for transforming and integrating data.

Workflow

Multiwoven

The open-source reverse ETL, data activation platform for modern data teams.

Workflow

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Workflow

PACE

An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).

Workflow

Pinball

DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.

Workflow

Prefect

Prefect is an orchestration and observability platform. With it, developers can rapidly build and scale resilient code, and triage disruptions effortlessly.

Workflow

RudderStack

A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.

Workflow

SuprSend

Create automated workflows and logic using APIs for your notification service. Add templates, batching, preferences, and an in-app inbox, with workflows to trigger notifications directly from your data warehouse.

Workflow
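
Several entries in this category (Azkaban, Hamilton, Pinball) describe workflows as jobs ordered by their dependencies, that is, as a directed acyclic graph (DAG). The sketch below shows that idea using only Python's standard-library `graphlib` module (Python 3.9+). It is not the API of any tool listed here; the `execution_order` helper and the job names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return job names ordered so each job comes after its dependencies.

    `dependencies` maps a job name to the set of jobs it depends on.
    Raises ValueError if the dependencies contain a cycle (not a DAG).
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"workflow is not a DAG: {exc.args[1]}") from exc


# Hypothetical three-step pipeline: extract -> transform -> load.
pipeline = {
    "load": {"transform"},
    "transform": {"extract"},
    "extract": set(),
}
order = execution_order(pipeline)
print(order)
```

A real scheduler adds retries, scheduling, and parallel execution on top of this ordering step, but the core contract is the same: a job runs only after everything it depends on has finished.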