Big Data

624 resources, 33 categories

Applications (35 items)

411

a web application for managing alerts resulting from scheduled searches in Elasticsearch.

Adobe spindle

Next-generation web analytics processing with Scala, Spark, and Parquet.

Apache Metron

a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.

Apache Nutch

open source web crawler.

Apache OODT

capturing, processing and sharing of data for NASA's scientific archives.

Apache Tika

content analysis toolkit.

Argus

Time series monitoring and alerting platform.

AthenaX

a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).

Atlas

a backend for managing dimensional time series data.

Comet

Comet provides an end-to-end model evaluation platform for AI developers, with best in class LLM evaluations, experiment tracking, and production monitoring.

Countly

open source mobile and web analytics platform, based on Node.js & MongoDB.

Domino

Run, scale, share, and deploy models — without any infrastructure.

Eclipse BIRT

Eclipse-based reporting system.

ElastAlert

ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.

Eventhub

open source event analytics platform.

HASH

open source simulation and visualization platform.

Hermes

asynchronous message broker built on top of Kafka.

Hunk

Splunk analytics for Hadoop.

Imhotep

Large scale analytics platform by Indeed.

Indicative

Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.

Jupyter

Notebook and project application for interactive data science and scientific computing across all programming languages.

Kapacitor

an open source framework for processing, monitoring, and alerting on time series data.

Kylin

open source Distributed Analytics Engine from eBay.

MADlib

data-processing library that runs inside an RDBMS to analyze data.

Opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

PivotalR

R on Pivotal HD / HAWQ and PostgreSQL.

Qubole

auto-scaling Hadoop cluster, built-in data connectors.

Rakam

open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.

SnappyData

a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.

Snowplow

enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.

SparkR

R frontend for Spark.

Splunk

analyzer for machine-generated data.

Substation

Substation is a cloud native data pipeline and transformation toolkit written in Go.

Sumo Logic

cloud based analyzer for machine-generated data.

Talend

unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Benchmarking (7 items)

Apache Hadoop Benchmarking

micro-benchmarks for testing Hadoop performance.

Berkeley SWIM Benchmark

real-world big data workload benchmark.

Deeplearning4j Benchmarks

Intel HiBench

a Hadoop benchmark suite.

PUMA Benchmarking

benchmark suite for MapReduce applications.

UCSB

extended Yahoo Cloud Serving Benchmark for NoSQL databases.

Yahoo Gridmix3

Hadoop cluster benchmarking from Yahoo's engineering team.

Books (35 items)

awesome

Even more lists.

awesome-analytics

Analytics.

awesome-awesome-awesome

WTF!

awesome-awesomeness

Other awesome lists.

awesome-community-detection

Community Detection.

awesome-decision-tree-papers

Decision Tree Papers.

awesome-fraud-detection-papers

Fraud Detection Papers.

awesome-gradient-boosting-papers

Gradient Boosting Papers.

awesome-graph-classification

Graph Classification.

awesome-kafka

Kafka.

awesome-monte-carlo-tree-search-papers

Monte Carlo Tree Search Papers.

awesome-network-embedding

Network Embedding.

awesome-public-datasets

Public Datasets.

Azure Data Engineering

A book about data engineering in general and the Azure platform specifically

Big Data

Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.

Data Science at Scale with Python and Dask

Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.

Designing Data Visualizations with Noah Iliinsky

Distributed Systems for fun and profit

Theory of distributed systems. Includes parts about time and ordering, replication, and impossibility results.

Fundamentals of Stream Processing: Application ...

This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.

Fusion in Action

Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.

Google Bigtable

Graph-Powered Machine Learning

By Alessandro Negro. Combines graph theory and models to improve machine learning projects.

Grokking Streaming Systems

Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.

Hans Rosling's 200 Countries, 200 Years, 4 Minutes

Ice Bucket Challenge Data Visualization

Kafka in Action

Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.

Kafka Streams in Action

Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.

list

Another list?

Reactive Data Handling

Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads. Free eBook.

Spark in Action

See also Spark in Action, 2nd Ed. Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.

Storm Applied

Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.

Stream Data Processing: A Quality of Service Pe...

Presents a new paradigm suitable for stream and complex event processing.

Streaming Data

Streaming Data introduces the concepts and requirements of streaming and real-time data systems.

The beauty of data visualization

Unified Log Processing

Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business

Business Intelligence (24 items)

BIME Analytics

business intelligence platform in the cloud.

Blazer

business intelligence made simple.

Chartio

lean business intelligence platform to visualize and explore your data.

Count

notebook-based analytics and visualisation platform using SQL or drag-and-drop.

datapine

self-service business intelligence tool in the cloud.

Dekart

Large scale geospatial analytics for Google BigQuery based on Kepler.gl.

GoodData

platform for data products and embedded analytics.

intermix.io

Performance Monitoring for Amazon Redshift

Jaspersoft

powerful business intelligence suite.

Jedox Palo

customisable Business Intelligence platform.

Jethrodata

Interactive Big Data Analytics.

Knowage

open source business intelligence platform (formerly SpagoBI).

Lightdash

The open source Looker alternative built on dbt

Metabase

The simplest, fastest way to get business intelligence and analytics to everyone in your company.

Microsoft

business intelligence software and platform.

Microstrategy

software platforms for business intelligence, mobile intelligence, and network applications.

Numeracy

Fast, clean SQL client and business intelligence.

Pentaho

business intelligence platform.

Qlik

business intelligence and analytics platform.

Redash

Open source business intelligence platform, supporting multiple data sources and planned queries.

Saiku Analytics

Open source analytics platform.

SparklineData SNAP

modern BI platform powered by Apache Spark.

Tableau

business intelligence platform.

Zoomdata

Big Data Analytics.

Columnar Databases (13 items)

Actian Vector

column-oriented analytic database.

Amazon Redshift

Amazon's cloud offering, also based on a columnar datastore backend.

ClickHouse

an open-source column-oriented database management system that allows generating analytical data reports in real time.

Columnar Storage

an explanation of what columnar storage is and when you might want it.

EventQL

a distributed, column-oriented database built for large-scale event collection and analytics.

Google BigQuery

Google's cloud offering backed by their pioneering work on Dremel.

IndexR

an open-source columnar storage format for fast, real-time analytics on big data.

LocustDB

an experimental analytics database aiming to set a new standard for query performance on commodity hardware.

MonetDB

column store database.

Parquet

columnar storage format for Hadoop.

Pivotal Greenplum

purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.

SQream DB

A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.

Vertica

designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Data Ingestion (30 items)

Alooma

data pipeline as a service enabling moving data sources such as MySQL into data warehouses.

Amazon Kinesis

real-time processing of streaming data at massive scale.

Amazon Web Services Glue

serverless fully managed extract, transform, and load (ETL) service

Apache Chukwa

data collection system.

Apache Flume

service to manage large amounts of log data.

Apache Kafka

distributed publish-subscribe messaging system.

Apache NiFi

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

Apache Pulsar

a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.

Apache Sqoop

tool to transfer data between Hadoop and a structured datastore.

Census

A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.

Embulk

open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

Facebook Scribe

streamed log data aggregator.

Fluentd

tool to collect events and logs.

Gazette

Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.

Google Photon

geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.

Heka

open source stream processing software system.

HIHO

framework for connecting disparate data sources with Hadoop.

Kestrel

distributed message queue system.

LinkedIn Databus

stream of change capture events for a database.

LinkedIn Gobblin

LinkedIn's universal data ingestion framework.

LinkedIn Kamikaze

utility package for compressing sorted integer arrays.

LinkedIn White Elephant

log aggregator and dashboard.

Logstash

a tool for managing events and logs.

Netflix Suro

log aggregator like Storm and Samza, based on Chukwa.

Pinterest Secor

a service implementing Kafka log persistence.

redpanda

A Kafka® replacement for mission critical systems; 10x faster. Written in C++.

RudderStack

an open source customer data infrastructure (Segment, mParticle alternative) written in Go.

Skizze

sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.

StreamSets Data Collector

continuous big data ingest infrastructure with a simple to use IDE.

Zilla

An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.

Data Visualization (50 items)

Airpal

Web UI for PrestoDB.

AnyChart

fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.

Arbor

graph visualization library using web workers and jQuery.

Banana

visualize logs and time-stamped data stored in Solr. Port of Kibana.

Bloomery

Web UI for Impala.

Bokeh

A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.

C3

D3-based reusable chart library

CartoDB

open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.

Chart.js

open source HTML5 Charts visualizations.

chartd

responsive, retina-compatible charts with just an img tag.

Chartist.js

another open source HTML5 Charts visualization.

Crossfilter

JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.

Cubism

JavaScript library for time series visualization.

Cytoscape

JavaScript library for visualizing complex networks.

D3

JavaScript library for manipulating documents.

D3.compose

Compose complex, data-driven visualizations from reusable charts and components.

D3Plus

A fairly robust set of reusable charts and styles for d3.js.

Dash

Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required

DataSphere Studio

one-stop data application development management portal.

DC.js

Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.

Dekart

Large scale geospatial analytics for Google BigQuery based on Kepler.gl.

DevExtreme React Chart

High-performance plugin-based React chart for Bootstrap and Material Design.

Echarts

Baidu's enterprise charts.

Envisionjs

dynamic HTML5 visualization.

FnordMetric

write SQL queries that return SVG charts rather than tables

Frappe Charts

GitHub-inspired simple and modern SVG charts for the web with zero dependencies.

Freeboard

open source real-time dashboard builder for IoT and other web mashups.

Gephi

An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.

Google Charts

simple charting API.

Grafana

graphite dashboard frontend, editor and graph composer.

Graphite

scalable Realtime Graphing.

Highcharts

simple and flexible charting API.

IPython

provides a rich architecture for interactive computing.

Kibana

visualize logs and time-stamped data

Lumify

open source big data analysis and visualization platform

Matplotlib

plotting with Python.

Metricsgraphic.js

a library built on top of D3 that is optimized for time-series data

NVD3

chart components for d3.js.

Peity

Progressive SVG bar, line and pie charts.

Plot.ly

Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.

Plotly.js

The open source JavaScript graphing library that powers Plotly.

ReCharts

A composable charting library built on React components

Recline

simple but powerful library for building data applications in pure JavaScript and HTML.

Redash

open-source platform to query and visualize data.

Shiny

a web application framework for R.

Sigma.js

JavaScript library dedicated to graph drawing.

Superset

a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.

Vega

a visualization grammar.

Zeppelin

a notebook-style collaborative data analysis tool.

Zing Charts

JavaScript charting library for big data.

Distributed Filesystem (18 items)

Alluxio

reliable file sharing at memory speed across cluster frameworks.

Ambry

a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.

Apache HDFS

a way to store large files across multiple machines.

Apache Kudu

Hadoop's storage layer to enable fast analytics on fast data.

Baidu File System

distributed filesystem.

BeeGFS

formerly FhGFS, parallel distributed file system.

Ceph Filesystem

software storage platform.

Disco DDFS

distributed filesystem.

Facebook Haystack

object storage system.

Google GFS

distributed filesystem.

Google Megastore

scalable, highly available storage.

GridGain

GGFS, Hadoop compliant in-memory file system.

Lustre file system

high-performance distributed filesystem.

Microsoft Azure Data Lake Store

HDFS-compatible storage in Azure cloud

Quantcast File System QFS

open-source distributed file system.

Red Hat GlusterFS

scale-out network-attached storage file system.

Seaweed-FS

simple and highly scalable distributed file system.

Tahoe-LAFS

decentralized cloud storage system.

Distributed Index (1 item)

Pilosa

Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

Distributed Programming (54 items)

AddThis Hydra

distributed data processing and storage system originally developed at AddThis.

AMPLab SIMR

run Spark on Hadoop MapReduce v1.

Apache APEX

a unified, enterprise platform for big data stream and batch processing.

Apache Beam

a unified model and set of language-specific SDKs for defining and executing data processing workflows.

Apache Crunch

a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.

Apache DataFu

collection of user-defined functions for Hadoop and Pig developed by LinkedIn.

Apache Flink

high-performance runtime, and automatic program optimization.

Apache Gearpump

real-time big data streaming engine based on Akka.

Apache Gora

framework for in-memory data model and persistence.

Apache Hama

BSP (Bulk Synchronous Parallel) computing framework.

Apache MapReduce

programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

Apache Pig

high level language to express data analysis programs for Hadoop.

Apache REEF

retainable evaluator execution framework to simplify and unify the lower layers of big data systems.

Apache S4

framework for stream processing, implementation of S4.

Apache Samza

stream processing framework, based on Kafka and YARN.

Apache Spark

framework for in-memory cluster computing.

Apache Spark Streaming

framework for stream processing, part of Spark.

Apache Storm

framework for stream processing by Twitter, also on YARN.

Apache Tez

application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

Apache Twill

abstraction over YARN that reduces the complexity of developing distributed applications.

Baidu Bigflow

an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.

Cascalog

data processing and querying library.

Cheetah

High Performance, Custom Data Warehouse on Top of MapReduce.

Concurrent Cascading

framework for data management/analytics on Hadoop.

Damballa Parkour

MapReduce library for Clojure.

Datasalt Pangool

alternative MapReduce paradigm.

DataTorrent StrAM

real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.

Facebook Corona

Hadoop enhancement which removes single point of failure.

Facebook Peregrine

Map Reduce framework.

Facebook Scuba

distributed in-memory datastore.

Google Dataflow

create data pipelines to help ingest, transform and analyze data.

Google MapReduce

map reduce framework.

Google MillWheel

fault tolerant stream processing framework.

IBM Streams

platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.

JAQL

declarative programming language for working with structured, semi-structured and unstructured data.

Kite

a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

Metamarkets Druid

framework for real-time analysis of large datasets.

Netflix PigPen

map-reduce for Clojure which compiles to Apache Pig.

Nokia Disco

MapReduce framework developed by Nokia.

Onyx

Distributed computation for the cloud.

Pinterest Pinlater

asynchronous job execution system.

Pydoop

Python MapReduce and HDFS API for Hadoop.

Rackerlabs Blueflood

multi-tenant distributed metric processing system

Ray

A fast and simple framework for building and running distributed applications.

Skale

High performance distributed data processing in NodeJS.

Stratosphere

general purpose cluster computing framework.

Streamdrill

useful for counting activities of event streams over different time windows and finding the most active one.

streamsx.topology

Libraries to enable building IBM Streams applications in Java, Python or Scala.

Tuktu

Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!

Twitter Heron

Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.

Twitter Scalding

Scala library for Map Reduce jobs, built on Cascading.

Twitter Summingbird

Streaming MapReduce with Scalding and Storm, by Twitter.

Twitter TSAR

TimeSeries AggregatoR by Twitter.

Wallaroo

The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.

Document Data Model (10 items)

Actian Versant

commercial object-oriented database management system.

Crate Data

an open source, massively scalable data store. It requires zero administration.

Facebook Apollo

Facebook’s Paxos-like NoSQL database.

jumboDB

document oriented datastore over Hadoop.

LinkedIn Espresso

horizontally scalable document-oriented NoSQL data store.

MarkLogic

Schema-agnostic Enterprise NoSQL database technology.

Microsoft Azure DocumentDB

NoSQL cloud database service with protocol support for MongoDB

MongoDB

Document-oriented database system.

RavenDB

A transactional, open-source Document Database.

RethinkDB

document database that supports queries like table joins and group by.

Embedded Databases (6 items)

Actian PSQL

ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.

BerkeleyDB

a software library that provides a high-performance embedded database for key/value data.

HanoiDB

Erlang LSM BTree Storage.

LevelDB

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

LMDB

ultra-fast, ultra-compact key-value embedded data store developed by Symas.

RocksDB

embeddable persistent key-value store for fast storage based on LevelDB.

Frameworks (7 items)

Apache Hadoop

framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Bistro

general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via *functions* and processes data via *column operations* as opposed to having only set operations in conventional approaches like MapReduce or SQL.

IBM Streams

platform for distributed processing and real-time analytics. Integrates with many of the popular technologies in the Big Data ecosystem (Kafka, HDFS, Spark, etc.)

Pachyderm

Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.

Polyaxon

A platform for reproducible and scalable machine learning and deep learning.

Smooks

An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.

Tigon

High Throughput Real-time Stream Processing Framework.

Graph Data Model (24 items)

AgensGraph

a new generation multi-model graph database for the modern complex data environment.

Apache Giraph

implementation of Pregel, based on Hadoop.

Apache Spark Bagel

implementation of Pregel, part of Spark.

ArangoDB

multi model distributed database.

DGraph

A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.

EliasDB

a lightweight graph based database that does not require any third-party libraries.

Facebook TAO

TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.

GCHQ Gaffer

Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.

Google Cayley

open-source graph database.

Google Pregel

graph processing framework.

GraphLab PowerGraph

a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.

GraphX

resilient Distributed Graph System on Spark.

Gremlin

graph traversal language.

Infovore

RDF-centric Map/Reduce framework.

Intel GraphBuilder

tools to construct large-scale graphs on top of Hadoop.

JanusGraph

open-source, distributed graph database

MapGraph

Massively Parallel Graph processing on GPUs.

Microsoft Graph Engine

a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.

Neo4j

graph database written entirely in Java.

NodeXL

A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.

OrientDB

document and graph database.

Phoebus

framework for large scale graph processing.

Titan

distributed graph database, built over Cassandra.

Twitter FlockDB

distributed graph database.

Interesting Papers (40 items)

2003

Google - The Google File System.

2004

Google - MapReduce: Simplified Data Processing on Large Clusters.

2006

Google - The Chubby lock service for loosely-coupled distributed systems.

2006

Google - Bigtable: A Distributed Storage System for Structured Data.

2007

Amazon - Dynamo: Amazon’s Highly Available Key-value Store.

2008

AMPLab - Chukwa: A large-scale monitoring system.

Interesting Papers

2009

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.

Interesting Papers

2010

Facebook - Finding a needle in Haystack: Facebook’s photo storage.

Interesting Papers

2010

AMPLab - Spark: Cluster Computing with Working Sets.

Interesting Papers

2010

Google - Pregel: A System for Large-Scale Graph Processing.

Interesting Papers

2010

Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).

Interesting Papers

2010

Google - Dremel: Interactive Analysis of Web-Scale Datasets.

Interesting Papers

2010

Yahoo - S4: Distributed Stream Computing Platform.

Interesting Papers

2011

AMPLab - Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters.

Interesting Papers

2011

AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.

Interesting Papers

2011

Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

Interesting Papers

2012

Twitter - The Unified Logging Infrastructure

Interesting Papers

2012

AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.

Interesting Papers

2012

AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.

Interesting Papers

2012

AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.

Interesting Papers

2012

Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.

Interesting Papers

2012

Microsoft - Paxos Made Parallel.

Interesting Papers

2012

AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.

Interesting Papers

2012

Google - Processing a trillion cells per mouse click.

Interesting Papers

2012

Google - Spanner: Google’s Globally-Distributed Database.

Interesting Papers

2013

AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

Interesting Papers

2013

AMPLab - MLbase: A Distributed Machine-learning System.

Interesting Papers

2013

AMPLab - Shark: SQL and Rich Analytics at Scale.

Interesting Papers

2013

AMPLab - GraphX: A Resilient Distributed Graph System on Spark.

Interesting Papers

2013

Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.

Interesting Papers

2013

Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.

Interesting Papers

2013

Metamarkets - Druid: A Real-time Analytical Data Store.

Interesting Papers

2013

Google - Online, Asynchronous Schema Change in F1.

Interesting Papers

2013

Google - F1: A Distributed SQL Database That Scales.

Interesting Papers

2013

Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.

Interesting Papers

2013

Facebook - Scuba: Diving into Data at Facebook.

Interesting Papers

2013

Facebook - Unicorn: A System for Searching the Social Graph.

Interesting Papers

2013

Facebook - Scaling Memcache at Facebook.

Interesting Papers

2014

Stanford - Mining of Massive Datasets.

Interesting Papers

2015

Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.

Interesting Papers

Interesting Readings(5 items)

Big Data Benchmark

Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.

Interesting Readings

Monitoring Cassandra performance

Guide to monitoring Cassandra, including native methods for metrics collection.

Interesting Readings

Monitoring Hadoop performance

Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.

Interesting Readings

Monitoring Kafka performance

Guide to monitoring Apache Kafka, including native methods for metrics collection.

Interesting Readings

NoSQL Comparison

Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

Interesting Readings

Internet of things and sensor data(10 items)

2lemetry

Platform for Internet of things.

Internet of things and sensor data

Ably

Pub/sub messaging platform for IoT

Internet of things and sensor data

Apache Edgent (Incubating)

a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.

Internet of things and sensor data

Azure IoT Hub

Cloud-based bi-directional monitoring and messaging hub

Internet of things and sensor data

Evrything

Making products smart

Internet of things and sensor data

IFTTT

If this then that

Internet of things and sensor data

NetLytics

Analytics platform to process network data on Spark.

Internet of things and sensor data

PubNub

Data stream network

Internet of things and sensor data

TempoIQ

Cloud-based sensor analytics.

Internet of things and sensor data

ThingWorx

Rapid development and connection of intelligent systems

Internet of things and sensor data

Key Map Data Model(12 items)

Apache Accumulo

distributed key/value store, built on Hadoop.

Key Map Data Model

Apache Cassandra

column-oriented distributed datastore, inspired by BigTable.

Key Map Data Model

Apache HBase

column-oriented distributed datastore, inspired by BigTable.

Key Map Data Model

Baidu Tera

an Internet-scale database, inspired by BigTable.

Key Map Data Model

Facebook HydraBase

evolution of HBase made by Facebook.

Key Map Data Model

Google BigTable

column-oriented distributed datastore.

Key Map Data Model

Google Cloud Datastore

is a fully managed, schemaless database for storing non-relational data over BigTable.

Key Map Data Model

Hypertable

column-oriented distributed datastore, inspired by BigTable.

Key Map Data Model

InfiniDB

is accessed through a MySQL interface and uses massively parallel processing to parallelize queries.

Key Map Data Model

ScyllaDB

column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.

Key Map Data Model

Tephra

Transactions for HBase.

Key Map Data Model

Twitter Manhattan

real-time, multi-tenant distributed database for Twitter scale.

Key Map Data Model

Key-value Data Model(25 items)

Aerospike

NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."

Key-value Data Model

Amazon DynamoDB

distributed key/value store, implementation of Dynamo paper.

Key-value Data Model

Badger

a fast, simple, efficient, and persistent key-value store written natively in Go.

Key-value Data Model

Bolt

an embedded key-value database for Go.

Key-value Data Model

BTDB

Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more

Key-value Data Model

BuntDB

a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.

Key-value Data Model

Edis

is a protocol-compatible Server replacement for Redis.

Key-value Data Model

ElephantDB

Distributed database specialized in exporting data from Hadoop.

Key-value Data Model

EventStore

distributed time series database.

Key-value Data Model

GhostDB

a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

Key-value Data Model

Graviton

a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).

Key-value Data Model

GridDB

suitable for sensor data stored in a timeseries.

Key-value Data Model

HyperDex

a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.

Key-value Data Model

Ignite

is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.

Key-value Data Model

LinkedIn Krati

is a simple persistent data store with very low latency and high throughput.

Key-value Data Model

LinkedIn Voldemort

distributed key/value storage system.

Key-value Data Model

Oracle NoSQL Database

distributed key-value database by Oracle Corporation.

Key-value Data Model

Redis

in-memory key-value datastore.

Key-value Data Model

Riak

a decentralized datastore.

Key-value Data Model

Storehaus

library to work with asynchronous key value stores, by Twitter.

Key-value Data Model

SummitDB

an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.

Key-value Data Model

Tarantool

an efficient NoSQL database and a Lua application server.

Key-value Data Model

TiKV

a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.

Key-value Data Model

Tile38

a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON

Key-value Data Model

TreodeDB

key-value store that's replicated and sharded and provides atomic multirow writes.

Key-value Data Model

Machine Learning(41 items)

Azure ML Studio

Cloud-based AzureML, R, Python Machine Learning platform

Machine Learning

BIDMach

CPU and GPU-accelerated Machine Learning Library.

Machine Learning

brain

Neural networks in JavaScript.

Machine Learning

Concurrent Pattern

machine learning library for Cascading.

Machine Learning

convnetjs

Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

Machine Learning

DataVec

A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.

Machine Learning

Decider

Flexible and Extensible Machine Learning in Ruby.

Machine Learning

Deeplearning4j

Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.

Machine Learning

ENCOG

machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.

Machine Learning

etcML

text classification with machine learning.

Machine Learning

Etsy Conjecture

scalable Machine Learning in Scalding.

Machine Learning

Feast

A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.

Machine Learning

GraphLab Create

A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.

Machine Learning

H2O

statistical, machine learning and math runtime with Hadoop. R and Python.

Machine Learning

Karate Club

An unsupervised machine learning library for graph structured data. Python

Machine Learning

Keras

An intuitive neural net API inspired by Torch that runs atop Theano and Tensorflow.

Machine Learning

Lambdo

Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.

Machine Learning

Little Ball of Fur

A subsampling library for graph structured data. Python

Machine Learning

Mahout

An Apache-backed machine learning library for Hadoop.

Machine Learning

ML Workspace

All-in-one web-based IDE specialized for machine learning and data science.

Machine Learning

MLbase

distributed machine learning libraries for the BDAS stack.

Machine Learning

MLPNeuralNet

Fast multilayer perceptron neural network library for iOS and Mac OS X.

Machine Learning

MOA

MOA performs big data stream mining in real time, and large scale machine learning.

Machine Learning

MonkeyLearn

Text mining made easy. Extract and classify data from text.

Machine Learning

ND4J

A matrix library for the JVM. Numpy for Java.

Machine Learning

nupic

Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.

Machine Learning

Oryx

Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.

Machine Learning

PredictionIO

machine learning server built on Hadoop, Mahout and Cascading.

Machine Learning

PyTorch Geometric Temporal

a temporal extension library for PyTorch Geometric.

Machine Learning

RL4J

Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.

Machine Learning

SAMOA

distributed streaming machine learning framework.

Machine Learning

scikit-learn

scikit-learn: machine learning in Python.

Machine Learning

Shapley

A data-driven framework to quantify the value of classifiers in a machine learning ensemble.

Machine Learning

Sibyl

System for Large Scale Machine Learning at Google.

Machine Learning

Spark MLlib

a Spark implementation of some common machine learning (ML) functionality.

Machine Learning

TensorFlow

Library from Google for machine learning using data flow graphs.

Machine Learning

Theano

A Python-focused machine learning library supported by the University of Montreal.

Machine Learning

Torch

A deep learning library with a Lua API, supported by NYU and Facebook.

Machine Learning

Velox

System for serving machine learning predictions.

Machine Learning

Vowpal Wabbit

learning system sponsored by Microsoft and Yahoo!.

Machine Learning

WEKA

suite of machine learning software.

Machine Learning

Memcached forks and evolutions(5 items)

Facebook McDipper

key/value cache for flash storage.

Memcached forks and evolutions

Facebook Memcached

fork of Memcached.

Memcached forks and evolutions

Twemproxy

A fast, light-weight proxy for memcached and redis.

Memcached forks and evolutions

Twitter Fatcache

key/value cache for flash storage.

Memcached forks and evolutions

Twitter Twemcache

fork of Memcached.

Memcached forks and evolutions

MySQL forks and evolutions(9 items)

Amazon RDS

MySQL databases in Amazon's cloud.

MySQL forks and evolutions

Drizzle

evolution of MySQL 6.0.

MySQL forks and evolutions

Google Cloud SQL

MySQL databases in Google's cloud.

MySQL forks and evolutions

MariaDB

enhanced, drop-in replacement for MySQL.

MySQL forks and evolutions

MySQL Cluster

MySQL implementation using NDB Cluster storage engine.

MySQL forks and evolutions

Percona Server

enhanced, drop-in replacement for MySQL.

MySQL forks and evolutions

ProxySQL

High Performance Proxy for MySQL.

MySQL forks and evolutions

TokuDB

TokuDB is a storage engine for MySQL and MariaDB.

MySQL forks and evolutions

WebScaleSQL

is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

MySQL forks and evolutions

NewSQL Databases(29 items)

Actian Ingres

commercially supported, open-source SQL relational database management system.

NewSQL Databases

ActorDB

a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.

NewSQL Databases

Amazon RedShift

data warehouse service, based on PostgreSQL.

NewSQL Databases

BayesDB

statistics-oriented SQL database.

NewSQL Databases

Bedrock

a simple, modular, networked and distributed transaction layer built atop SQLite.

NewSQL Databases

CitusDB

scales out PostgreSQL through sharding and replication.

NewSQL Databases

Cockroach

Scalable, Geo-Replicated, Transactional Datastore.

NewSQL Databases

Comdb2

a clustered RDBMS built on optimistic concurrency control techniques.

NewSQL Databases

Datomic

distributed database designed to enable scalable, flexible and intelligent applications.

NewSQL Databases

FoundationDB

distributed database, inspired by F1.

NewSQL Databases

Google F1

distributed SQL database built on Spanner.

NewSQL Databases

Google Spanner

globally distributed semi-relational database.

NewSQL Databases

H-Store

is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.

NewSQL Databases

Haeinsa

linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.

NewSQL Databases

HandlerSocket

NoSQL plugin for MySQL/MariaDB.

NewSQL Databases

InfiniSQL

infinitely scalable RDBMS.

NewSQL Databases

KarelDB

a relational database backed by Apache Kafka.

NewSQL Databases

Map-D

GPU in-memory database, big data analysis and visualization platform.

NewSQL Databases

MemSQL

in-memory SQL database with optimized columnar storage on flash.

NewSQL Databases

NuoDB

SQL/ACID compliant distributed database.

NewSQL Databases

Oracle TimesTen in-Memory Database

in-memory, relational database management system with persistence and recoverability.

NewSQL Databases

Pivotal GemFire XD

Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.

NewSQL Databases

SAP HANA

is an in-memory, column-oriented, relational database management system.

NewSQL Databases

SenseiDB

distributed, realtime, semi-structured database.

NewSQL Databases

Sky

database used for flexible, high performance analysis of behavioral data.

NewSQL Databases

SymmetricDS

open source software for both file and database synchronization.

NewSQL Databases

TiDB

TiDB is a distributed SQL database. Inspired by the design of Google F1.

NewSQL Databases

VoltDB

claims to be fastest in-memory database.

NewSQL Databases

yugabyteDB

open source, high-performance, distributed SQL database compatible with PostgreSQL.

NewSQL Databases

PostgreSQL forks and evolutions(8 items)

HadoopDB

hybrid of MapReduce and DBMS.

PostgreSQL forks and evolutions

IBM Netezza

high-performance data warehouse appliances.

PostgreSQL forks and evolutions

PipelineDB

The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables

PostgreSQL forks and evolutions

Postgres-XL

Scalable Open Source PostgreSQL-based Database Cluster.

PostgreSQL forks and evolutions

RecDB

Open Source Recommendation Engine Built Entirely Inside PostgreSQL.

PostgreSQL forks and evolutions

Stado

open source MPP database system solely targeted at data warehousing and data mart applications.

PostgreSQL forks and evolutions

TimescaleDB

An open-source time-series database optimized for fast ingest and complex queries

PostgreSQL forks and evolutions

Yahoo Everest

multi-petabyte database / MPP derived from PostgreSQL.

PostgreSQL forks and evolutions

RDBMS(4 items)

MySQL

The world's most popular open source database.

Oracle Database

object-relational database management system.

PostgreSQL

The world's most advanced open source database.

Teradata

high-performance MPP data warehouse platform.

SQL-like processing(25 items)

Actian SQL for Hadoop

high performance interactive SQL access to all Hadoop data.

SQL-like processing

Apache Calcite

framework that allows efficient translation of queries involving heterogeneous and federated data.

SQL-like processing

Apache Drill

framework for interactive analysis, inspired by Dremel.

SQL-like processing

Apache HCatalog

table and storage management layer for Hadoop.

SQL-like processing

Apache Hive

SQL-like data warehouse system for Hadoop.

SQL-like processing

Apache Phoenix

SQL skin over HBase.

SQL-like processing

Aster Database

SQL-like analytic processing for MapReduce.

SQL-like processing

Cloudera Impala

framework for interactive analysis, inspired by Dremel.

SQL-like processing

Concurrent Lingual

SQL-like query language for Cascading.

SQL-like processing

Datasalt Splout SQL

full SQL query engine for big datasets.

SQL-like processing

Dremio

an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.

SQL-like processing

Facebook PrestoDB

distributed SQL query engine.

SQL-like processing

Google BigQuery

framework for interactive analysis, implementation of Dremel.

SQL-like processing

Iceberg

an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.

SQL-like processing

Invantive SQL

SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.

SQL-like processing

Materialize

is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.

SQL-like processing

PipelineDB

an open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables.

SQL-like processing

Pivotal HDB

SQL-like data warehouse system for Hadoop.

SQL-like processing

RainstorDB

database for storing petabyte-scale volumes of structured and semi-structured data.

SQL-like processing

Spark Catalyst

is a Query Optimization Framework for Spark and Shark.

SQL-like processing

SparkSQL

Manipulating Structured Data Using Spark.

SQL-like processing

Splice Machine

a full-featured SQL-on-Hadoop RDBMS with ACID transactions.

SQL-like processing

Stinger

interactive query for Hive.

SQL-like processing

Tajo

distributed data warehouse system on Hadoop.

SQL-like processing

Trafodion

enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

SQL-like processing

Scheduling(11 items)

Apache Airflow

a platform to programmatically author, schedule and monitor workflows.

Apache Aurora

is a service scheduler that runs on top of Apache Mesos.

Apache Falcon

data management framework.

Apache Oozie

workflow job scheduler.

Azure Data Factory

cloud-based pipeline orchestration for on-prem, cloud and HDInsight

Chronos

distributed and fault-tolerant scheduler.

Cronicle

Distributed, easy to install, NodeJS based, task scheduler

Dagster

a data orchestrator for machine learning, analytics, and ETL.

LinkedIn Azkaban

batch workflow job scheduler.

Schedoscope

Scala DSL for agile scheduling of Hadoop jobs.

Sparrow

scheduling platform.

Search engine and framework(19 items)

Annoy

is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Search engine and framework

Apache Lucene

Search engine library.

Search engine and framework

Apache Solr

Search platform for Apache Lucene.

Search engine and framework

Elassandra

is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.

Search engine and framework

ElasticSearch

Search and analytics engine based on Apache Lucene.

Search engine and framework

Enigma.io

Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.

Search engine and framework

Facebook Faiss

is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.

Search engine and framework

Google Caffeine

continuous indexing system.

Search engine and framework

Google Percolator

continuous indexing system.

Search engine and framework

HBase Coprocessor

implementation of Percolator, part of HBase.

Search engine and framework

Lily HBase Indexer

quickly and easily search for any content stored in HBase.

Search engine and framework

LinkedIn Bobo

is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.

Search engine and framework

LinkedIn Cleo

is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.

Search engine and framework

LinkedIn Galene

search architecture at LinkedIn.

Search engine and framework

LinkedIn Zoie

is a realtime search/indexing system written in Java.

Search engine and framework

MG4J

MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.

Search engine and framework

Sphinx Search Server

fulltext search engine.

Search engine and framework

Vespa

is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

Search engine and framework

Weaviate

Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.

Search engine and framework

Security(5 items)

Apache Eagle

real time monitoring solution

Apache Knox Gateway

single point of secure access for Hadoop clusters.

Apache Ranger

Central security admin & fine-grained authorization for Hadoop

Apache Sentry

security module for data stored in Hadoop.

BDA

The vulnerability detector for Hadoop and Spark

Service Programming(16 items)

Akka Toolkit

runtime for distributed, fault-tolerant, event-driven applications on the JVM.

Service Programming

Apache Avro

data serialization system.

Service Programming

Apache Curator

Java libraries for Apache ZooKeeper.

Service Programming

Apache Karaf

OSGi runtime that runs on top of any OSGi framework.

Service Programming

Apache Thrift

framework to build binary protocols.

Service Programming

Apache ZooKeeper

centralized service for process management.

Service Programming

Google Chubby

a lock service for loosely-coupled distributed systems.

Service Programming

Hydrosphere Mist

a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.

Service Programming

LinkedIn Norbert

cluster manager.

Service Programming

Mara

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Service Programming

OpenMPI

message passing framework.

Service Programming

Serf

decentralized solution for service discovery and orchestration.

Service Programming

Spotify Luigi

a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Service Programming

Spring XD

distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.

Service Programming

Twitter Elephant Bird

libraries for working with LZOP-compressed data.

Service Programming

Twitter Finagle

asynchronous network stack for the JVM.

Service Programming

System Deployment(17 items)

Apache Ambari

operational framework for Hadoop management.

System Deployment

Apache Bigtop

system deployment framework for the Hadoop ecosystem.

System Deployment

Apache Helix

cluster management framework.

System Deployment

Apache Mesos

cluster manager.

System Deployment

Apache Slider

is a YARN application to deploy existing distributed applications on YARN.

System Deployment

Apache Whirr

set of libraries for running cloud services.

System Deployment

Apache YARN

Cluster manager.

System Deployment

Brooklyn

library that simplifies application deployment and management.

System Deployment

Buildoop

Similar to Apache BigTop based on Groovy language.

System Deployment

Cloudera HUE

web application for interacting with Hadoop.

System Deployment

Facebook Prism

multi-datacenter replication system.

System Deployment

Google Borg

job scheduling and monitoring system.

System Deployment

Google Omega

job scheduling and monitoring system.

System Deployment

Hortonworks HOYA

application that can deploy HBase cluster on YARN.

System Deployment

Kubernetes

a system for automating deployment, scaling, and management of containerized applications.

System Deployment

Linkis

Linkis helps easily connect to various back-end computation/storage engines.

System Deployment

Marathon

Mesos framework for long-running services.

System Deployment

Time-Series Databases(25 items)

Akumuli

Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".

Time-Series Databases

Axibase Time Series Database

Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.

Time-Series Databases

Beringei

Facebook's in-memory time-series database.

Time-Series Databases

Blueflood

A distributed system designed to ingest and process time series data

Time-Series Databases

Chronix

a time series store built to keep time series highly compressed and provide fast access times.

Time-Series Databases

Cube

uses MongoDB to store time series data.

Time-Series Databases

Dalmatiner DB

Fast distributed metrics database

Time-Series Databases

Druid

Column oriented distributed data store ideal for powering interactive applications

Time-Series Databases

Heroic

is a scalable time series database based on Cassandra and Elasticsearch.

Time-Series Databases

InfluxDB

a time series database with optimised IO and queries, supports pgsql and influx wire protocols.

Time-Series Databases

IronDB

scalable, general-purpose time series database.

Time-Series Databases

KairosDB

similar to OpenTSDB but allows for Cassandra.

Time-Series Databases

M3DB

a distributed time series database that can be used for storing realtime metrics at long retention.

Time-Series Databases

Newts

a time series database based on Apache Cassandra.

Time-Series Databases

OpenTSDB

distributed time series database on top of HBase.

Time-Series Databases

Prometheus

a time series database and service monitoring system.

Time-Series Databases

QuestDB

high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.

Time-Series Databases

Rhombus

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

Time-Series Databases

Riak-TS

Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

Time-Series Databases

SiriDB

Highly-scalable, robust and fast, open source time series database with cluster functionality.

Time-Series Databases

TDengine

a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data

Time-Series Databases

Thanos

Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.

Time-Series Databases

Timely

Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

Time-Series Databases

TrailDB

an efficient tool for storing and querying series of events.

Time-Series Databases

VictoriaMetrics

fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included

Time-Series Databases

Videos(4 items)

Data warehouse schema design - dimensional mode...

Introduction to schema design for data warehouse using the star schema method.

Elasticsearch 7 and Elastic Stack

LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.

Machine Learning, Data Science and Deep Learnin...

LiveVideo tutorial that covers machine learning, Tensorflow, artificial intelligence, and neural networks.

Spark in Motion

Spark in Motion teaches you how to use Spark for batch and streaming data analytics.