Framework for distributed storage and processing of very large data sets.
The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.
Mozilla's utility library for Hadoop, HBase, Pig, etc.
A collection of libraries for working with large-scale data in Hadoop
Apache Pig
Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
Open Source Big Data Security Analytics
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop
Machine learning and natural language processing with Apache Pig
Apache Flume
Apache Kafka
Apache Sqoop
Universal data ingestion framework for Hadoop
Netflix's distributed Data Pipeline
Metadata tagging & lineage capture supporting complex business data taxonomies
A Dynamic Data Management Framework
Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.
Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.
Schema Registry is a framework to build metadata repositories.
Enterprise-grade unified stream and batch processing engine.
Apache Flink is a platform for efficient, distributed, general-purpose data processing.
Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.
Cascading is the proven application development platform for building data applications on Hadoop.
A community index of packages for Apache Spark
A community site for Apache Spark
Apache Hadoop
An Object Store for Apache Hadoop
Distributed in-memory platform
Apache Kylin is an open source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.
A Framework for YARN-based, Data Processing Applications In Hadoop
Go-based toolkit for ETL and feature extraction on Hadoop
Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
Big Data Spatial Analytics for the Hadoop Framework
Python MapReduce library written in Cython.
HDFS-DU is an interactive visualization of the Hadoop distributed file system.
mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
Pydoop is a package that provides a Python API for Hadoop.
SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.
Hadoop log aggregator and dashboard
Apache Avro is a data serialization system.
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application
A web-based notebook that enables interactive data analytics
Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.
Native Go clients for Apache Hadoop YARN.
A Web interface for analyzing data with Apache Hadoop.
A set of libraries, tools, examples, and documentation
A graphical editor for editing Apache Oozie workflows inside Eclipse.
Web tool for the Confluent Schema Registry to create, view, search, evolve, and configure the Avro schemas of your Kafka cluster, and to view their history.
A pure Python HDFS client
Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.
SINGA is a general distributed deep learning platform for training big deep learning models over large datasets
BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.
MLlib is Apache Spark's scalable machine learning library.
Lambda architecture on Spark and Kafka for real-time, large-scale machine learning
R is a free software environment for statistical computing and graphics.
including RHDFS, RHBase, RMR2, plyrmr
A super simple utility for testing Apache Hive scripts locally for non-Java developers.
An open source unit test framework for Hadoop Hive queries based on JUnit4
Unit test framework for Hive and hive-service
Python interface to Hive and Presto
WebUI for query engines: Hive and Presto
The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
Apache HBase
A SQL skin over HBase supporting secondary indices
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
A developer-friendly Python library to interact with Apache HBase.
Secondary Index for HBase
The Scalable Time Series Database
A big data cluster management tool that creates and manages clusters of different technologies.
Apache Ambari
Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
ZooKeeper client wrapper and rich ZooKeeper framework
Apache ZooKeeper
Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.
Send logs from Hadoop to Elasticsearch for monitoring and alerting.
A high-performance, column-oriented, distributed data store.
Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.
Apache Pulsar (incubating) is a highly scalable, low-latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.
Schema-free SQL Query Engine
Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
A SQL skin over HBase supporting secondary indices
Data warehouse system for Apache Hadoop
SQL interface for Cascading (MR/Tez job generator)
Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines.
Data management and processing platform
A dataflow system
Apache Oozie
Python package that helps you build complex pipelines of batch jobs
Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
Running MPICH2 on YARN