Hadoop

Framework for distributed storage and processing of very large data sets.

146 resources, 22 categories

Benchmark(3 items)

Big Data Benchmark

Benchmark

HiBench

Benchmark

YCSB

The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Benchmark

Books(8 items)

DSL(8 items)

akela

Mozilla's utility library for Hadoop, HBase, Pig, etc.

DSL

Apache DataFu

A collection of libraries for working with large-scale data in Hadoop

DSL

Apache Pig

DSL

Lipstick

Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig

DSL

packetpig

Open Source Big Data Security Analytics

DSL

PigPen

PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

DSL

seqpig

Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop

DSL

vahara

Machine learning and natural language processing with Apache Pig

DSL

Data Ingestion and Integration(5 items)

Apache Flume

Data Ingestion and Integration

Apache Kafka

Data Ingestion and Integration

Apache Sqoop

Data Ingestion and Integration

Gobblin from LinkedIn

Universal data ingestion framework for Hadoop

Data Ingestion and Integration

Suro

Netflix's distributed Data Pipeline

Data Ingestion and Integration

Data Management(5 items)

Apache Atlas

Metadata tagging & lineage capture supporting complex business data taxonomies

Data Management

Apache Calcite

A Dynamic Data Management Framework

Data Management

Apache Kudu

Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Data Management

Confluent Schema registry for Kafka

Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.

Data Management

Hortonworks Schema Registry

Schema Registry is a framework to build metadata repositories.

Data Management

Distributed Computing and Programming(8 items)

Apache Apex (incubating)

Enterprise-grade unified stream and batch processing engine.

Distributed Computing and Programming

Apache Crunch

Distributed Computing and Programming

Apache Flink

Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Distributed Computing and Programming

Apache Livy (incubating)

Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Distributed Computing and Programming

Apache Spark

Distributed Computing and Programming

Cascading

Cascading is the proven application development platform for building data applications on Hadoop.

Distributed Computing and Programming

Spark Packages

A community index of packages for Apache Spark

Distributed Computing and Programming

SparkHub

A community site for Apache Spark

Distributed Computing and Programming

Hadoop(15 items)

Apache Hadoop

Hadoop

Apache Hadoop Ozone

An Object Store for Apache Hadoop

Hadoop

Apache Ignite

Distributed in-memory platform

Hadoop

Apache Kylin

Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

Hadoop

Apache Tez

A Framework for YARN-based, Data Processing Applications In Hadoop

Hadoop

Crunch

Go-based toolkit for ETL and feature extraction on Hadoop

Hadoop

Elasticsearch Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.

Hadoop

Genie

Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.

Hadoop

GIS Tools for Hadoop

Big Data Spatial Analytics for the Hadoop Framework

Hadoop

hadoopy

Python MapReduce library written in Cython.

Hadoop

hdfs-du

HDFS-DU is an interactive visualization of the Hadoop distributed file system.

Hadoop

mrjob

mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

Hadoop

pydoop

Pydoop is a package that provides a Python API for Hadoop.

Hadoop

SpatialHadoop

SpatialHadoop is a MapReduce extension to Apache Hadoop designed specially to work with spatial data.

Hadoop

White Elephant

Hadoop log aggregator and dashboard

Hadoop

Hadoop and Big Data Events(4 items)

ApacheCon

Hadoop and Big Data Events

DataWorks Summit

Hadoop and Big Data Events

Spark Summit

Hadoop and Big Data Events

Strata + Hadoop World

Hadoop and Big Data Events

Libraries and Tools(14 items)

Apache Avro

Apache Avro is a data serialization system.

Libraries and Tools

Apache Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Libraries and Tools

Apache Superset (incubating)

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Libraries and Tools

Apache Thrift

Libraries and Tools

Apache Zeppelin

A web-based notebook that enables interactive data analytics

Libraries and Tools

Elephant Bird

Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Libraries and Tools

gohadoop

Native Go clients for Apache Hadoop YARN.

Libraries and Tools

hdfs - A native Go client for HDFS

Libraries and Tools

Hue

A Web interface for analyzing data with Apache Hadoop.

Libraries and Tools

Kite Software Development Kit

A set of libraries, tools, examples, and documentation

Libraries and Tools

Oozie Eclipse Plugin

A graphical editor for editing Apache Oozie workflows inside Eclipse.

Libraries and Tools

Schema Registry UI

Web tool for the Confluent Schema Registry in order to create / view / search / evolve / view history & configure Avro schemas of your Kafka cluster.

Libraries and Tools

snakebite

A pure Python HDFS client

Libraries and Tools

Spring for Apache Hadoop

Libraries and Tools

Machine learning and Big Data analytics(9 items)

Apache Hivemall (incubating)

Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Machine learning and Big Data analytics

Apache Lens

Machine learning and Big Data analytics

Apache Mahout

Machine learning and Big Data analytics

Apache SINGA (incubating)

SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

Machine learning and Big Data analytics

BigDL

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Machine learning and Big Data analytics

MLlib

MLlib is Apache Spark's scalable machine learning library.

Machine learning and Big Data analytics

Oryx 2

Lambda architecture on Spark, Kafka for real-time large scale machine learning

Machine learning and Big Data analytics

R

R is a free software environment for statistical computing and graphics.

Machine learning and Big Data analytics

RHadoop

A collection of R packages for Hadoop, including RHDFS, RHBase, RMR2, and plyrmr

Machine learning and Big Data analytics

Misc.(9 items)

.Net FlumeNG Clients

Misc.

Beetest

A super simple utility for testing Apache Hive scripts locally for non-Java developers.

Misc.

Flume MongoDB Sink

Misc.

Flume RabbitMQ source and sink

Misc.

Flume UDP Source

Misc.

HiveRunner

An open source unit test framework for Hadoop Hive queries, based on JUnit 4

Misc.

Hive_test

Unit test framework for Hive and hive-service

Misc.

PyHive

Python interface to Hive and Presto

Misc.

shib

WebUI for query engines: Hive and Presto

Misc.

NoSQL(9 items)

Apache Accumulo

The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

NoSQL

Apache Cassandra

NoSQL

Apache HBase

NoSQL

Apache Phoenix

A SQL skin over HBase supporting secondary indices

NoSQL

Haeinsa

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase

NoSQL

Hannibal

Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

NoSQL

happybase

A developer-friendly Python library to interact with Apache HBase.

NoSQL

hindex

Secondary Index for HBase

NoSQL

OpenTSDB

The Scalable Time Series Database

NoSQL

Packaging, Provisioning and Monitoring(8 items)

ankush

A big data cluster management tool that creates and manages clusters of different technologies.

Packaging, Provisioning and Monitoring

Apache Ambari

Packaging, Provisioning and Monitoring

Apache Bigtop

Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

Packaging, Provisioning and Monitoring

Apache Curator

ZooKeeper client wrapper and rich ZooKeeper framework

Packaging, Provisioning and Monitoring

Apache ZooKeeper

Packaging, Provisioning and Monitoring

Ganglia Monitoring System

Packaging, Provisioning and Monitoring

inviso

Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Packaging, Provisioning and Monitoring

Logit.io

Send logs from Hadoop to Elasticsearch for monitoring and alerting.

Packaging, Provisioning and Monitoring

Presentations(4 items)

Apache Hadoop In Theory And Practice

Presentations

Docker based Hadoop provisioning

Presentations

Hadoop Operations at LinkedIn

Presentations

Hadoop Performance at LinkedIn

Presentations

Realtime Data Processing(6 items)

Apache Druid (incubating)

A high-performance, column-oriented, distributed data store.

Realtime Data Processing

Apache Flink

Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly once stream processing.

Realtime Data Processing

Apache Pulsar (incubating)

Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

Realtime Data Processing

Apache Samza

Realtime Data Processing

Apache Spark

Realtime Data Processing

Apache Storm

Realtime Data Processing

SQL on Hadoop(9 items)

Apache Drill

Schema-free SQL Query Engine

SQL on Hadoop

Apache HAWQ (incubating)

Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop.

SQL on Hadoop

Apache Hive

The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

SQL on Hadoop

Apache Impala

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

SQL on Hadoop

Apache Phoenix

A SQL skin over HBase supporting secondary indices

SQL on Hadoop

Apache Tajo

Data warehouse system for Apache Hadoop

SQL on Hadoop

Apache Trafodion

SQL on Hadoop

Lingual

SQL interface for Cascading (MR/Tez job generator)

SQL on Hadoop

Presto

Distributed SQL Query Engine for Big Data. Open sourced by Facebook.

SQL on Hadoop

Search(3 items)

Apache Solr

Apache Solr is an open source search platform built upon a Java library called Lucene.

Banana

Kibana port for Apache Solr

Elasticsearch

Search Engine Framework(1 item)

Apache Nutch

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Search Engine Framework

Security(3 items)

Apache Knox Gateway

A REST API Gateway for interacting with Hadoop clusters.

Security

Apache Ranger

Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Security

Apache Sentry

An authorization module for Hadoop

Security

Websites(6 items)

AWS BigData Blog

Websites

Hadoop illuminated

Open Source Hadoop Book

Websites

Hadoop Weekly

Websites

Hadoop360

Websites

How to monitor Hadoop metrics

Websites

The Hadoop Ecosystem Table

Websites

Workflow, Lifecycle and Governance(6 items)

Apache Airflow

Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines

Workflow, Lifecycle and Governance

Apache Falcon

Data management and processing platform

Workflow, Lifecycle and Governance

Apache NiFi

A dataflow system

Workflow, Lifecycle and Governance

Apache Oozie

Workflow, Lifecycle and Governance

Azkaban

Workflow, Lifecycle and Governance

Luigi

Python package that helps you build complex pipelines of batch jobs

Workflow, Lifecycle and Governance

YARN(3 items)

Apache Slider

Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

YARN

Apache Twill

Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

YARN

mpich2-yarn

Running MPICH2 on YARN

YARN

Books(8 items)

Apache Hadoop Yarn

Hadoop in Action, Second Edition

Hadoop in Practice, Second Edition

Hadoop Operations

Hadoop: The Definitive Guide

HBase: The Definitive Guide

Programming Hive

Programming Pig
