Empirical Software Engineering

Evidence-based research on software systems.

69 resources, 5 categories

Data Sets (29 items)


AndroidTimeMachine

Graph-based dataset of commit history of 8,431 real-world Android apps.


AndroZoo

Collection of Android Applications.


Bug Prediction Dataset

Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.


Code Reviews

Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.


CoREBench

Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.


Cryptocurrency GitHub Activity and Market Cap D...

Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.


Defects4J

Collection of 395 reproducible bugs collected with the goal of advancing software testing research.


Eclipse AERI stacktraces

Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.


Enron Spreadsheets and Emails

All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'.


Findbugs-maven

Set of FindBugs reports for the Java projects of the Maven repository.


GHTorrent

Scalable, queriable, offline mirror of data offered through the GitHub REST API.


GitHub Bug Dataset

Bug Dataset of 15 Java open-source projects characterized by static source code metrics.


GitHub on Google BigQuery

GitHub data accessible through Google's BigQuery platform.


Grammar Zoo

Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.


KaVE

Developer tool interaction data.


Linux Kernel 4.21 Call Graphs

The Linux Kernel 4.21 Call Graphs produced using CScout.


Maven Dependency Graph

Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.


Maven metrics

Collection of software complexity & sizing metrics for the Maven Repository.


mzdata

Multi-extract and multi-level dataset of Mozilla issue tracking history.


npm-miner

Analysis results of five open-source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2,000 npm packages that are popular in terms of stars and downloads.


OCL Expressions on GitHub

Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.


RepoReapers Data Set

Data set containing a collection of engineered software projects from GHTorrent.


Software Heritage Graph Dataset

Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation.


Stack Exchange

Anonymized dump of all user-contributed content on the Stack Exchange network.


STAMINA

STAMINA (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs).


TravisTorrent

Provides free and easy-to-use Travis CI build analyses.


Ultimate Debian Database (UDD)

Data about various aspects of Debian (e.g. packages, bugs, maintainers) in a single SQL database.


Unified Bug Dataset

Static source-code-based datasets, including the Bugcatchers Bug Dataset, the Bug Prediction Dataset, the Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets from the PROMISE repository.


Unix history

Git repository covering 46 years of Unix evolution.


Tools (20 items)


astminer

Library and tool for mining of path-based representations of code and other data derived from ASTs.


Boa

Domain-specific language and infrastructure that eases mining software repositories.


buckwheat

Multi-language tokenizer for extracting identifiers from source code.


ckjm

Chidamber and Kemerer Java Metrics.


Coming

A Java framework for analyzing code changes and mining instances of change patterns from Git repositories.


CryptOSS

Mine GitHub activity and market cap data for cryptocurrency projects.


DbDeo

Extract embedded SQL statements and detect database schema smells.


Designite

Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.


DesigniteJava

Compute source code metrics and detect a variety of implementation and design smells for Java.


Diggit

Agile Ruby Tool to analyze Git repositories.


GrimoireLab

Free/Libre/Open Source tools for Software Development Analytics.


Maven-miner

Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Neo4j Graph.


MetricMiner

Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.


Perceval

Fetch repository data from tens of back-ends.


Puppeteer

Detect configuration smells in Puppet code.


PyDriller

Python Framework to analyse Git repositories.


qmcalc

Calculate quality metrics from C source code.


reaper

Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.


RefactoringMiner

Library/API for detection of refactorings in changes of Java code.


VulData7

Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git).
