Information Retrieval

Learn to develop your own search engine.

97 resources, 7 categories

Datasets (44 items)

20 Newsgroups dataset

This data set consists of 20,000 newsgroup posts taken from 20 newsgroup topics.

Advanced Cross Lingual Information Retrieval an...

The dataset is used for cross-lingual question answering, but the task is more complex than in the CLQA dataset.

Blog

Explore information seeking behavior in the blogosphere.

Chemical IR

Address challenges in building large chemical testbeds for chemical IR.

Clinical Decision Support

Investigate techniques to link medical cases to information relevant for patient care.

CLIR Test Collections

This dataset can be used for cross-lingual IR among the CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:

CMU List

Conference and Labs of the Evaluation Forum (CL...

It contains a multi-lingual document collection. The test suite includes:

Confusion

Study the known-item search problem.

Contextual Suggestion

Investigate search techniques for complex information needs that depend on context and user interests.

Cranfield Collections

This is one of the first collections in the IR domain. The dataset is too small for statistical significance analysis but is nevertheless suitable for pilot runs.

Cross Language Q&A (CLQA) dataset collection

It supports the following bilingual and monolingual settings:

Crowdsourcing

Explore crowdsourcing methods for performing and evaluating search.

DBpedia

Linked data web.

Document Understanding Conference (DUC) datasets

Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.

English Gigaword Fifth Edition

This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.

Enterprise

Study search over an organization's data.

Entity

Perform entity-related search (find entities and their properties) on Web data.

Federated Web Search

Study how well results from various search services can be merged.

Filtering

Make a binary decision on whether to retrieve each new incoming document, given a stable information need.

Genomics

Study retrieval efficiency of genomics data and corresponding documentation.

GOV2 Test Collection

This is one of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.

HARD

Obtain High Accuracy Retrieval from Documents by leveraging the searcher's context.

Interactive Track

Study user interaction with text retrieval systems.

Knowledge Base Acceleration

Study algorithms that improve the efficiency of human knowledge base curation.

Legal Track

Study retrieval systems that achieve high recall for the legal-documents use case.

Medical Track

Explore unstructured search performance over patient record data.

Microblog Track

Examine how well real-time information needs are satisfied on microblogging sites.

Million Query Track

Explore ad-hoc retrieval over a large set of queries.

Novelty Track

Investigate systems' abilities to locate new (non-redundant) information.

NTCIR Test Collection

This is a collection of a wide variety of datasets, ranging from ad-hoc collections, Chinese IR collections, and mobile clickthrough collections to medical collections. Its focus is mostly on East Asian languages and cross-language information retrieval.

Question Answering Track

Test systems that scale beyond document retrieval, to retrieve answers to factoid, list and definition type questions.

Relevance Feedback Track

For deep evaluation of relevance feedback processes.

Reuters Corpora

The corpora are now available through NIST. They include the following:

Robust Track

Study the effectiveness of individual topics.

Session Track

Develop methods for measuring multiple-query sessions where information needs drift.

SPAM Track

Benchmark spam filtering approaches.

Stanford List

Tasks Track

Test whether systems can infer the possible tasks users might be trying to accomplish with a query.

Temporal Summarization Track

Develop systems that allow users to efficiently monitor the information associated with an event over time.

Terabyte Track

Test the scalability of IR systems to large-scale collections.

TREC Collections

TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which consists of a dataset for testing a specific task. The tracks, along with suggested use cases, are:

University of Tennessee Knoxville

Web Track

Explore information seeking behaviors common in general web search.

Talks (14 items)

Beware online "filter bubbles"

Eli Pariser (Author of The Filter Bubble, TED Talk).

Challenges in Building Large-Scale Information ...

Jeff Dean (WSDM Conference, 2009).

Do we have the right to be forgotten?

Michael Douglas [TEDx SouthBank].

Extreme Classification: A New Paradigm for Rank...

Manik Varma (Microsoft Research).

Information Experience - Solution to Informatio...

Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the founder of Qwiki, Inc., a technology startup in New York, NY, acquired by Yahoo! in 2013].

Internet Privacy

Dr. Alma Whitten (Google Brussels Tech Talk).

Is Pivot a turning point for web exploration?

Gary Flake, Technical Fellow at Microsoft (TED Talk).

Knowledge-based Information Retrieval with Wiki...

David Milne (The University of Waikato, 2008).

Music Information Retrieval Using Locality Sens...

Steve Tjoa (Rackspace Developers) [This talk shows that IR is not just text and images].

The case for anonymity online

Christopher "moot" Poole (TED Talk) [Christopher "moot" Poole is the founder of 4chan, an online imageboard whose anonymous denizens have spawned the web's most bewildering and influential subculture].

The Functional Web -- The Future of Apps and th...

Liron Shapira (Box Tech Talk).

The moral bias behind your search results

Andreas Ekström (Swedish Author & Journalist, TED Talk).

The next web

Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].

Think your email's private? Think again

Andy Yen (CERN, TED Talk) [This talk covers privacy, which search engines intrude on, and how people can protect it].
