Information Retrieval

W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling approach to information retrieval and details the probabilistic perspective on the field extensively).

Books

Mining the Web: Analysis of Hypertext and Semi ...

S. Chakrabarti. Morgan Kaufmann, 2002.

Books

Modern Information Retrieval

R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.

Books

Search Engines: Information Retrieval in Practice

Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (A very detailed book for readers who want to know how search engines work).

Books

Text Information Retrieval Systems

C.T. Meadow, B.R. Boyce, D.H. Kraft, C.L. Barry. Academic Press, 2007 (library/information science perspective).

Books

Conferences(9 items)

CIKM

Conference on Information and Knowledge Management.

Conferences

CLEF

Conference and Labs of the Evaluation Forum.

Conferences

ECIR

European Conference on Information Retrieval.

Conferences

FIRE

Forum for Information Retrieval Evaluation.

Conferences

NTCIR

NII Testbeds and Community for Information access Research.

Conferences

SIGIR

Special Interest Group on Information Retrieval.

Conferences

TREC

Text REtrieval Conference.

Conferences

WSDM

Web Search and Data Mining Conference.

Conferences

WWW

World Wide Web Conference.

Conferences

Courses(10 items)

11-442 / 11-642: Search Engines

Jamie Callan (CMU).

Courses

600.466: Information Retrieval and Web Agents

David Yarowsky (Johns Hopkins University).

Courses

Coursera - Text Retrieval and Search Engines

Prof. ChengXiang Zhai (University of Illinois at Urbana-Champaign).

Courses

CS 172: Introduction to Information Retrieval

Vagelis Hristidis (University of California - Riverside).

Courses

CS 276 / LING 286: Information Retrieval and We...

Chris Manning and Pandu Nayak (Stanford University).

Courses

CS 371R: Information Retrieval and Web Search

Raymond J. Mooney (University of Texas at Austin).

Courses

CS 435: Information Retrieval, Discovery, and D...

Andrea LaPaugh (Princeton University).

Courses

INF384H / CS395T / INF350E: Concepts of Informa...

Matthew Lease (University of Texas at Austin).

Courses

Information Retrieval and Data Mining

Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI).

Courses

SIMS 240: Principles of Information Retrieval

Ray R. Larson (UC Berkeley).

Courses

Datasets(44 items)

20 Newsgroup dataset

This data set consists of 20,000 newsgroup posts taken from 20 newsgroup topics.

Datasets

Advanced Cross Lingual Information Retrieval an...

This dataset is used for cross-lingual question answering, but the task is more complex than that of the CLQA dataset.

Datasets

Blog

Explore information seeking behavior in the blogosphere.

Datasets

Chemical IR

Address challenges in building large chemical testbeds for chemical IR.

Datasets

Clinical Decision Support

Investigate techniques to link medical cases to information relevant for patient care.

Datasets

CLIR Test Collections

This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:

Datasets

CMU List

Datasets

Conference and Labs of the Evaluation Forum (CL...

It contains a multi-lingual document collection. The test suite includes:

Datasets

Confusion

Study the known-item search problem.

Datasets

Contextual Suggestion

Investigate search techniques for complex information needs that depend on context and user interests.

Datasets

Cranfield Collections

This is one of the first collections in the IR domain. The dataset is too small for any statistical significance analysis, but it is nevertheless suitable for pilot runs.

Datasets

Cross Language Q&A (CLQA) dataset collection

It supports the following bilingual and monolingual tasks:

Datasets

Crowdsourcing

Explore crowdsourcing methods for performing and evaluating search.

Datasets

DBPedia

Linked data web.

Datasets

Document Understanding Conference (DUC) datasets

Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.

Datasets

English Gigaword Fifth Edition

This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.

Datasets

Enterprise

Study search over an organization's data.

Datasets

Entity

Perform entity-related search (find entities and their properties) on Web data.

Datasets

Federated Web Search

Study how well results from various search services can be merged.

Datasets

Filtering

Make a binary decision on whether to retrieve each new incoming document, given a stable information need.

Datasets

Genomics

Study retrieval efficiency of genomics data and corresponding documentation.

Datasets

GOV2 Test Collection

This is one of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.

Datasets

HARD

Obtain High Accuracy Retrieval from Documents by leveraging the searcher's context.

Datasets

Interactive Track

Study user interaction with text retrieval systems.

Datasets

Knowledge base acceleration

Study algorithms that improve the efficiency of human knowledge base curators.

Datasets

Legal Track

Study retrieval systems that achieve high recall for the legal documents use case.

Datasets

Medical Track

Explore unstructured search performance over patient record data.

Datasets

Microblog Track

Examine how well real-time information needs are satisfied on microblogging sites.

Datasets

Million Query Track

Explore ad-hoc retrieval over a large set of queries.

Datasets

Novelty Track

Investigate systems' abilities to locate new (non-redundant) information.

Datasets

NTCIR Test Collection

This is a collection of a wide variety of datasets, ranging from ad-hoc collections, Chinese IR collections, and mobile clickthrough collections to medical collections. The focus is mostly on East Asian languages and cross-language information retrieval.

Datasets

Question Answering Track

Test systems that scale beyond document retrieval, to retrieve answers to factoid, list and definition type questions.

Datasets

Relevance Feedback Track

For deep evaluation of relevance feedback processes.

Datasets

Reuters Corpora

The corpora are now available through NIST. They include the following:

Datasets

Robust Track

Study the effectiveness of individual topics.

Datasets

Session Track

Develop methods for measuring multiple-query sessions where information needs drift.

Datasets

SPAM Track

Benchmark spam filtering approaches.

Datasets

Stanford List

Datasets

Tasks Track

Test whether systems can induce the possible tasks users might be trying to accomplish with a query.

Datasets

Temporal Summarization Track

Develop systems that allow users to efficiently monitor the information associated with an event over time.

Datasets

Terabyte Track

Test the scalability of IR systems to large-scale collections.

Datasets

TREC Collections

TREC is the benchmark used to evaluate most IR and Web search algorithms. It has several tracks, each consisting of a dataset for testing a specific task. The tracks, along with suggested use cases, are:

Datasets

University of Tennessee Knoxville

Datasets

Web Track

Explore information seeking behaviors common in general web search.

Datasets

Software(4 items)

Apache Lucene

Open-source search engine that can be used to test information retrieval algorithms. Twitter uses this core for its real-time search.

Software

Indri Search Engine

Another open-source search engine, a competitor of Apache Lucene.

Software

Lemur Toolkit

Open Source Toolkit for research in Language Modeling, filtering and categorization.

Software

The Lemur Project

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.

Software

Talks(14 items)

Beware online "filter bubbles"

Eli Pariser (author of The Filter Bubble, TED Talk).

Talks

Challenges in Building Large-Scale Information ...

Jeff Dean (WSDM Conference, 2009).

Talks

Do we have the right to be forgotten?

Michael Douglas [TEDx SouthBank].

Talks

Extreme Classification: A New Paradigm for Rank...

Manik Varma (Microsoft Research).

Talks

Information Experience - Solution to Informatio...

Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the founder of Qwiki, Inc., a technology startup in New York, NY, acquired by Yahoo! in 2013].

Talks

Internet Privacy

Dr. Alma Whitten (Google Brussels Tech Talk).

Talks

Is Pivot a turning point for web exploration?

Gary Flake, Technical Fellow at Microsoft (TED Talks).

Talks

Knowledge-based Information Retrieval with Wiki...

David Milne (The University of Waikato, 2008).

Talks

Music Information Retrieval Using Locality Sens...

Steve Tjoa (RackSpace Developers) [This talk shows that IR is not just text and images].

Talks

The case for anonymity online

Christopher "moot" Poole (TED Talk) [Christopher "moot" Poole is the founder of 4chan, an online imageboard whose anonymous denizens have spawned the web's most bewildering and influential subculture].

Talks

The Functional Web -- The Future of Apps and th...

Liron Shapira (Box Tech Talk).

Talks

The moral bias behind your search results

Andreas Ekström (Swedish Author & Journalist, TED Talk).

Talks

The next web

Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].

Talks

Think your email's private? Think again

Andy Yen (CERN, TED Talk) [This talk is about privacy, which search engines intrude on, and how people can protect it].

Talks