Information Retrieval from Lip Reading.
Information Extraction.
Google Research.
Dr. Edel Garcia.
Bias in Relevance.
Sketch-based search.
B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009.
Ed Greengrass, 2000. (Comprehensive survey of conventional information retrieval, predating the deep learning era).
C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008. (A good first book for getting started with information retrieval).
G.G. Chowdhury. Neal-Schuman, 2003. (Intended for students of library and information studies).
W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling approach to information retrieval, with an extensive treatment of the probabilistic perspective).
S. Chakrabarti. Morgan Kaufmann, 2002.
R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.
Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (A great, very detailed book for readers interested in how search engines work).
C.T. Meadow, B.R. Boyce, D.H. Kraft, C.L. Barry. Academic Press, 2007 (library/information science perspective).
Conference on Information and Knowledge Management (CIKM).
Conference and Labs of the Evaluation Forum (CLEF).
European Conference on Information Retrieval (ECIR).
Forum for Information Retrieval Evaluation (FIRE).
NII Testbeds and Community for Information access Research (NTCIR).
Special Interest Group on Information Retrieval (SIGIR).
Text REtrieval Conference (TREC).
Web Search and Data Mining Conference (WSDM).
World Wide Web Conference (WWW).
Jamie Callan (CMU).
David Yarowsky (Johns Hopkins University).
Prof. ChengXiang Zhai (University of Illinois at Urbana-Champaign).
Vagelis Hristidis (University of California - Riverside).
Chris Manning and Pandu Nayak (Stanford University).
Raymond J. Mooney (University of Texas at Austin).
Andrea LaPaugh (Princeton University).
Matthew Lease (University of Texas at Austin).
Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI).
Ray R. Larson (UC Berkeley).
This dataset consists of 20,000 posts taken from 20 newsgroup topics.
The dataset is used for cross-lingual question answering; the task is more complex than that of the CLQA dataset.
Explore information seeking behavior in the blogosphere.
Address challenges in building large testbeds for chemical IR.
Investigate techniques to link medical cases to information relevant for patient care.
This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:
It contains a multi-lingual document collection. The test suite includes:
Study the Known Item Search problem.
Investigate search techniques for complex information needs (based on context and user interests).
This is one of the first collections in the IR domain. The dataset is too small for statistically significant analysis, but it is suitable for pilot runs.
It supports the following bilingual and monolingual tasks:
Explore crowdsourcing methods for performing and evaluating search.
Linked data web.
Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.
This dataset is a comprehensive archive of English newswire text, including headlines, datelines, and articles.
Study search over an organization's data.
Perform entity-related search (find entities and their properties) on Web data.
Study merge performance for results from various search services.
Make binary retrieval decisions on new incoming documents, given a stable information need.
Study retrieval efficiency of genomics data and corresponding documentation.
This is one of the largest Web document collections, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.
Obtain High Accuracy Retrieval from Documents (HARD) by leveraging the searcher's context.
Study user interaction with text retrieval systems.
Study algorithms that help humans maintain a knowledge base efficiently.
Study retrieval systems that need high recall, for the legal documents use case.
Explore unstructured search performance over patient record data.
Examine satisfaction of real-time information needs on microblogging sites.
Explore ad-hoc retrieval over a large set of queries.
Investigate systems' abilities to locate new (non-redundant) information.
This is a collection of a wide variety of datasets, ranging from ad-hoc and Chinese IR collections to mobile clickthrough and medical collections. The focus is mostly on East Asian languages and cross-language information retrieval.
Test systems that scale beyond document retrieval to retrieve answers to factoid, list, and definition questions.
For deep evaluation of relevance feedback processes.
The corpus is now available through NIST. It includes the following:
Study retrieval effectiveness on individual topics.
Develop methods for measuring multiple-query sessions where information needs drift.
Benchmark spam filtering approaches.
Test whether systems can induce the possible tasks a user might be trying to accomplish with a query.
Develop systems that allow users to efficiently monitor the information associated with an event over time.
Test the scalability of IR systems to large-scale collections.
TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which consists of a dataset for testing a specific task. The tracks, along with suggested use cases, are:
Explore information seeking behaviors common in general web search.
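TREC-style tracks are scored by comparing a system's ranked result list against human relevance judgments (qrels). As a toy illustration only (the document ids and judgments below are invented, not from any TREC collection), average precision, the per-topic measure underlying MAP, can be computed like this:

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against a judged relevant set."""
    relevant_ids = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit's cutoff
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy run: documents d2 and d4 are judged relevant for this topic.
print(average_precision(["d1", "d2", "d3", "d4"], ["d2", "d4"]))
```

Averaging this value over all topics in a track gives mean average precision (MAP); real evaluations typically use NIST's trec_eval tool rather than hand-rolled scripts.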
Open-source search engine that can be used to test information retrieval algorithms. Twitter uses this core for its real-time search.
Another open-source search engine, a competitor of Apache Lucene.
Open-source toolkit for research in language modeling, filtering, and categorization.
The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.
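All of the engines and toolkits above are built around an inverted index, which maps each term to the documents that contain it. A minimal Python sketch of the idea (purely illustrative — this is not any toolkit's actual API, and the documents are made up):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Conjunctive (AND) boolean retrieval: docs containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "information retrieval systems",
        2: "text retrieval conference",
        3: "web search engines"}
idx = build_inverted_index(docs)
print(search(idx, "text retrieval"))  # only the doc containing both terms
```

Production engines like Lucene add compressed postings lists, positional data, and ranked (rather than boolean) retrieval on top of this same structure.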
Eli Pariser (author of The Filter Bubble, TED Talk).
Jeff Dean (WSDM Conference, 2009).
Michael Douglas [TEDx SouthBank].
Manik Varma (Microsoft Research).
Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the founder of Qwiki, Inc., a technology startup in New York, NY, acquired by Yahoo! in 2013].
Dr. Alma Whitten (Google Brussels Tech Talk).
Gary Flake, Technical Fellow at Microsoft (TED Talks).
David Milne (The University of Waikato, 2008).
Steve Tjoa (Rackspace Developers) [This talk shows that IR is not just about text and images].
Christopher "moot" Poole (TED Talk) [Christopher "moot" Poole is the founder of 4chan, an online imageboard whose anonymous denizens have spawned the web's most bewildering and influential subculture].
Liron Shapira (Box Tech Talk).
Andreas Ekström (Swedish Author & Journalist, TED Talk).
Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].
Andy Yen (CERN, TED Talk) [This talk is about privacy, which search engines intrude upon, and how people can protect it].