Mining a search engine's corpus without a query pool

Mingyang Zhang, Nan Zhang, Gautam Das

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Scopus citations

Abstract

Many websites (e.g., WebMD.com, CNN.com) provide keyword search interfaces over a large corpus of documents. Meanwhile, many third parties (e.g., investors, analysts) are interested in learning big-picture analytical information over such a document corpus, but have no direct way of accessing it other than using the highly restrictive web search interface. In this paper, we study how to enable third-party data analytics over a search engine's corpus without the cooperation of its owner - specifically, by issuing a small number of search queries through the web interface. Almost all existing techniques require a pre-constructed query pool - i.e., a small yet comprehensive collection of queries which, if all issued through the search interface, can recall almost all documents in the corpus. The problem with this requirement is that a "good" query pool can only be constructed by someone with very specific knowledge (e.g., size, topic, special terms used, etc.) of the corpus, essentially leading to a chicken-and-egg problem. In this paper, we develop QG-SAMPLER and QG-ESTIMATOR, the first practical pool-free techniques for sampling and aggregate (e.g., SUM, COUNT, AVG) estimation over a search engine's corpus, respectively. Extensive real-world experiments show that our algorithms perform on par with the state-of-the-art pool-based techniques equipped with a carefully tailored query pool, and significantly outperform the latter when the query pool is a mismatch.
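To make the notion of a query pool's recall concrete, the following is a minimal sketch on a toy corpus. It does not implement QG-SAMPLER or QG-ESTIMATOR, whose details the abstract does not give. The corpus, the two pools, and the helper names `search` and `pool_recall` are all hypothetical, and the search function is a simple stand-in for a keyword search interface.

```python
# Toy illustration only: corpus, pools, and helper names are hypothetical.

def search(corpus, query):
    """Return ids of documents that contain the query term as a whole word."""
    return [doc_id for doc_id, text in corpus.items()
            if query in text.lower().split()]

def pool_recall(corpus, pool):
    """Fraction of the corpus returned by at least one query in the pool."""
    recalled = set()
    for query in pool:
        recalled.update(search(corpus, query))
    return len(recalled) / len(corpus)

corpus = {
    1: "flu symptoms and treatment",
    2: "diabetes diet advice",
    3: "flu vaccine schedule",
    4: "heart disease risk factors",
}

# A pool built with knowledge of the corpus topics versus one built without it.
matched_pool = ["flu", "diabetes", "heart"]
mismatched_pool = ["election", "stocks", "flu"]

print(pool_recall(corpus, matched_pool))     # 1.0
print(pool_recall(corpus, mismatched_pool))  # 0.5
```

In this toy setting the mismatched pool reaches only half of the documents, which is the kind of situation the abstract refers to when it says the query pool "is a mismatch".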

Original language: English (US)
Title of host publication: CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management
Pages: 29-38
Number of pages: 10
DOIs: https://doi.org/10.1145/2505515.2505748
State: Published - Dec 11 2013
Event: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013 - San Francisco, CA, United States
Duration: Oct 27 2013 to Nov 1 2013

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
Country: United States
City: San Francisco, CA
Period: 10/27/13 to 11/1/13

All Science Journal Classification (ASJC) codes

  • Decision Sciences(all)
  • Business, Management and Accounting(all)


  • Cite this

    Zhang, M., Zhang, N., & Das, G. (2013). Mining a search engine's corpus without a query pool. In CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (pp. 29-38). (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/2505515.2505748