Designing efficient sampling techniques to detect webpage updates

Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their entire local repository synchronized with the Web. We advance the state of the art in sampling-based synchronization techniques by answering a challenging question: given a sampled webpage and its change status, which other webpages are also likely to change? We present a study of various downloading granularities and policies, and propose an adaptive model based on the update history and the popularity of the webpages. We run extensive experiments on a large dataset of approximately 300,000 webpages to demonstrate that more updated webpages are most likely to be found in the current or upper directories of the changed samples. Moreover, the adaptive strategies outperform the non-adaptive one in detecting important changes.
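
The directory-based finding above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `directory_of` and `candidates_near`, the example URLs, and the "closest directory first" ordering are all assumptions made for illustration. Given a sampled page that was found to have changed, the sketch selects the other repository pages that live in the same directory or in one of its upper directories on the same host.

```python
from urllib.parse import urlsplit
import posixpath


def directory_of(url):
    """Return (host, directory path) for a URL, e.g. ('a.org', '/x/y')."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # A path ending in "/" names a directory; otherwise drop the file name.
    if path.endswith("/"):
        directory = path.rstrip("/")
    else:
        directory = posixpath.dirname(path).rstrip("/")
    return parts.netloc, directory


def candidates_near(changed_url, repository_urls):
    """Hypothetical helper: pages in the changed sample's directory or above it."""
    host, directory = directory_of(changed_url)

    # Collect the current directory and every ancestor up to the site root ("").
    ancestors = set()
    current = directory
    while True:
        ancestors.add(current)
        if current == "":
            break
        current = posixpath.dirname(current).rstrip("/")

    selected = []
    for url in repository_urls:
        if url == changed_url:
            continue
        other_host, other_dir = directory_of(url)
        if other_host == host and other_dir in ancestors:
            selected.append(url)

    # Assumed ordering: deeper directories are closer to the changed sample.
    selected.sort(key=lambda u: -directory_of(u)[1].count("/"))
    return selected


repository = [
    "http://a.org/x/y/p1.html",
    "http://a.org/x/y/p2.html",
    "http://a.org/x/q.html",
    "http://a.org/index.html",
    "http://a.org/z/r.html",
    "http://b.org/x/y/s.html",
]
to_recrawl = candidates_near("http://a.org/x/y/p1.html", repository)
print(to_recrawl)
```

In this sketch, pages in sibling directories (`/z/`) and on other hosts are skipped. An adaptive variant in the spirit of the abstract would additionally weight candidates by update history and popularity, which is not modeled here.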

Original language: English (US)
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 1147-1148
Number of pages: 2
State: Published - Oct 22 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: May 8 2007 → May 12 2007

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
