Analysis of lexical signatures for improving information persistence on the World Wide Web

Seung Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz

Research output: Contribution to journal (Review article)

22 Citations (Scopus)

Abstract

A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.
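The static selection criteria named in the abstract (TF, DF, and TFIDF) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tokenizer, the signature length `k=5`, the `log(N / df)` form of inverse document frequency, the alphabetical tie-breaking, and the function names are all assumptions made for illustration. The hybrid methods and Test & Select are not covered here, since the abstract does not specify them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequencies(corpus):
    """For each term, count how many documents in the corpus contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return df


def lexical_signature(doc, corpus, method="tfidf", k=5):
    """Pick k terms from doc according to a hypothetical scoring rule.

    method="tf":    prefer terms that occur most often in doc
    method="df":    prefer terms that occur in the fewest corpus documents
    method="tfidf": prefer terms with high tf * log(N / df)
    """
    tf = Counter(tokenize(doc))
    df = document_frequencies(corpus)
    n = len(corpus)

    def score(term):
        # Guard against df == 0 when doc is not part of the corpus.
        doc_freq = max(df[term], 1)
        if method == "tf":
            return tf[term]
        if method == "df":
            return -doc_freq
        if method == "tfidf":
            return tf[term] * math.log(n / doc_freq)
        raise ValueError(f"unknown method: {method}")

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(tf, key=lambda term: (-score(term), term))
    return ranked[:k]


if __name__ == "__main__":
    # Toy corpus, invented for this example only.
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the cat chased the dog",
    ]
    for method in ("tf", "df", "tfidf"):
        print(method, lexical_signature(docs[0], docs, method=method, k=2))
```

On a toy corpus like this one, the TF rule favors common words such as "the", while the DF rule favors the rarest term. That matches the trade-off the abstract describes between finding relevant documents and uniquely identifying one.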

Original language: English (US)
Pages (from-to): 540-572
Number of pages: 33
Journal: ACM Transactions on Information Systems
Volume: 22
Issue number: 4
DOI: 10.1145/1028099.1028101
State: Published - Oct 1 2004

Fingerprint

Search engines
World Wide Web
Websites
Degradation
Persistence

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Business, Management and Accounting (all)
  • Computer Science Applications

Cite this

Park, Seung Taek ; Pennock, David M. ; Giles, C. Lee ; Krovetz, Robert. / Analysis of lexical signatures for improving information persistence on the World Wide Web. In: ACM Transactions on Information Systems. 2004 ; Vol. 22, No. 4. pp. 540-572.
@article{e7cd0353f29b47249d6ec2e8275f6e47,
title = "Analysis of lexical signatures for improving information persistence on the {World Wide Web}",
abstract = "A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test \& Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.",
author = "Park, {Seung Taek} and Pennock, {David M.} and Giles, {C. Lee} and Krovetz, Robert",
year = "2004",
month = "10",
day = "1",
doi = "10.1145/1028099.1028101",
language = "English (US)",
volume = "22",
pages = "540--572",
journal = "ACM Transactions on Information Systems",
issn = "1046-8188",
publisher = "Association for Computing Machinery (ACM)",
number = "4",

}

Analysis of lexical signatures for improving information persistence on the World Wide Web. / Park, Seung Taek; Pennock, David M.; Giles, C. Lee; Krovetz, Robert.

In: ACM Transactions on Information Systems, Vol. 22, No. 4, 01.10.2004, p. 540-572.

TY - JOUR
T1 - Analysis of lexical signatures for improving information persistence on the World Wide Web
AU - Park, Seung Taek
AU - Pennock, David M.
AU - Giles, C. Lee
AU - Krovetz, Robert
PY - 2004/10/1
Y1 - 2004/10/1
AB - A lexical signature (LS) consisting of several key words from a Web document is often sufficient information for finding the document later, even if its URL has changed. We conduct a large-scale empirical study of nine methods for generating lexical signatures, including Phelps and Wilensky's original proposal (PW), seven of our own static variations, and one new dynamic method. We examine their performance on the Web over a 10-month period, and on a TREC data set, evaluating their ability to both (1) uniquely identify the original (possibly modified) document, and (2) locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the Web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. The term-frequency inverse-document-frequency- (TFIDF-) based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates among static methods for generating effective lexical signatures. We propose a dynamic LS generator called Test & Select (TS) to mitigate LS conflict. TS outperforms all eight static methods in terms of both extracting the desired document and finding relevant information, over three different search engines. All LS methods show significant performance degradation as documents in the corpus are edited.

UR - http://www.scopus.com/inward/record.url?scp=9144269133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=9144269133&partnerID=8YFLogxK
U2 - 10.1145/1028099.1028101
DO - 10.1145/1028099.1028101
M3 - Review article
AN - SCOPUS:9144269133
VL - 22
SP - 540
EP - 572
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
SN - 1046-8188
IS - 4
ER -