Focused crawling using context graphs

M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles, M. Gori

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

358 Citations (Scopus)

Abstract

Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently cooccur with relevant pages. Our algorithm further leverages the existing capability of large search engines to provide partial reverse crawling capabilities. Our algorithm shows significant performance improvements in crawling efficiency over standard focused crawling.
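
The credit-assignment idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the crawl loop, so everything below is an assumption made for illustration: the function `focused_crawl`, the toy link graph `TOY_LINKS`, and the lookup table `TOY_LAYERS`, which stands in for a learned context model that estimates how many links separate a page from a relevant one. The sketch only shows how a frontier ordered by that estimate can prefer a less obvious path that ends at a target page over links from an off-topic page.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def focused_crawl(
    seeds: List[str],
    links: Dict[str, List[str]],
    estimate_layer: Callable[[str], int],
    budget: int,
) -> List[str]:
    """Visit up to `budget` pages from a priority frontier.

    Each outgoing link inherits the estimated layer of the page it was found
    on (hypothetical: estimated link distance to a relevant page), and the
    link with the smallest inherited layer is followed first.
    """
    frontier: List[Tuple[int, int, str]] = []  # (layer, insertion order, url)
    seen: Set[str] = set()
    order = 0  # tie-breaker so equal layers are served first-in, first-out
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (0, order, url))
            order += 1

    visited: List[str] = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        layer = estimate_layer(url)  # stand-in for classifying the fetched page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (layer, order, target))
                order += 1
    return visited


# Toy in-memory "web" (illustrative data, not from the paper).
TOY_LINKS: Dict[str, List[str]] = {
    "seed": ["a", "b"],
    "a": ["a1"],       # off-topic branch
    "b": ["hub"],      # two links away from the target
    "hub": ["target"],
    "a1": [],
    "target": [],
}
# Hypothetical layer estimates; 0 means the page itself is relevant.
TOY_LAYERS: Dict[str, int] = {
    "seed": 3, "a": 9, "b": 2, "hub": 1, "target": 0, "a1": 9,
}

crawl_order = focused_crawl(["seed"], TOY_LINKS, TOY_LAYERS.__getitem__, budget=5)
print(crawl_order)
```

With a budget of five pages, the sketch reaches `target` through `b` and `hub` and never spends a fetch on `a1`, the link found on the off-topic page `a`. A real crawler would replace the lookup table with a trained model and the dictionary with HTTP fetches.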

Original language: English (US)
Title of host publication: Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00
Pages: 527-534
Number of pages: 8
State: Published - Dec 1 2000
Event: 26th International Conference on Very Large Data Bases, VLDB 2000 - Cairo, Egypt
Duration: Sep 10 2000 to Sep 14 2000

Publication series

Name: Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00

Other

Other: 26th International Conference on Very Large Data Bases, VLDB 2000
Country: Egypt
City: Cairo
Period: 9/10/00 to 9/14/00

Fingerprint

Search engines
Graph
World Wide Web
Currency
Search engine
Performance improvement
Leverage
Assignment
Expenses
Credit

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Information Systems
  • Software
  • Information Systems and Management

Cite this

Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L., & Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00 (pp. 527-534). (Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00).
Diligenti, M. ; Coetzee, F. M. ; Lawrence, S. ; Giles, C. L. ; Gori, M. / Focused crawling using context graphs. Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00. 2000. pp. 527-534 (Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00).
@inproceedings{2049f32dda4f4136a93c59ca67ae2a87,
title = "Focused crawling using context graphs",
abstract = "Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently cooccur with relevant pages. Our algorithm further leverages the existing capability of large search engines to provide partial reverse crawling capabilities. Our algorithm shows significant performance improvements in crawling efficiency over standard focused crawling.",
author = "M. Diligenti and Coetzee, {F. M.} and S. Lawrence and Giles, {C. L.} and M. Gori",
year = "2000",
month = "12",
day = "1",
language = "English (US)",
isbn = "1558607153",
series = "Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00",
pages = "527--534",
booktitle = "Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00",

}

Diligenti, M, Coetzee, FM, Lawrence, S, Giles, CL & Gori, M 2000, Focused crawling using context graphs. in Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00. Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00, pp. 527-534, 26th International Conference on Very Large Data Bases, VLDB 2000, Cairo, Egypt, 9/10/00.

TY - GEN

T1 - Focused crawling using context graphs

AU - Diligenti, M.

AU - Coetzee, F. M.

AU - Lawrence, S.

AU - Giles, C. L.

AU - Gori, M.

PY - 2000/12/1

Y1 - 2000/12/1

N2 - Maintaining currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size and dynamic content of the web. Focused crawlers aim to search only the subset of the web related to a specific category, and offer a potential solution to the currency problem. The major problem in focused crawling is performing appropriate credit assignment to different documents along a crawl path, such that short-term gains are not pursued at the expense of less-obvious crawl paths that ultimately yield larger sets of valuable pages. To address this problem we present a focused crawling algorithm that builds a model for the context within which topically relevant pages occur on the web. This context model can capture typical link hierarchies within which valuable pages occur, as well as model content on documents that frequently cooccur with relevant pages. Our algorithm further leverages the existing capability of large search engines to provide partial reverse crawling capabilities. Our algorithm shows significant performance improvements in crawling efficiency over standard focused crawling.

UR - http://www.scopus.com/inward/record.url?scp=70350672544&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350672544&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1558607153

SN - 9781558607156

T3 - Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00

SP - 527

EP - 534

BT - Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00

ER -

Diligenti M, Coetzee FM, Lawrence S, Giles CL, Gori M. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00. 2000. p. 527-534. (Proceedings of the 26th International Conference on Very Large Data Bases, VLDB'00).