Abstract
This paper identifies and explores the problem of seed selection in a web-scale crawler. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler discovers and can yield a collection with more "good" pages and fewer "bad" pages. Based on an analysis of the graph structure of the web, we propose several seed selection algorithms. The effectiveness of these algorithms is demonstrated by our experimental results on real web data.
Original language | English (US) |
---|---|
Title of host publication | WWW'09 - Proceedings of the 18th International World Wide Web Conference |
Pages | 1089-1090 |
Number of pages | 2 |
State | Published - Dec 1 2009 |
Event | 18th International World Wide Web Conference, WWW 2009 - Madrid, Spain; Duration: Apr 20 2009 → Apr 24 2009 |
Other
Other | 18th International World Wide Web Conference, WWW 2009 |
---|---|
Country/Territory | Spain |
City | Madrid |
Period | 4/20/09 → 4/24/09 |
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications