A large-scale study of robots.txt

Yang Sun, Ziming Zhuang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Search engines largely rely on Web robots to collect information from the Web. Due to the unregulated, open-access nature of the Web, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. With our focused crawler, we investigate 7,593 websites from the education, government, news, and business domains. Five successive crawls were conducted to study temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at Web scale. The results also show that the usage of robots.txt has increased over time.
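The paper itself contains no code, but the compliance check it describes can be illustrated with Python's standard urllib.robotparser module. The sketch below is a minimal, hypothetical example: the robots.txt rules, user-agent names, and URLs are invented for illustration and are not data from the study.

```python
# Minimal sketch: how an ethical crawler might honor the Robots
# Exclusion Protocol before fetching a page. All rules and URLs
# here are hypothetical examples, not taken from the paper.
from urllib.robotparser import RobotFileParser

# Example robots.txt content a server might publish.
rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: GoodBot
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A generic robot is barred from /private/ but may fetch other paths.
print(parser.can_fetch("*", "http://example.com/private/data.html"))  # False
print(parser.can_fetch("*", "http://example.com/index.html"))         # True

# GoodBot has an empty Disallow rule, so every path is permitted.
print(parser.can_fetch("GoodBot", "http://example.com/private/data.html"))  # True
```

In practice a crawler would point the parser at a site's live robots.txt (via set_url and read) rather than parsing inline rules; since the protocol is advisory, honoring the result is a choice made by the robot, which is exactly the compliance behavior the study measures.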

Original language: English (US)
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 1123-1124
Number of pages: 2
State: Published - 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: May 8, 2007 – May 12, 2007

Publication series

Name: 16th International World Wide Web Conference, WWW2007

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software
