A large-scale study of robots.txt

Yang Sun, Ziming Zhuang, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

24 Citations (Scopus)

Abstract

Search engines largely rely on Web robots to collect information from the Web. Due to the unregulated, open-access nature of the Web, robot activities are extremely diverse. Such crawling activities can be regulated from the server side by deploying the Robots Exclusion Protocol in a file called robots.txt. Although it is not an enforced standard, ethical robots (including many commercial ones) follow the rules specified in robots.txt. Using our focused crawler, we investigate 7,593 websites from the education, government, news, and business domains. Five crawls were conducted in succession to study temporal changes. Through statistical analysis of the data, we present a survey of the usage of Web robot rules at Web scale. The results also show that the usage of robots.txt has increased over time.
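
To make the mechanism concrete, the sketch below shows how a compliant robot applies Robots Exclusion Protocol rules before fetching pages. This is not the authors' focused crawler; the rules, user-agent names, and paths are purely illustrative, and the check is done with Python's standard-library urllib.robotparser.

# Minimal sketch of Robots Exclusion Protocol compliance checking.
# Not the paper's crawler; rules, agent names, and paths are illustrative.
from urllib import robotparser

# Hypothetical robots.txt: block "BadBot" everywhere, keep all other
# agents out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
# In practice the file is fetched from <site>/robots.txt via set_url()/read();
# here an in-memory copy is parsed to keep the example self-contained.
rp.parse(ROBOTS_TXT.splitlines())

for agent in ("GoodBot", "BadBot"):
    for path in ("/index.html", "/private/data.html"):
        verdict = "allowed" if rp.can_fetch(agent, path) else "disallowed"
        print(f"{agent:8s} {path:22s} -> {verdict}")

An ethical crawler of the kind studied here would run a check like can_fetch() before every request and skip disallowed URLs; nothing in HTTP enforces this, which is why robots.txt is advisory rather than an enforced standard.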

Original language: English (US)
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 1123-1124
Number of pages: 2
DOIs: https://doi.org/10.1145/1242572.1242726
ISBN (Print): 1595936548, 9781595936547
State: Published - Oct 22 2007
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
Duration: May 8 2007 to May 12 2007

Publication series

Name: 16th International World Wide Web Conference, WWW2007

Other

Other: 16th International World Wide Web Conference, WWW2007
Country: Canada
City: Banff, AB
Period: 5/8/07 to 5/12/07

Fingerprint

  • Software agents
  • Robots
  • Search engines
  • Websites
  • Statistical methods
  • Servers
  • Education
  • Network protocols
  • Industry

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software

Cite this

Sun, Y., Zhuang, Z., & Giles, C. L. (2007). A large-scale study of robots.txt. In 16th International World Wide Web Conference, WWW2007 (pp. 1123-1124). (16th International World Wide Web Conference, WWW2007). https://doi.org/10.1145/1242572.1242726