From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling

Sen Xu, Anuj Jaiswal, Xiao Zhang, Alexander Klippel, Prasenjit Mitra, Alan MacEachren

Research output: Contribution to journal, Conference article

2 Citations (Scopus)

Abstract

How does spatial language vary regionally? This study investigates the possibility of exploring regional linguistic variation in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because the World Wide Web makes content sharing fast, it has quickly become a hotbed for volunteered spatial language text, such as directions on hotels' websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of this data source are appealing, but collecting and analyzing large quantities of spatially distributed data remains challenging. By automatically crawling, classifying, and geo-referencing web documents that contain route directions, we built the SARD Corpus, which covers the U.S., the U.K., and Australia. We implement a semantic categorical analysis scheme to explore regional variation in the usage of cardinal versus relative directions. Preliminary results show both similarities and differences at the national level and geographic patterns at the regional level. The design and implementation of a large-scale geo-referenced corpus built from Web documents offer a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.
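The abstract does not spell out the semantic categorical analysis scheme, so the following is only a minimal sketch of the general idea of comparing cardinal and relative direction usage across regions. The term lists, the function names `count_direction_terms` and `cardinal_share_by_region`, and the two sample sentences are all assumptions made for illustration; none of them come from the SARD Corpus or the paper.

```python
import re
from collections import Counter

# Hypothetical term lists, chosen for illustration only; the paper's actual
# categories are not given in the abstract.
CARDINAL_TERMS = {"north", "south", "east", "west"}
RELATIVE_TERMS = {"left", "right", "straight"}


def count_direction_terms(text):
    """Count cardinal and relative direction tokens in one route description."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in CARDINAL_TERMS:
            counts["cardinal"] += 1
        elif token in RELATIVE_TERMS:
            counts["relative"] += 1
    return counts


def cardinal_share_by_region(documents):
    """Map each region to the fraction of its direction terms that are cardinal.

    `documents` is an iterable of (region, text) pairs. Regions whose texts
    contain no direction terms at all are left out of the result.
    """
    totals = {}
    for region, text in documents:
        totals.setdefault(region, Counter()).update(count_direction_terms(text))
    shares = {}
    for region, counts in totals.items():
        n = counts["cardinal"] + counts["relative"]
        if n:
            shares[region] = counts["cardinal"] / n
    return shares


# Invented sample sentences, not taken from any real corpus.
sample = [
    ("US", "Head north on Main St, then go east on Route 9."),
    ("UK", "Turn left at the roundabout, then right at the church."),
]
shares = cardinal_share_by_region(sample)
```

A simple keyword match like this ignores ambiguity (for example, "right" in "right after the bridge"), which is one reason a real categorical scheme would likely need more than a word list.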

Original language: English (US)
Pages (from-to): 49-52
Number of pages: 4
Journal: CEUR Workshop Proceedings
Volume: 620
State: Published - Dec 1 2010
Event: Workshop on Computational Models of Spatial Language Interpretation at Spatial Cognition 2010, COSLI 2010 - Portland, OR, United States
Duration: Aug 15 2010 to Aug 15 2010

Fingerprint

Linguistics
World Wide Web
Sampling
Hotels
Websites
Semantics

All Science Journal Classification (ASJC) codes

  • Computer Science(all)

Cite this

@article{de97449553c146f7aeb8dba75529ff42,
title = "From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling",
abstract = "How spatial language varies regionally? This study investigates the possibility of exploring regional linguistic variations in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because of the fast content sharing functionality of the World Wide Web, it quickly becomes a hotbed for volunteered spatial language text, such as directions on hotels' Websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of the data source is appealing while collecting and analyzing large quantities of spatially distributed data is still challenging. Through automated crawling, classifying and geo-referencing web documents containing route directions from the web, the SARD Corpus has been built covering the U.S., the U.K. and Australia. We implement a semantic categorical analysis scheme to explore regional variations in cardinal versus relative direction usages. Preliminary results show both similarity and differences at national level and geographic patterns at regional level. The design and implementation of building a geo-referenced large-scale corpus from Web documents offers a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.",
author = "Sen Xu and Anuj Jaiswal and Xiao Zhang and Alexander Klippel and Prasenjit Mitra and Alan MacEachren",
year = "2010",
month = "12",
day = "1",
language = "English (US)",
volume = "620",
pages = "49--52",
journal = "CEUR Workshop Proceedings",
issn = "1613-0073",
publisher = "CEUR-WS",

}

TY - JOUR

T1 - From data collection to analysis - Exploring regional linguistic variation in route directions by spatially-stratified web sampling

AU - Xu, Sen

AU - Jaiswal, Anuj

AU - Zhang, Xiao

AU - Klippel, Alexander

AU - Mitra, Prasenjit

AU - MacEachren, Alan

PY - 2010/12/1

Y1 - 2010/12/1

N2 - How spatial language varies regionally? This study investigates the possibility of exploring regional linguistic variations in spatial language by collecting and analyzing a Spatially-strAtified Route Direction Corpus (SARD Corpus) from volunteered spatial language text on the Web. Because of the fast content sharing functionality of the World Wide Web, it quickly becomes a hotbed for volunteered spatial language text, such as directions on hotels' Websites. These route directions can serve as a representation of everyday spatial language usage on the WWW. The spatial coverage and abundance of the data source is appealing while collecting and analyzing large quantities of spatially distributed data is still challenging. Through automated crawling, classifying and geo-referencing web documents containing route directions from the web, the SARD Corpus has been built covering the U.S., the U.K. and Australia. We implement a semantic categorical analysis scheme to explore regional variations in cardinal versus relative direction usages. Preliminary results show both similarity and differences at national level and geographic patterns at regional level. The design and implementation of building a geo-referenced large-scale corpus from Web documents offers a methodological contribution to corpus linguistics, spatial cognition, and the GISciences.

UR - http://www.scopus.com/inward/record.url?scp=84889004553&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889004553&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84889004553

VL - 620

SP - 49

EP - 52

JO - CEUR Workshop Proceedings

JF - CEUR Workshop Proceedings

SN - 1613-0073

ER -