SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve performance of search services. Existing works have typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them we find the access intervals show high variance, and thus cannot be predicted well with time-series models. On the other hand, client visiting path and semantics can be well captured with probabilistic models and Zipf-law. Based on the findings, we propose SearchGen, a synthetic workload generator to output traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
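The abstract notes that client semantics can be captured with a Zipf law. As a rough illustration only, a Zipf-weighted request trace could be sampled as sketched below. This is not the SearchGen implementation: the function names, the exponent `s`, and the document identifiers are all assumptions made for this sketch.

```python
import random
from collections import Counter


def zipf_weights(n, s=1.0):
    """Return unnormalized Zipf weights: the item at rank r gets 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]


def sample_requests(items, count, s=1.0, seed=0):
    """Draw `count` requests from `items`, which are ordered most popular first."""
    rng = random.Random(seed)  # fixed seed keeps the synthetic trace reproducible
    return rng.choices(items, weights=zipf_weights(len(items), s), k=count)


if __name__ == "__main__":
    # Hypothetical document identifiers, not taken from the CiteSeer logs.
    docs = [f"doc{i}" for i in range(1, 6)]
    trace = sample_requests(docs, count=10_000)
    print(Counter(trace).most_common())
```

A larger `s` concentrates requests on the top-ranked items, while `s` close to 0 approaches a uniform trace.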

Original language: English (US)
Title of host publication: Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Subtitle of host publication: Building and Sustaining the Digital Environment
Pages: 137-146
Number of pages: 10
DOI: https://doi.org/10.1145/1255175.1255203
State: Published - Nov 29 2007
Event: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment - Vancouver, BC, Canada
Duration: Jun 18 2007 to Jun 23 2007

Publication series

Name: Proceedings of the ACM International Conference on Digital Libraries

Other

Other: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
Country: Canada
City: Vancouver, BC
Period: 6/18/07 to 6/23/07

Fingerprint

Digital libraries
technical literature
Search engines
workload
search engine
World Wide Web
Time series
Semantics
Robots
robot
popularity
time series
semantics
Law
Statistical Models
performance

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Li, H., Lee, W. C., Sivasubramaniam, A., & Giles, L. (2007). SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment (pp. 137-146). (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255203
Li, Huajing; Lee, Wang Chien; Sivasubramaniam, Anand; Giles, Lee. / SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines. Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. pp. 137-146 (Proceedings of the ACM International Conference on Digital Libraries).
@inproceedings{41258b035f8d4890afa7dbddfb90091d,
title = "SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines",
abstract = "Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve performance of search services. Existing works have typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them we find the access intervals show high variance, and thus cannot be predicted well with time-series models. On the other hand, client visiting path and semantics can be well captured with probabilistic models and Zipf-law. Based on the findings, we propose SearchGen, a synthetic workload generator to output traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.",
author = "Li, Huajing and Lee, Wang Chien and Sivasubramaniam, Anand and Giles, Lee",
year = "2007",
month = "11",
day = "29",
doi = "10.1145/1255175.1255203",
language = "English (US)",
isbn = "1595936440",
series = "Proceedings of the ACM International Conference on Digital Libraries",
pages = "137--146",
booktitle = "Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007",
}

Li, H, Lee, WC, Sivasubramaniam, A & Giles, L 2007, SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines. in Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. Proceedings of the ACM International Conference on Digital Libraries, pp. 137-146, 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment, Vancouver, BC, Canada, 6/18/07. https://doi.org/10.1145/1255175.1255203

SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines. / Li, Huajing; Lee, Wang Chien; Sivasubramaniam, Anand; Giles, Lee.

Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 137-146 (Proceedings of the ACM International Conference on Digital Libraries).

TY - GEN

T1 - SearchGen

T2 - A synthetic workload generator for scientific literature digital libraries and search engines

AU - Li, Huajing

AU - Lee, Wang Chien

AU - Sivasubramaniam, Anand

AU - Giles, Lee

PY - 2007/11/29

Y1 - 2007/11/29

N2 - Due to the popularity of web applications and their heavy usage, it is important to obtain a good understanding of their workloads in order to improve performance of search services. Existing works have typically focused on generic web workloads without putting emphasis on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. Essential ingredients that contribute to workloads are proposed. Among them we find the access intervals show high variance, and thus cannot be predicted well with time-series models. On the other hand, client visiting path and semantics can be well captured with probabilistic models and Zipf-law. Based on the findings, we propose SearchGen, a synthetic workload generator to output traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.

UR - http://www.scopus.com/inward/record.url?scp=36349033571&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36349033571&partnerID=8YFLogxK

U2 - 10.1145/1255175.1255203

DO - 10.1145/1255175.1255203

M3 - Conference contribution

AN - SCOPUS:36349033571

SN - 1595936440

SN - 9781595936448

T3 - Proceedings of the ACM International Conference on Digital Libraries

SP - 137

EP - 146

BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007

ER -

Li H, Lee WC, Sivasubramaniam A, Giles L. SearchGen: A synthetic workload generator for scientific literature digital libraries and search engines. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 137-146. (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255203