Scheduling shared scans of large data files

Parag Agrawal, Daniel Kifer, Christopher Olston

Research output: Contribution to journalArticle

46 Citations (Scopus)

Abstract

We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files. As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.

Original languageEnglish (US)
Pages (from-to)958-969
Number of pages12
JournalProceedings of the VLDB Endowment
Volume1
Issue number1
DOIs
StatePublished - 2008

Fingerprint

Scheduling
Batch data processing
Processing

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science(all)

Cite this

Agrawal, Parag ; Kifer, Daniel ; Olston, Christopher. / Scheduling shared scans of large data files. In: Proceedings of the VLDB Endowment. 2008 ; Vol. 1, No. 1. pp. 958-969.
@article{7d6b2b37822e4f808f407adf3451cfe3,
title = "Scheduling shared scans of large data files",
abstract = "We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files. As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.",
author = "Parag Agrawal and Daniel Kifer and Christopher Olston",
year = "2008",
doi = "10.14778/1453856.1453960",
language = "English (US)",
volume = "1",
pages = "958--969",
journal = "Proceedings of the VLDB Endowment",
issn = "2150-8097",
publisher = "Very Large Data Base Endowment Inc.",
number = "1",

}

Scheduling shared scans of large data files. / Agrawal, Parag; Kifer, Daniel; Olston, Christopher.

In: Proceedings of the VLDB Endowment, Vol. 1, No. 1, 2008, p. 958-969.

Research output: Contribution to journalArticle

TY - JOUR

T1 - Scheduling shared scans of large data files

AU - Agrawal, Parag

AU - Kifer, Daniel

AU - Olston, Christopher

PY - 2008

Y1 - 2008

N2 - We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files. As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.

AB - We study how best to schedule scans of large data files, in the presence of many simultaneous requests to a common set of files. The objective is to maximize the overall rate of processing these files, by sharing scans of the same file as aggressively as possible, without imposing undue wait time on individual jobs. This scheduling problem arises in batch data processing environments such as Map-Reduce systems, some of which handle tens of thousands of processing requests daily, over a shared set of files. As we demonstrate, conventional scheduling techniques such as shortest-job-first do not perform well in the presence of cross-job sharing opportunities. We derive a new family of scheduling policies specifically targeted to sharable workloads. Our scheduling policies revolve around the notion that, all else being equal, it is good to schedule nonsharable scans ahead of ones that can share IO work with future jobs, if the arrival rate of sharable future jobs is expected to be high. We evaluate our policies via simulation over varied synthetic and real workloads, and demonstrate significant performance gains compared with conventional scheduling approaches.

UR - http://www.scopus.com/inward/record.url?scp=72249107952&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=72249107952&partnerID=8YFLogxK

U2 - 10.14778/1453856.1453960

DO - 10.14778/1453856.1453960

M3 - Article

AN - SCOPUS:72249107952

VL - 1

SP - 958

EP - 969

JO - Proceedings of the VLDB Endowment

JF - Proceedings of the VLDB Endowment

SN - 2150-8097

IS - 1

ER -