Predicting runtimes of bioinformatics tools based on historical data: Five years of Galaxy usage

Anastasia Tyryshkina, Nate Coraor, Anton Nekrutenko, Jonathan Wren

Research output: Contribution to journal › Article

Abstract

One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.

Results: Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.

Availability and implementation: Source code is available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python.

Supplementary information: Supplementary data are available at Bioinformatics online.
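
The regression approach described in the abstract maps directly onto standard tooling. Below is a minimal sketch of runtime prediction with a random forest, plus a quantile-style walltime taken from the per-tree predictions. The file name, column names and quantile level are assumptions for illustration, not the paper's actual schema, and the per-tree quantile is only a rough stand-in for a true quantile regression forest (which draws on the training targets stored in each leaf).

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical job-history table; the file name and columns are
# illustrative only, not the Galaxy job run dataset's real layout.
jobs = pd.read_csv("galaxy_job_history.csv")
X = jobs[["input_size_mb", "n_cpus", "param_value"]]
y = jobs["runtime_s"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Point estimate of runtime for previously unseen jobs.
runtime_pred = forest.predict(X_test)

# Quantile-style walltime: take an upper quantile of the per-tree
# predictions. Raising q trades longer reservations for fewer premature
# kills, but the resulting interval width varies from job to job, which
# is the drawback the abstract notes for the quantile approach.
per_tree = np.stack([t.predict(X_test.to_numpy()) for t in forest.estimators_])
walltime_s = np.quantile(per_tree, 0.95, axis=0)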
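
The classifier-based alternative fixes the interval width by construction: discretize runtimes into walltime buckets, predict the bucket, and schedule each job with its bucket's upper edge. A sketch reusing the assumed frame above; the bucket edges are likewise illustrative, and swapping the target column for peak memory gives the analogous memory estimator.

from sklearn.ensemble import RandomForestClassifier

# Fixed walltime buckets in seconds; the bucket width bounds the size of
# the prediction interval by construction, unlike the quantile approach.
edges = [0, 60, 300, 1800, 7200, np.inf]
y_bucket = pd.cut(y, bins=edges, labels=False, include_lowest=True)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X, y_bucket, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xb_train, yb_train)

# Schedule each unseen job with the upper edge of its predicted bucket.
upper = np.array(edges[1:])
walltime_s = upper[clf.predict(Xb_test)]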

Original language: English (US)
Pages (from-to): 3453-3460
Number of pages: 8
Journal: Bioinformatics
Volume: 35
Issue number: 18
DOI: 10.1093/bioinformatics/btz054
State: Published - Sep 15 2019

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

Cite this

Tyryshkina, Anastasia; Coraor, Nate; Nekrutenko, Anton; Wren, Jonathan. / Predicting runtimes of bioinformatics tools based on historical data: Five years of Galaxy usage. In: Bioinformatics. 2019; Vol. 35, No. 18. pp. 3453-3460.
@article{ceb907a576ac435c820d19a869604669,
title = "Predicting runtimes of bioinformatics tools based on historical data: Five years of Galaxy usage",
abstract = "One of the many technical challenges that arises when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation leads to an inefficient use of computational infrastructure. Over allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation. Results: Here we introduced the Galaxy job run dataset and tested popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions, and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation. Availability and implementation: Source code available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python. Supplementary information: Supplementary data are available at Bioinformatics online.",
author = "Anastasia Tyryshkina and Nate Coraor and Anton Nekrutenko and Jonathan Wren",
year = "2019",
month = "9",
day = "15",
doi = "10.1093/bioinformatics/btz054",
language = "English (US)",
volume = "35",
pages = "3453--3460",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "18",
}

Predicting runtimes of bioinformatics tools based on historical data: Five years of Galaxy usage. / Tyryshkina, Anastasia; Coraor, Nate; Nekrutenko, Anton; Wren, Jonathan.

In: Bioinformatics, Vol. 35, No. 18, 15.09.2019, pp. 3453-3460.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Predicting runtimes of bioinformatics tools based on historical data

T2 - Five years of Galaxy usage

AU - Tyryshkina, Anastasia

AU - Coraor, Nate

AU - Nekrutenko, Anton

AU - Wren, Jonathan

PY - 2019/9/15

Y1 - 2019/9/15

AB - One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation. Results: Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation. Availability and implementation: Source code is available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python. Supplementary information: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=85072508575&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072508575&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btz054

DO - 10.1093/bioinformatics/btz054

M3 - Article

C2 - 30698642

AN - SCOPUS:85072508575

VL - 35

SP - 3453

EP - 3460

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 18

ER -