On computing breakpoint distances for genomes with duplicate genes

Mingfu Shao, Bernard M.E. Moret

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of their higher-level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of non-conserved adjacencies) was the first one to be studied and remains of significant interest because of its simplicity and model-free property. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although last year we provided a solution for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this paper, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological datasets, we show that our algorithms run very fast (in seconds) on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignments, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the “any matching” formulation significantly outperforms other methods in terms of accuracy while achieving nearly maximum coverage.
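To make the abstract's definitions concrete, here is a minimal sketch of the breakpoint distance and of the exemplar formulation. It is not the authors' method: the paper uses an integer linear program, while this sketch enumerates every exemplar choice and is therefore exponential in the number of duplicates. The encoding is an assumption made for illustration: a genome is a single linear chromosome written as a list of signed integers, the absolute value names the gene family, the sign gives the orientation, and chromosome ends are not counted as adjacencies. All function names are hypothetical.

```python
from itertools import product


def adjacencies(genome):
    """Return the adjacencies of a signed linear genome in canonical form."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        # (a, b) read forward is the same adjacency as (-b, -a) read in
        # reverse, so store whichever of the two tuples sorts first.
        adj.add(min((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2.

    Assumes both genomes hold the same gene families, one copy each.
    """
    fams1 = sorted(abs(g) for g in g1)
    fams2 = sorted(abs(g) for g in g2)
    if fams1 != fams2 or len(set(fams1)) != len(fams1):
        raise ValueError("genomes must share gene content without duplicates")
    return len(adjacencies(g1) - adjacencies(g2))


def exemplars(genome):
    """Yield every genome obtained by keeping one copy per gene family."""
    positions = {}
    for i, g in enumerate(genome):
        positions.setdefault(abs(g), []).append(i)
    for choice in product(*positions.values()):
        # Preserve the original gene order among the copies that are kept.
        yield [genome[i] for i in sorted(choice)]


def exemplar_breakpoint_distance(g1, g2):
    """Brute-force minimum breakpoint distance over all exemplar pairs."""
    return min(
        breakpoint_distance(a, b)
        for a in exemplars(g1)
        for b in exemplars(g2)
    )


# Inverting the segment (2, 3) breaks two adjacencies.
print(breakpoint_distance([1, 2, 3, 4, 5], [1, -3, -2, 4, 5]))  # 2
# Keeping the second copy of gene 2 makes the genomes identical.
print(exemplar_breakpoint_distance([1, 2, 3, 2, 4], [1, 3, 2, 4]))  # 0
```

The maximum matching and any matching formulations differ only in which copies may be paired (as many as possible per family, or any nonempty subset), so the same enumeration idea applies to them, with an even larger search space.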

Original language: English (US)
Title of host publication: Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Proceedings
Editors: Mona Singh
Publisher: Springer Verlag
Pages: 189-203
Number of pages: 15
ISBN (Print): 9783319319568
DOIs: https://doi.org/10.1007/978-3-319-31957-5_14
State: Published - Jan 1 2016
Event: 20th Annual Conference on Research in Computational Molecular Biology, RECOMB 2016 - Santa Monica, United States
Duration: Apr 17 2016 to Apr 21 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9649
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 20th Annual Conference on Research in Computational Molecular Biology, RECOMB 2016
Country: United States
City: Santa Monica
Period: 4/17/16 to 4/21/16

Fingerprint

Genome
Genes
Gene
Computing
Distance Measure
Adjacency
Formulation
Coverage
Maximum Matching
Comparative Genomics
Integer Program
Exact Algorithms
Linear Program
Simplicity
Simplify
Assignment
NP-complete problem
Optimal Solution
Minimise
Computational complexity

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Shao, M., & Moret, B. M. E. (2016). On computing breakpoint distances for genomes with duplicate genes. In M. Singh (Ed.), Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Proceedings (pp. 189-203). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9649). Springer Verlag. https://doi.org/10.1007/978-3-319-31957-5_14
@inproceedings{a1bddbf981bc48778ebb3dd7711cbadf,
title = "On computing breakpoint distances for genomes with duplicate genes",
author = "Mingfu Shao and Moret, {Bernard M.E.}",
year = "2016",
month = "1",
day = "1",
doi = "10.1007/978-3-319-31957-5_14",
language = "English (US)",
isbn = "9783319319568",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "189--203",
editor = "Mona Singh",
booktitle = "Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Proceedings",
address = "Germany",

}

TY - GEN

T1 - On computing breakpoint distances for genomes with duplicate genes

AU - Shao, Mingfu

AU - Moret, Bernard M.E.

PY - 2016/1/1

Y1 - 2016/1/1

UR - http://www.scopus.com/inward/record.url?scp=84963994529&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84963994529&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-31957-5_14

DO - 10.1007/978-3-319-31957-5_14

M3 - Conference contribution

AN - SCOPUS:84963994529

SN - 9783319319568

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 189

EP - 203

BT - Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Proceedings

A2 - Singh, Mona

PB - Springer Verlag

ER -