An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences

Yu Zhang, Michael S. Waterman

Research output: Contribution to journal › Article

18 Citations (Scopus)

Abstract

With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to a Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.
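The abstract's starting point, turning a set of DNA sequences into a de Bruijn graph, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names `de_bruijn_edges` and `shared_edges`, the word length `k = 3`, and the toy sequences are all assumptions made for illustration. The sketch only shows how identical k-letter words from different sequences collapse onto the same edge, which is what lets conserved regions stand out in the graph.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Map each k-mer edge (prefix -> suffix) to the set of sequence
    indices that contain it. Nodes are (k-1)-mers. Hypothetical helper,
    not taken from the paper."""
    edges = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(idx)
    return dict(edges)


def shared_edges(edges, n_sequences):
    """Return the edges that occur in every input sequence."""
    return {e for e, owners in edges.items() if len(owners) == n_sequences}


# Toy input (assumed): the second sequence differs by one substitution.
seqs = ["ACGTAC", "ACGTTC", "ACGTAC"]
edges = de_bruijn_edges(seqs, 3)
common = shared_edges(edges, len(seqs))
print(sorted(common))
```

Because each sequence is scanned once, building the graph takes time proportional to the total number of letters, which is consistent in spirit with the near-linear running time the abstract reports; the alignment steps that follow in the paper are not reproduced here.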

Original language: English (US)
Pages (from-to): 803-819
Number of pages: 17
Journal: Journal of Computational Biology
Volume: 10
Issue number: 6
DOI: 10.1089/106652703322756096
State: Published - Dec 1 2003

Fingerprint

DNA sequences
DNA Sequence
Alignment
Path
Sequence Alignment
Microcomputers
DNA Sequence Analysis
Arabidopsis
Motivation
Genome
DNA
Fragment
Heuristic algorithms
De Bruijn Graph
Personal computers
DNA Sequencing
Multiple Sequence Alignment
Personal Computer
Genes
Heuristic algorithm

All Science Journal Classification (ASJC) codes

  • Modeling and Simulation
  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Computational Theory and Mathematics

Cite this

Zhang, Yu ; Waterman, Michael S. / An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences. In: Journal of Computational Biology. 2003 ; Vol. 10, No. 6. pp. 803-819.
@article{79c13f81c4254fd09881fac85789fd61,
title = "An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences",
author = "Zhang, Yu and Waterman, {Michael S.}",
year = "2003",
month = "12",
day = "1",
doi = "10.1089/106652703322756096",
language = "English (US)",
volume = "10",
pages = "803--819",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "6",

}

TY - JOUR
T1 - An Eulerian Path Approach to Global Multiple Alignment for DNA Sequences
AU - Zhang, Yu
AU - Waterman, Michael S.
PY - 2003/12/1
Y1 - 2003/12/1
UR - http://www.scopus.com/inward/record.url?scp=0742304251&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0742304251&partnerID=8YFLogxK
U2 - 10.1089/106652703322756096
DO - 10.1089/106652703322756096
M3 - Article
C2 - 14980012
AN - SCOPUS:0742304251
VL - 10
SP - 803
EP - 819
JO - Journal of Computational Biology
JF - Journal of Computational Biology
SN - 1066-5277
IS - 6
ER -