### Abstract

In database searches for sequence similarity, matches to a distinct sequence region (e.g., a protein domain) are frequently obscured by numerous matches to another region of the same sequence. To cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
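The domination criterion in the abstract can be illustrated with a minimal brute-force sketch. This is not the paper's O(N log N) algorithm; it is a quadratic-time check written only to make the definition concrete. The `(start, end, score)` tuple layout, the inclusive endpoints, and the names `dominates` and `winnow` are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end, score) with inclusive endpoints.
Interval = Tuple[int, int, float]


def dominates(j: Interval, i: Interval) -> bool:
    """Return True if i is contained in j and i's score is less than j's."""
    return j[0] <= i[0] and i[1] <= j[1] and i[2] < j[2]


def winnow(intervals: List[Interval], k: int) -> List[Interval]:
    """Keep the intervals dominated by fewer than k other intervals.

    Brute force, O(N^2) time: every interval is compared with every other.
    """
    kept = []
    for idx, i in enumerate(intervals):
        count = sum(
            1
            for jdx, j in enumerate(intervals)
            if jdx != idx and dominates(j, i)
        )
        if count < k:
            kept.append(i)
    return kept


# Made-up example data: four nested hits plus one hit in a separate region.
hits = [
    (1, 100, 200),
    (10, 90, 150),
    (20, 80, 120),
    (30, 70, 90),
    (150, 200, 60),
]
survivors = winnow(hits, k=2)
```

With K = 2, the two innermost hits are dominated by at least two others and are discarded, while the low-scoring hit at positions 150-200 survives because nothing contains it.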

Original language | English (US) |
---|---|
Pages | 50-58 |
Number of pages | 9 |
State | Published - Jan 1 1999 |
Event | Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99 - Lyon, Duration: Apr 11 1999 → Apr 14 1999 |

### Other

Other | Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99 |
---|---|
City | Lyon |
Period | 4/11/99 → 4/14/99 |

### All Science Journal Classification (ASJC) codes

- Computer Science(all)
- Biochemistry, Genetics and Molecular Biology(all)

### Cite this

Berman, P., Zhang, Z., Wolf, Y. I., Koonin, E. V., & Miller, W. (1999). *Winnowing sequences from a database search*. 50-58. Paper presented at Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99, Lyon.

**Winnowing sequences from a database search.** / Berman, Piotr; Zhang, Zheng; Wolf, Yuri I.; Koonin, Eugene V.; Miller, Webb.

Research output: Contribution to conference › Paper

TY - CONF

T1 - Winnowing sequences from a database search

AU - Berman, Piotr

AU - Zhang, Zheng

AU - Wolf, Yuri I.

AU - Koonin, Eugene V.

AU - Miller, Webb

PY - 1999/1/1

Y1 - 1999/1/1

UR - http://www.scopus.com/inward/record.url?scp=0032642169&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0032642169&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:0032642169

SP - 50

EP - 58

ER -