Winnowing sequences from a database search

Piotr Berman, Zheng Zhang, Yuri I. Wolf, Eugene V. Koonin, Webb Miller

    Research output: Contribution to conference (Paper)

    2 Citations (Scopus)

    Abstract

    In database searches for sequence similarity, matches to a distinct sequence region (e.g. a protein domain) are frequently obscured by numerous matches to another region of the same sequence. To cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of 'tolerable redundancy'. An algorithm is developed that solves the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
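The dominance criterion in the abstract can be illustrated with a small sketch. This is a naive quadratic-time baseline written only to make the definition concrete; it is not the O(N log N) algorithm the paper develops. The `(start, end, score)` tuple layout, the function names, and the choice of non-strict containment are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: (start, end, score) of an alignment on the query sequence.
Interval = Tuple[int, int, float]


def dominated_by(i: Interval, j: Interval) -> bool:
    """Return True if i is contained in j and i's score is less than j's.

    Containment is taken as non-strict here (an assumption).
    """
    return j[0] <= i[0] and i[1] <= j[1] and i[2] < j[2]


def winnow(intervals: List[Interval], k: int) -> List[Interval]:
    """Keep the intervals dominated by fewer than k other intervals.

    Naive O(N^2) pairwise comparison, for illustration only.
    """
    kept = []
    for a, i in enumerate(intervals):
        # Count how many other intervals dominate interval i.
        count = sum(
            1 for b, j in enumerate(intervals) if a != b and dominated_by(i, j)
        )
        if count < k:
            kept.append(i)
    return kept


hits = [(1, 100, 50), (10, 20, 10), (5, 50, 30), (60, 90, 40), (12, 18, 5)]
survivors = winnow(hits, k=2)
```

With K = 2, the hits `(10, 20, 10)` and `(12, 18, 5)` are each dominated by at least two other intervals and are discarded; the remaining three are kept.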

    Original language: English (US)
    Pages: 50-58
    Number of pages: 9
    State: Published - Jan 1 1999
    Event: Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99 - Lyon
    Duration: Apr 11 1999 to Apr 14 1999

    Other

    Other: Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99
    City: Lyon
    Period: 4/11/99 to 4/14/99

    Fingerprint

    Databases
    Redundancy
    Proteins
    Protein Domains

    All Science Journal Classification (ASJC) codes

    • Computer Science(all)
    • Biochemistry, Genetics and Molecular Biology(all)

    Cite this

    Berman, P., Zhang, Z., Wolf, Y. I., Koonin, E. V., & Miller, W. (1999). Winnowing sequences from a database search. Paper presented at Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99, Lyon, 50-58.
    @conference{8135ffe7b1924db686bc7faf8bbf8e99,
    title = "Winnowing sequences from a database search",
    abstract = "In database searches for sequence similarity, matches to a distinct sequence region (e.g. a protein domain) are frequently obscured by numerous matches to another region of the same sequence. To cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of `tolerable redundancy'. An algorithm is developed that solves the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.",
    author = "Piotr Berman and Zheng Zhang and Wolf, {Yuri I.} and Koonin, {Eugene V.} and Webb Miller",
    year = "1999",
    month = "1",
    day = "1",
    language = "English (US)",
    pages = "50--58",
    note = "Proceedings of the 1999 3rd Annual International Conference on Computational Molecular Biology, RECOMB '99 ; Conference date: 11-04-1999 Through 14-04-1999",

    }

    TY - CONF

    T1 - Winnowing sequences from a database search

    AU - Berman, Piotr

    AU - Zhang, Zheng

    AU - Wolf, Yuri I.

    AU - Koonin, Eugene V.

    AU - Miller, Webb

    PY - 1999/1/1

    Y1 - 1999/1/1

    AB - In database searches for sequence similarity, matches to a distinct sequence region (e.g. a protein domain) are frequently obscured by numerous matches to another region of the same sequence. To cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of 'tolerable redundancy'. An algorithm is developed that solves the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.

    UR - http://www.scopus.com/inward/record.url?scp=0032642169&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0032642169&partnerID=8YFLogxK

    M3 - Paper

    AN - SCOPUS:0032642169

    SP - 50

    EP - 58

    ER -
