TY - GEN
T1 - Alignment seeding strategies using contiguous pyrimidine purine matches
AU - Hou, Minmei
AU - Zhang, Louxin
AU - Harris, Robert S.
PY - 2012/11/26
Y1 - 2012/11/26
N2 - Large-scale genomic pairwise aligners usually start with a seeding procedure, which scans two sequences to obtain base matches (called hits) that follow a certain pattern (called a seed). The seed pattern and size determine the sensitivity and specificity of the seeding procedure and greatly affect alignment accuracy and computational efficiency. Much effort has been focused on obtaining an optimal spaced seed, or set of spaced seeds, to improve sensitivity. However, specificity also becomes a major concern when aligning very long genomic sequences. We present a seeding strategy that identifies contiguous pyrimidine purine (py · pu) matches. This model may improve sensitivity and specificity simultaneously compared to a contiguous base match model. We further present a seeding strategy that identifies contiguous py · pu matches containing at least a certain number of contiguous base matches. This model significantly improves sensitivity and specificity simultaneously compared to the base match model. It can also achieve better sensitivity than an optimal spaced seed without loss of specificity when the ratio of transitions to transversions is high. Our examination of the 2M-base CFTR region between human and mouse shows that this new model can have very high specificity without much loss of sensitivity compared to an optimal spaced seed. Based on the characteristics of alignments between human and other mammals (e.g., the sequence similarity, the ratio of transitions to transversions, and the lengths of gapless alignments), the new seeding strategies are promising for improving the alignment quality of a wide selection of species pairs. This paper also lays the groundwork for future work on applying spaced patterns in these seeding strategies.
AB - Large-scale genomic pairwise aligners usually start with a seeding procedure, which scans two sequences to obtain base matches (called hits) that follow a certain pattern (called a seed). The seed pattern and size determine the sensitivity and specificity of the seeding procedure and greatly affect alignment accuracy and computational efficiency. Much effort has been focused on obtaining an optimal spaced seed, or set of spaced seeds, to improve sensitivity. However, specificity also becomes a major concern when aligning very long genomic sequences. We present a seeding strategy that identifies contiguous pyrimidine purine (py · pu) matches. This model may improve sensitivity and specificity simultaneously compared to a contiguous base match model. We further present a seeding strategy that identifies contiguous py · pu matches containing at least a certain number of contiguous base matches. This model significantly improves sensitivity and specificity simultaneously compared to the base match model. It can also achieve better sensitivity than an optimal spaced seed without loss of specificity when the ratio of transitions to transversions is high. Our examination of the 2M-base CFTR region between human and mouse shows that this new model can have very high specificity without much loss of sensitivity compared to an optimal spaced seed. Based on the characteristics of alignments between human and other mammals (e.g., the sequence similarity, the ratio of transitions to transversions, and the lengths of gapless alignments), the new seeding strategies are promising for improving the alignment quality of a wide selection of species pairs. This paper also lays the groundwork for future work on applying spaced patterns in these seeding strategies.
UR - http://www.scopus.com/inward/record.url?scp=84869383949&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869383949&partnerID=8YFLogxK
U2 - 10.1145/2382936.2382985
DO - 10.1145/2382936.2382985
M3 - Conference contribution
AN - SCOPUS:84869383949
SN - 9781450316705
T3 - 2012 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2012
SP - 384
EP - 391
BT - 2012 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2012
T2 - 2012 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2012
Y2 - 7 October 2012 through 10 October 2012
ER -