FastRNABindR: Fast and accurate prediction of protein-RNA interface residues

Yasser El-Manzalawy, Mostafa Abbas, Qutaibah Malluhi, Vasant Honavar

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses, are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles, as measured by the number of hits used to generate the profile, the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data, and the accuracy of the machine learning classifier trained using these profiles. Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated from 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only online protein-RNA interface residue prediction server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential to substantially speed up, and hence increase the practical utility of, other amino acid sequence-based predictors of protein-protein and protein-DNA interfaces.
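
The random-sampling idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the 1% sample was drawn, so the reservoir-sampling routine below is an assumption rather than the authors' implementation, and the names read_fasta, reservoir_sample, and write_fasta are hypothetical helpers introduced only for this illustration. It draws a fixed-size subset of FASTA records uniformly at random in a single pass, so the full sequence collection never has to be held in memory.

```python
import random
from typing import Iterable, Iterator, List, TextIO, Tuple

# A FASTA record as (header without the leading '>', sequence).
FastaRecord = Tuple[str, str]


def read_fasta(lines: Iterable[str]) -> Iterator[FastaRecord]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reservoir_sample(records: Iterable[FastaRecord], k: int,
                     seed: int = 0) -> List[FastaRecord]:
    """Return k records chosen uniformly at random without replacement.

    Classic reservoir sampling (Algorithm R): one pass, O(k) memory.
    If the input holds fewer than k records, all of them are returned.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir: List[FastaRecord] = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def write_fasta(records: Iterable[FastaRecord], handle: TextIO,
                width: int = 60) -> None:
    """Write records in FASTA format, wrapping sequences at `width` columns."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")
```

For a 1% sample, k would be set to roughly one hundredth of the number of sequences in the input file, counted in a prior pass. The sampled FASTA file could then be formatted as a BLAST database and used for PSSM generation; the exact tooling and parameters are not given in the abstract.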

Original language: English (US)
Article number: e0158445
Journal: PLoS One
Volume: 11
Issue number: 7
DOI: 10.1371/journal.pone.0158445
State: Published - Jul 1 2016

Fingerprint

RNA
prediction
Position-Specific Scoring Matrices
Proteins
amino acid sequences
Protein Databases
artificial intelligence
Interfaces (computer)
gene expression regulation
protein structure
Servers
Biological Phenomena
Virus Assembly
RNA Viruses
protein synthesis
viruses
Amino Acid Sequence

All Science Journal Classification (ASJC) codes

  • Biochemistry, Genetics and Molecular Biology (all)
  • Agricultural and Biological Sciences (all)
  • General

Cite this

El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant. / FastRNABindR: Fast and accurate prediction of protein-RNA interface residues. In: PLoS One. 2016; Vol. 11, No. 7.
@article{771cf24355ba44ec9ccf5ac72c84bda6,
title = "FastRNABindR: Fast and accurate prediction of protein-RNA interface residues",
author = "Yasser El-Manzalawy and Mostafa Abbas and Qutaibah Malluhi and Vasant Honavar",
year = "2016",
month = "7",
day = "1",
doi = "10.1371/journal.pone.0158445",
language = "English (US)",
volume = "11",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "7",

}

FastRNABindR: Fast and accurate prediction of protein-RNA interface residues. / El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant.

In: PLoS One, Vol. 11, No. 7, e0158445, 01.07.2016.

TY - JOUR

T1 - FastRNABindR

T2 - Fast and accurate prediction of protein-RNA interface residues

AU - El-Manzalawy, Yasser

AU - Abbas, Mostafa

AU - Malluhi, Qutaibah

AU - Honavar, Vasant

PY - 2016/7/1

Y1 - 2016/7/1

UR - http://www.scopus.com/inward/record.url?scp=84978922991&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978922991&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0158445

DO - 10.1371/journal.pone.0158445

M3 - Article

C2 - 27383535

AN - SCOPUS:84978922991

VL - 11

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 7

M1 - e0158445

ER -