POODLE-W is a web application for predicting which proteins are mostly disordered. Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure.

Structures are known for far more ordered proteins than disordered proteins, so the structural distribution of proteins in nature can be inferred to differ from that of proteins whose structures have been determined experimentally. Many more protein sequences are known than protein structures, and many of the known sequences can be expected to belong to disordered proteins. It is therefore useful to exploit the information in structure-unknown proteins to mitigate the sparseness of training data. POODLE-W is trained on a large number of structure-unknown sequences as well as structure-known sequences by using a spectral graph transducer.

Method

The Spectral Graph Transducer (SGT) is a binary classification algorithm based on semi-supervised learning, developed by Dr. Joachims. It constructs a k-nearest neighbor (kNN) graph with both labeled and unlabeled examples as vertices; the edge weight between two vertices represents their similarity. Cutting the graph into two subgraphs assigns every vertex, labeled or unlabeled, to one of two classes. The SGT takes into account both the prediction accuracy on labeled training data and the distribution of unlabeled data, because it cuts the kNN graph so as to minimize both the misclassification of labeled vertices and the sum of the edge weights across the cut. POODLE-W applies the SGT to the disorder prediction problem, with structure-known sequences as labeled data and structure-unknown sequences, including query sequences, as unlabeled data.
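The graph-cut idea described above can be illustrated with a small sketch. This is not the SGT as implemented in SGTlight or in POODLE-W: it is a simplified relaxation that drops the SGT's balancing constraints and minimizes the weighted squared difference across edges plus a penalty for disagreeing with the known labels, solved by Jacobi iteration. The 2-D feature vectors, the similarity `exp(-distance)`, and the parameters `k`, `c`, `ridge`, and `iters` are all assumptions chosen for illustration.

```python
import math


def knn_graph(points, k):
    """Build a symmetric kNN graph as a list of {neighbor: weight} dicts.

    An edge i-j exists if j is among the k nearest neighbors of i or
    vice versa. The weight exp(-distance) is an assumed similarity.
    """
    n = len(points)
    weights = [dict() for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            w = math.exp(-d)
            weights[i][j] = w
            weights[j][i] = w
    return weights


def transduce(weights, labels, c=10.0, ridge=1e-6, iters=500):
    """Assign +1 or -1 to every vertex of the graph.

    labels maps a vertex index to +1 or -1 (the labeled examples).
    Minimizes  sum_edges w_ij (z_i - z_j)^2 + c * sum_labeled (z_i - y_i)^2
    (plus a tiny ridge term) by Jacobi iteration, then thresholds z at 0.
    """
    n = len(weights)
    z = [float(labels.get(i, 0)) for i in range(n)]
    for _ in range(iters):
        new_z = []
        for i in range(n):
            is_labeled = i in labels
            pull = c * labels[i] if is_labeled else 0.0
            diag = sum(weights[i].values()) + (c if is_labeled else 0.0) + ridge
            neighbor_sum = sum(w * z[j] for j, w in weights[i].items())
            new_z.append((neighbor_sum + pull) / diag)
        z = new_z
    return [1 if v >= 0 else -1 for v in z]


# Hypothetical feature vectors: two well-separated groups of four points.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15),
]
# Only one example per group is labeled; the rest are unlabeled.
labels = {0: 1, 4: -1}

predictions = transduce(knn_graph(points, k=2), labels)
print(predictions)  # [1, 1, 1, 1, -1, -1, -1, -1]
```

In this toy setting the unlabeled points inherit the label of the group they are connected to, which is the behavior the semi-supervised formulation relies on: the unlabeled data shape the graph, and the few labels decide which side of the cut is which.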
The proposed method can therefore be trained on both structure-known sequences and a large number of structure-unknown sequences, and it creates a model that incorporates a larger protein structural space.

How to use

Type or paste a sequence into the form, and enter the e-mail address to which the result should be sent. The amino acid sequence must be in the standard single-letter code. Multi-FASTA format is also accepted (up to 50 sequences per submission).

Publication

K. Shimizu, Y. Muraoka, S. Hirose, K. Tomii and T. Noguchi, "Predicting mostly disordered proteins by using structure-unknown protein data", BMC Bioinformatics 2007, 8:78.

Acknowledgments

We thank Dr. T. Joachims for the SGTlight package. This research is supported by Waseda University.