Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of labelled proteins available is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods offer a powerful solution that exploits the information contained in unlabelled data to improve on the estimates of traditional supervised approaches. However, semi-supervised learning methods must make strong assumptions about the nature of the training data, and the performance of the predictor therefore depends heavily on these assumptions. This paper analyses the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, with a focus on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform traditional supervised methods and that the highest performance is reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted under each assumption.
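The abstract does not name a specific algorithm, so the following is only a toy illustration of how unlabelled points can inform a classifier. It is a minimal sketch of graph-based label propagation, a technique commonly associated with the manifold assumption, and it is not the method evaluated in the paper. The function name `propagate_labels`, the Gaussian width `sigma`, the iteration count and the six sample points are all hypothetical choices made for this example.

```python
import math

def propagate_labels(points, labels, sigma=1.0, iterations=200):
    """Graph-based label propagation (illustrative sketch, not the paper's method).

    points: list of (x, y) tuples
    labels: list holding 0 or 1 for labelled points and None for unlabelled ones
    Returns one score per point in [0, 1]; a score above 0.5 means class 1.
    """
    n = len(points)
    # Gaussian similarity: nearby points get weights near 1, distant ones near 0.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = ((points[i][0] - points[j][0]) ** 2
                      + (points[i][1] - points[j][1]) ** 2)
                weights[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Labelled points keep their label; unlabelled points start undecided.
    scores = [0.5 if lab is None else float(lab) for lab in labels]
    for _ in range(iterations):
        updated = scores[:]
        for i in range(n):
            if labels[i] is None:
                # Each unlabelled point takes the weighted mean of its neighbours.
                total = sum(weights[i])
                updated[i] = sum(w * s for w, s in zip(weights[i], scores)) / total
        scores = updated
    return scores

# Hypothetical data: two well-separated groups, one labelled example in each.
points = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (5.0, 5.0), (5.4, 4.8), (4.7, 5.3)]
labels = [0, None, None, 1, None, None]
scores = propagate_labels(points, labels)
predicted = [int(s > 0.5) for s in scores]
```

With only one labelled point per group, each unlabelled point ends up with the label of the group it sits in, because the similarity weights between the two groups are negligible. On real protein data the outcome would depend on how well the chosen features satisfy the assumption.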
Copyright (c) 2016 Revista Facultad de Ingeniería Universidad de Antioquia
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.