Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Keywords: semi-supervised learning, gene ontology, support vector machines, protein function prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods offer a powerful solution that exploits the information contained in unlabelled data to improve on the estimates of traditional supervised approaches. However, semi-supervised learning methods must make strong assumptions about the nature of the training data, and the performance of the predictor is therefore highly dependent on how well these assumptions hold. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform traditional supervised methods and that the highest performance is reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted under each assumption.
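As an illustration of how unlabelled data can carry label information under the manifold assumption, the following minimal sketch propagates two known labels over a similarity graph built from all points, labelled or not. It is not the method evaluated in the paper: the function names (`rbf_weights`, `propagate_labels`), the `sigma` parameter and the one-dimensional toy data are assumptions introduced only for this example, and real GO term prediction would use protein feature vectors rather than scalars.

```python
import math


def rbf_weights(points, sigma=1.0):
    """Pairwise Gaussian similarities between 1-D points (no self-loops)."""
    n = len(points)
    return [
        [
            0.0 if i == j
            else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate_labels(points, labels, sigma=1.0, iterations=200):
    """Iterative label propagation on a similarity graph.

    labels holds +1 or -1 for labelled points and None for unlabelled ones.
    Labelled points stay clamped to their label; each unlabelled point is
    repeatedly set to the similarity-weighted mean of its neighbours' scores.
    """
    weights = rbf_weights(points, sigma)
    scores = [float(y) if y is not None else 0.0 for y in labels]
    for _ in range(iterations):
        updated = []
        for i, y in enumerate(labels):
            if y is not None:
                updated.append(float(y))  # keep known labels fixed
                continue
            total = sum(weights[i])
            updated.append(
                sum(w * s for w, s in zip(weights[i], scores)) / total
            )
        scores = updated
    return scores


# Hypothetical toy data: two groups of points, one labelled example in each.
points = [0.0, 0.4, 0.8, 1.2, 5.0, 5.4, 5.8, 6.2]
labels = [1, None, None, None, None, None, None, -1]

scores = propagate_labels(points, labels, sigma=0.5)
predicted = [1 if s > 0 else -1 for s in scores]
print(predicted)
```

In this toy setting each unlabelled point takes the label of the group it is densely connected to, even though only one example per group is labelled. Whether such a graph structure reflects functional similarity for a given GO term is exactly the kind of assumption whose applicability the paper analyses.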
Copyright (c) 2016 Revista Facultad de Ingeniería Universidad de Antioquia
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.