Applicability of semi-supervised learning assumptions for gene ontology terms prediction

Jorge Alberto Jaramillo-Garzón; César Germán Castellanos-Domínguez; Alexandre Perera-Lluna

doi:10.17533/udea.redin.n79a03

Authors

Jorge Alberto Jaramillo-Garzón Instituto Tecnológico Metropolitano https://orcid.org/0000-0003-3195-7588
César Germán Castellanos-Domínguez Universidad Nacional de Colombia https://orcid.org/0000-0002-0138-5489
Alexandre Perera-Lluna Universidad Politécnica de Cataluña https://orcid.org/0000-0001-6427-851X


Keywords:

semi-supervised learning, gene ontology, support vector machines, protein function prediction

Abstract

The Gene Ontology (GO) is one of the most important resources in bioinformatics; it aims to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of labeled sequences available is in many cases insufficient to train reliable machine learning systems. Semi-supervised learning therefore appears as a powerful solution that exploits the information contained in unlabeled data in order to improve the estimates of traditional supervised approaches. However, semi-supervised methods must make strong assumptions about the nature of the training data, and the performance of the predictors is therefore highly dependent on these assumptions. This paper presents an analysis of the applicability of the different semi-supervised learning assumptions to the specific task of GO term prediction, in order to provide criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised methods significantly outperform traditional supervised methods and that the highest performances are reached when the cluster assumption is implemented. In addition, the cluster and manifold assumptions are shown experimentally to be complementary, and an analysis is given of which GO terms are more likely to be correctly predicted using each of them.
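As a rough illustration of what the cluster assumption means in practice, the following Python sketch labels unlabeled points by clustering all points together and letting the few labeled members of each cluster vote. It is not the method evaluated in the paper: the "cluster-then-label" scheme, the plain k-means routine, the function names, the 2-D toy points and the labels `has_term` / `lacks_term` are all assumptions made for this example.

```python
from collections import Counter


def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda k: dist2(point, centers[k]))


def kmeans(points, init_centers, iters=20):
    """Plain k-means; returns the final cluster index of every point."""
    centers = list(init_centers)
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return [nearest(p, centers) for p in points]


def cluster_then_label(labeled, unlabeled):
    """Give each unlabeled point the majority label of its cluster."""
    # One starting center per class: the first labeled point of that class.
    seeds = {}
    for point, label in labeled:
        seeds.setdefault(label, point)
    # Labeled points come first so their cluster indices are easy to find.
    points = [p for p, _ in labeled] + list(unlabeled)
    assign = kmeans(points, list(seeds.values()))
    # Vote inside each cluster using only its labeled members.
    votes = {}
    for (_, label), cluster in zip(labeled, assign):
        votes.setdefault(cluster, Counter())[label] += 1
    # Clusters without any labeled member fall back to the overall majority.
    fallback = Counter(label for _, label in labeled).most_common(1)[0][0]
    result = []
    for cluster in assign[len(labeled):]:
        if cluster in votes:
            result.append(votes[cluster].most_common(1)[0][0])
        else:
            result.append(fallback)
    return result


# Hypothetical toy data: two labeled points and four unlabeled ones.
labeled = [((0.0, 0.0), "has_term"), ((5.0, 5.0), "lacks_term")]
unlabeled = [(0.2, 0.1), (-0.1, 0.3), (4.8, 5.3), (5.1, 5.2)]
predicted = cluster_then_label(labeled, unlabeled)
print(predicted)
```

The sketch only works well when the classes really do form separate dense groups, which is the dependence on assumptions that the abstract points out.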


Author biographies

Jorge Alberto Jaramillo-Garzón, Instituto Tecnológico Metropolitano

Assistant professor. Automation, Electronics and Computational Sciences Group, Faculty of Engineering.

César Germán Castellanos-Domínguez, Universidad Nacional de Colombia

Full professor. Department of Electrical, Electronic and Computer Engineering, Faculty of Engineering and Architecture.

Alexandre Perera-Lluna, Universidad Politécnica de Cataluña

Biomedical Engineering Research Centre.

