AUTOMATIC STRUCTURING AND CLASSIFICATION OF INFORMATION: APPLICATION TO A COLLECTION OF MEDICAL TEXTS

Jorge Morato; José Antonio Moreiro; Juan Llorens; Manuel Velasco

doi:10.17533/udea.rib.7880

Summary

A tool is described that uses a multidimensional approach to structure and classify texts. The aim is to study the different sections of a document. Filtering algorithms (n-grams) and classification algorithms (k-means and Chen) were used in developing the module. Documents were structured by means of linguistic markers, typographic markers, and statistical tools. To evaluate the method, full-text medical documents were collected from Medline, and MeSH was incorporated as a comparison tool. A statistical and comparative analysis confirmed the need for, and the validity of, this type of approach. Finally, we propose integrating the method into a module that optimizes the assignment of weights in the design of classification and document retrieval tools.

Keywords: structural linguistics, scientific discourse, text linguistics, automated documentation, scientific documentation, document analysis, medicine, cluster analysis, statistical classification.

Abstract

This study describes an automatic linguistic tool. Its goal is to analyse how different text structures behave when subjected to filtering and classification algorithms. The model structures the text by means of a multidimensional approach. On the one hand, the text is divided into sections by means of typographic constraints, semantic labels, and location rules. On the other, vocabulary related to the different text structures has been stored in the database. The text analysis algorithms implemented were the n-grams filter and the classification algorithms k-means and Chen co-wording. The module was tested using a collection of full-text documents from Medline. The methodology was evaluated by comparison with the MeSH vocabulary and by statistical analysis. The study showed some advantages of the context approach. Finally, it is proposed that structuring techniques be used to improve the success of information retrieval and classification algorithms.

Keywords: Automatic-text-structuring, discourse-model, computational-linguistics, text-analysis-methods, automatic-classification, cluster-analysis, medicine, linguistics.
