Cadlaws: An English-French Parallel Corpus of Legally Equivalent Documents
DOI:
https://doi.org/10.17533/udea.mut.v14n2a10
Keywords:
corpus building, parallel corpus, neural machine translation (NMT), English-French, Cadlaws
Abstract
This article presents Cadlaws, a new English-French corpus built from Canadian legal documents. It describes how the corpus was constructed and reports the preliminary statistics obtained from it. The corpus contains more than 16 million words in each language and has unique characteristics: it consists of documents that are legally equivalent in both languages, with each language version serving as a source text. The corpus is based on enactments drafted jointly by two jurists to guarantee the legal equivalence of each version and to reflect the concepts, terms, and institutions of two legal traditions. The article also discusses why the corpus is defined as a parallel corpus rather than a comparable one. Cadlaws has been preprocessed for machine translation, and a baseline bilingual evaluation understudy (BLEU) score is reported for a neural machine translation system; BLEU is a score that compares a candidate translation of a text against a translation treated as the gold-standard reference. To the best of our knowledge, this is the largest parallel corpus of texts conveying the same meaning in this language pair, and it is freely available for non-commercial use.
License
Copyright 2021 Mutatis Mutandis. Revista Latinoamericana de Traducción
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- The journal holds the copyright to the articles, which are simultaneously subject to the Creative Commons Attribution-NonCommercial-NoDerivatives License, which allows third parties to share the work provided that its author and its first publication in this journal are acknowledged.
- Authors may enter into other non-exclusive licensing agreements for distributing the published version of the work (e.g., depositing it in an institutional repository or publishing it in a monographic volume), provided that its initial publication in this journal is acknowledged.
- Authors are permitted and encouraged to share their work online (e.g., in institutional repositories or on their own website) before and during the submission process, as this can lead to productive exchanges and increase citations of the published work.