Cadlaws - An English-French Parallel Corpus of Legally Equivalent Documents

Authors

  • Francina Sole-Mauri Autonomous University of Barcelona
  • Pilar Sánchez-Gijón Autonomous University of Barcelona
  • Antoni Oliver Open University of Catalonia

DOI:

https://doi.org/10.17533/udea.mut.v14n2a10

Keywords:

corpus building, parallel corpus, neural machine translation (NMT), English, French, Cadlaws

Abstract

This article presents Cadlaws, a new English-French corpus built from Canadian legal documents. It describes how the corpus was built and reports preliminary statistics obtained from it. The corpus contains more than 16 million words in each language and has a unique property: it is composed of documents that are legally equivalent in both languages but are not the result of translation. It is built from texts co-drafted by two jurists in order to guarantee the legal equality of each version and to reflect the concepts, terms and institutions of two legal traditions. The article also discusses why the corpus is defined as a parallel corpus rather than a comparable one. Cadlaws has been preprocessed for machine translation, and a BLEU (Bilingual Evaluation Understudy) score is reported for a neural machine translation system; BLEU is a metric that compares a candidate translation with a reference translation. To our knowledge, this is the largest parallel corpus of texts conveying the same meaning in this language pair, and it is freely available for non-commercial use.

Author Biographies

Francina Sole-Mauri, Autonomous University of Barcelona

PhD candidate in the doctoral programme in Translation and Intercultural Studies at the Autonomous University of Barcelona (UAB). Main research areas: machine translation and computational linguistics. Member of the DESPITE-MT project: Describing Post-Editese in Machine Translation (Ministry of Science and Innovation).

Pilar Sánchez-Gijón, Autonomous University of Barcelona

Graduate in Applied Modern Languages from Babeș-Bolyai University, Cluj-Napoca, Romania; PhD in Specialised Translation Studies from Pompeu Fabra University, Barcelona. She is currently a tenured professor in the Department of Translation and Interpreting of the National School of Languages, Linguistics and Translation at UNAM, where she teaches courses in translation, translation theory, documentation and terminology. She also coordinates UNAM's distance-learning Diploma in English-Spanish Legal Translation. Her research covers legal translation studies, computing and documentation tools for translators and interpreters, lexicography and terminology applied to translation, and court interpreting.

Antoni Oliver, Open University of Catalonia

Associate professor in Arts and Humanities Studies at the Open University of Catalonia (UOC) and director of the university's master's degree in Translation and Technologies. Main research areas: machine translation and the automatic generation of lexical and terminological resources.

Published

2021-07-13

How to Cite

Sole-Mauri, F., Sánchez-Gijón, P., & Oliver, A. (2021). Cadlaws - An English-French Parallel Corpus of Legally Equivalent Documents. Mutatis Mutandis. Revista Latinoamericana De Traducción, 14(2), 494–508. https://doi.org/10.17533/udea.mut.v14n2a10