Cadlaws – An English–French Parallel Corpus of Legally Equivalent Documents

Authors

  • Francina Sole-Mauri, Autonomous University of Barcelona
  • Pilar Sánchez-Gijón, Autonomous University of Barcelona
  • Antoni Oliver, Open University of Catalonia

DOI:

https://doi.org/10.17533/udea.mut.v14n2a10

Keywords:

corpus construction, parallel corpus, Neural Machine Translation (NMT), English–French translation, Cadlaws

Abstract

This article presents Cadlaws, a new English–French corpus built from Canadian legal documents, and describes the corpus construction process and preliminary statistics obtained from it. The corpus contains over 16 million words in each language and is distinctive in that it is composed of documents that are legally equivalent in both languages but are not the result of translation. The corpus is built upon enactments co-drafted by two jurists to ensure the legal equality of each version and to reflect the concepts, terms and institutions of two legal traditions. The article also discusses why the corpus is defined as a parallel corpus rather than a comparable one. Cadlaws has been pre-processed for machine translation, and a baseline Bilingual Evaluation Understudy (BLEU) score, a metric that compares a candidate translation of a text to a gold-standard translation, is reported for a neural machine translation system. To the best of our knowledge, this is the largest parallel corpus of texts which convey the same meaning in this language pair, and it is freely available for non-commercial use.

Author Biographies

Francina Sole-Mauri, Autonomous University of Barcelona

Doctoral student in the Translation and Intercultural Studies program at the Autonomous University of Barcelona (UAB). Her main research areas are machine translation and computational linguistics. She is a member of the DESPITE-MT project: Describing PostEditese in Machine Translation (Ministry of Science and Innovation).

Pilar Sánchez-Gijón, Autonomous University of Barcelona

Degree in Modern Applied Languages from "Babes-Bolyai" University, Cluj-Napoca, Romania; doctorate in Specialized Translation Studies from Pompeu Fabra University, Barcelona. She is currently a permanent professor in the Department of Translation and Interpreting of the National School of Languages, Linguistics and Translation at UNAM, where she teaches translation, translation theory, documentation and terminology. In addition, she is the coordinator of the distance-learning Diploma in English-Spanish Legal Translation at UNAM. Her research interests focus on legal translation studies, information technology and documentation for translators and interpreters, lexicography and terminology applied to translation, and forensic interpreting.

Antoni Oliver, Open University of Catalonia

Associate Professor of Arts and Humanities Studies at the Open University of Catalonia (UOC) and director of the Master's degree in Translation and Technologies at this university. His main research areas are machine translation and the automatic generation of lexical and terminological resources.

Published

2021-07-13

How to Cite

Sole-Mauri, F., Sánchez-Gijón, P., & Oliver, A. (2021). Cadlaws – An English–French Parallel Corpus of Legally Equivalent Documents. Mutatis Mutandis. Revista Latinoamericana De Traducción, 14(2), 494–508. https://doi.org/10.17533/udea.mut.v14n2a10