Creating a Parallel Corpora for Turkish-English Academic Translations

İlhami Sel; Hüseyin Üzen; Davut Hanbay

doi:10.53070/bbd.990959

Research Article

Year 2021, Volume: IDAP-2021 : 5th International Artificial Intelligence and Data Processing symposium Issue: Special, 335 - 340, 20.10.2021

Cited By: 2

Abstract

Parallel corpora are data sets that pair sentences with the same meaning in different languages. One of the most important factors determining the quality of a machine translation system is the availability of large, high-quality parallel corpora. Such data are generally insufficient for the Turkish-English language pair. In this study, a large parallel corpus was created for academic translation between Turkish and English. The data set was built from the abstracts of postgraduate theses. The best sentence matches were obtained using sentence alignment algorithms such as Vecalign and Hunalign. As a result, 1M parallel sentence pairs were obtained. In addition, a Bi-LSTM-based translation system was built to measure the quality of the resulting data. The model achieved a BLEU score of 15.8 on the TED (Tr-En) test set in a zero-shot setting.
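The abstract reports a BLEU score but does not say how it was computed. Purely as an illustration of the metric, the sketch below computes corpus-level BLEU from clipped n-gram precisions and a brevity penalty. The function names `ngrams` and `corpus_bleu` are hypothetical, and whitespace tokenization, a single reference per sentence, and the absence of smoothing are simplifying assumptions, not the authors' evaluation setup.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions: whitespace tokenization, uniform n-gram weights, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any order has no match

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["the cat sat on the mat today"]
    refs = ["the cat sat on the mat"]
    print(round(corpus_bleu(hyps, refs), 2))
```

Scores from a hand-rolled implementation like this are not comparable across papers, because tokenization and smoothing choices change the result. For reported numbers, a standardized tool such as sacreBLEU (Post 2018, listed in the references) is generally preferable.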

Keywords

Parallel Corpora, Neural Machine Translation, Sentence Alignment, Natural Language Processing.

References

Artetxe, Mikel, and Holger Schwenk. 2019. “Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond.” Transactions of the Association for Computational Linguistics 7: 597–610. https://doi.org/10.1162/tacl_a_00288.
Ataman, Duygu. 2018. “Bianet: A Parallel News Corpus in Turkish, Kurdish and English,” 1–4. http://arxiv.org/abs/1805.05095.
Barrault, Loïc, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, et al. 2019. “Findings of the 2019 Conference on Machine Translation (WMT19).” Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), 1–61. https://doi.org/10.18653/v1/w19-5301.
Bawden, Rachel, Giorgio Maria Di Nunzio, Cristian Grozea, Inigo Jauregi Unanue, Antonio Jimeno Yepes, Nancy Mah, David Martinez, et al. 2020. “Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages.” Proceedings of the Fifth Conference on Machine Translation, 660–87. https://www.aclweb.org/anthology/2020.wmt-1.76.
Britz, Denny, Anna Goldie, Minh Thang Luong, and Quoc V. Le. 2017. “Massive Exploration of Neural Machine Translation Architectures.” EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings, 1442–51. https://doi.org/10.18653/v1/d17-1151.
Chaudhary, Vishrav, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. 2019. “Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings.” Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 261–66. https://doi.org/10.18653/v1/w19-5435.
El-Kishky, Ahmed, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. 2020. “CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs,” 5960–69. https://doi.org/10.18653/v1/2020.emnlp-main.480.
Haddow, Barry, and Faheem Kirefu. 2020. “PMIndia -- A Collection of Parallel Corpora of Languages of India.” http://arxiv.org/abs/2001.09907.
Johnson, Melvin, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, et al. 2017. “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.” Transactions of the Association for Computational Linguistics 5: 339–51. https://doi.org/10.1162/tacl_a_00065.
Minaee, Shervin, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. “Deep Learning Based Text Classification: A Comprehensive Review.” ArXiv 54 (3).
Pavlick, Ellie, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. “The Language Demographics of Amazon Mechanical Turk.” Transactions of the Association for Computational Linguistics 2: 79–92. https://doi.org/10.1162/tacl_a_00167.
Post, Matt. 2018. “A Call for Clarity in Reporting BLEU Scores.” Proceedings of the Third Conference on Machine Translation: Research Papers, April, 186–91. https://doi.org/10.18653/v1/W18-6319.
Qi, Ye, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan, and Graham Neubig. 2018. “When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?” NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 2: 529–35.
Schwenk, Holger, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2021. “WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia.” EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, 1351–61.
Sel, İlhami, Ali Karci, and Davut Hanbay. 2019. “Karşılıklı Bilgi Kullanılarak Metin Sınıflandırma İçin Özellik Seçimi Feature Selection for Text Classification Using Mutual Information.” 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), 18–21.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. “Neural Machine Translation of Rare Words with Subword Units.” 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers 3: 1715–25.
Thompson, Brian, and Philipp Koehn. 2020. “Vecalign: Improved Sentence Alignment in Linear Time and Space.” EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 1342–48. https://doi.org/10.18653/v1/d19-1136.
Varga, Dániel, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón. 2005. “Parallel Corpora for Medium Density Languages.” International Conference Recent Advances in Natural Language Processing, RANLP 2005: 590–96. https://doi.org/10.1075/cilt.292.32var.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems (NIPS 2017): 5999–6009.
Yang, Shuoheng, Yuxin Wang, and Xiaowen Chu. 2020. “A Survey of Deep Learning Techniques for Neural Machine Translation.” http://arxiv.org/abs/2002.07526.

Turkish title: Türkçe-İngilizce Akademik Çeviriler için Paralel Corpora Oluşturulması

There are 20 citations in total.

Details

Primary Language	English
Subjects	Artificial Intelligence
Journal Section	PAPERS
Authors	İlhami Sel 0000-0003-0222-7017 Hüseyin Üzen 0000-0002-0998-2130 Davut Hanbay 0000-0003-2271-7865
Publication Date	October 20, 2021
Submission Date	September 3, 2021
Acceptance Date	September 16, 2021
Published in Issue	Year 2021 Volume: IDAP-2021 : 5th International Artificial Intelligence and Data Processing symposium Issue: Special

Cite

APA	Sel, İ., Üzen, H., & Hanbay, D. (2021). Creating a Parallel Corpora for Turkish-English Academic Translations. Computer Science, IDAP-2021 : 5th International Artificial Intelligence and Data Processing symposium(Special), 335-340. https://doi.org/10.53070/bbd.990959

Cited By

Hybrid 3D/2D Complete Inception Module and Convolutional Neural Network for Hyperspectral Remote Sensing Image Classification

Neural Processing Letters

https://doi.org/10.1007/s11063-022-10929-z

Hybrid 3D Convolution and 2D Depthwise Separable Convolution Neural Network for Hyperspectral Image Classification

Balkan Journal of Electrical and Computer Engineering

https://doi.org/10.17694/bajece.1039029

The Creative Commons Attribution 4.0 International License is applied to all research papers published by JCS.

A Digital Object Identifier (DOI) is assigned for each published paper.