Neural Network Models of a Grammar Parser for the Kalmyk Language: Training Experience
https://doi.org/10.22162/2500-1523-2025-2-371-390
Abstract
Introduction. The Kalmyk language presents unique challenges for NLP because of its rich agglutinative morphology and the limited resources available. The objective is to examine several neural network models for grammatical analysis of the Kalmyk language.
Materials and Methods. Several neural network models were trained and compared using the following metrics: Lemma Accuracy, Levenshtein Lemma Distance, Morph Accuracy, and Morph F1. Methods of neural network model training, analysis, and comparison were used. The dataset consisted of a training part of 2,495 sentences (35,049 tokens), a validation part of 311 sentences (3,991 tokens), and a test part of 313 sentences (3,627 tokens).
Results. This paper proposes a high-performing morphological analyzer for the Kalmyk language based on neural network techniques. The analyzer jointly predicts a lemma and morphological tags for each word in a sentence. Because data are scarce, morphological analyzers for low-resource languages often rely on rule-based and statistical approaches, and few studies use deep learning. First, our model takes as input character-based word embeddings and contextual embeddings generated by the pretrained cross-lingual model XLM-RoBERTa. Second, the proposed model is based on a sequential architecture that takes surface words as input and predicts minimum edit actions between surface words and lemmas instead of predicting the characters of the lemmas. Third, our system requires neither pretrained embeddings for the Kalmyk language nor additional morphological segmentation tools. We conducted several experiments to show that our model outperforms other models.
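The idea of predicting minimum edit actions between a surface word and its lemma can be illustrated with a small sketch. The abstract does not specify the action inventory or the alignment procedure, so everything here is an assumption: the four action names (copy, replace, delete, insert), the helper functions `edit_actions`, `apply_actions`, and `edit_distance`, and the English example word are hypothetical and are not taken from the paper or its data. The sketch derives a minimum-cost action sequence with the standard Levenshtein dynamic program; the number of non-copy actions equals the Levenshtein distance, which is also the quantity behind a lemma-distance metric.

```python
from typing import List, Tuple

# Hypothetical action format: (operation, character).
# The character is "" for "delete", which needs no argument.
Action = Tuple[str, str]


def edit_actions(surface: str, lemma: str) -> List[Action]:
    """Return a minimum-cost sequence of character edits turning surface into lemma."""
    n, m = len(surface), len(lemma)
    # dist[i][j] = Levenshtein distance between surface[:i] and lemma[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if surface[i - 1] == lemma[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete surface[i-1]
                dist[i][j - 1] + 1,         # insert lemma[j-1]
                dist[i - 1][j - 1] + cost,  # copy or replace
            )

    # Walk back from the bottom-right cell to recover one optimal action sequence.
    actions: List[Action] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = surface[i - 1] == lemma[j - 1]
            if dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
                actions.append(("copy" if same else "replace", lemma[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            actions.append(("delete", ""))
            i -= 1
        else:
            actions.append(("insert", lemma[j - 1]))
            j -= 1
    actions.reverse()
    return actions


def apply_actions(surface: str, actions: List[Action]) -> str:
    """Rebuild the lemma by applying the actions to the surface word left to right."""
    out: List[str] = []
    pos = 0  # current position in the surface word
    for op, ch in actions:
        if op == "copy":
            out.append(surface[pos])
            pos += 1
        elif op == "replace":
            out.append(ch)
            pos += 1
        elif op == "delete":
            pos += 1
        elif op == "insert":
            out.append(ch)
    return "".join(out)


def edit_distance(actions: List[Action]) -> int:
    """Levenshtein distance is the number of non-copy actions."""
    return sum(1 for op, _ in actions if op != "copy")


if __name__ == "__main__":
    acts = edit_actions("walked", "walk")
    print(acts)
    print(apply_actions("walked", acts), edit_distance(acts))
```

In a tagger of this kind, a model would predict such actions for each word rather than generating the lemma character by character, which keeps the output space small when most characters are simply copied.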
About the Authors
Abina D. Kukanova
Russian Federation
Junior Research Associate
Viktoria V. Kukanova
Russian Federation
For citations:
Kukanova A., Kukanova V. Neural Network Models of a Grammar Parser for the Kalmyk Language: Training Experience. Mongolian Studies. 2025;17(2):371-390. (In Russ.) https://doi.org/10.22162/2500-1523-2025-2-371-390