Matrix Similarity Analysis of Texts Written in Belarusian and Ukrainian

Artur Niewiarowski; Anna Plichta

doi:10.24423/cames.2025.1657

Authors

Artur Niewiarowski, Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Kraków, Poland
Anna Plichta, Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Kraków, Poland

Abstract

This publication presents the results of a study on text similarity between Belarusian and Ukrainian, utilizing a matrix-based analysis method grounded in edit distance. A distinctive feature of this approach is the absence of language-specific vocabulary rules, highlighting the algorithm’s linguistic universality in similarity analysis. The analyzed texts were sourced from excerpts of online encyclopedias, translated using AI-powered online translation services provided by well-known companies. The primary objective of this study is to determine whether it is possible to compare texts written in these languages without prior translation into a common language. Additionally, it aims to assess whether a method that does not belong to the large language model (LLM) family or the broader category of AI-based approaches can effectively compare languages within the same linguistic group. Furthermore, the study provides insights into the degree of similarity between Belarusian and Ukrainian, investigating the extent to which speakers of one language might partially understand the other.
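To make the idea of a matrix built from edit distances concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each text is split on whitespace into words, that every word of one text is compared with every word of the other using the Levenshtein distance, and that the distance is normalized by the length of the longer word. The function names (`levenshtein`, `word_similarity`, `similarity_matrix`) and the normalization are illustrative assumptions; the method used in the study may differ in tokenization, weighting, and how the matrix is aggregated into a final score.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the words are identical.

    Normalizing by the longer word's length is an assumption made here.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(text_a: str, text_b: str) -> list[list[float]]:
    """Rows follow the words of text_a, columns the words of text_b."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return [[word_similarity(wa, wb) for wb in words_b] for wa in words_a]


# Illustrative phrases: "Belarusian language" and "Ukrainian language".
matrix = similarity_matrix("беларуская мова", "українська мова")
for row in matrix:
    print([round(value, 2) for value in row])
```

No dictionary or grammar rule of either language appears in the sketch: the comparison works on characters only, which is the sense in which an edit-distance approach can be applied to both languages without translation into a common one.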

Keywords:

text-mining, anti-plagiarism, text similarity analysis, Levenshtein edit distance, matrix-based text analysis, Belarusian language, Ukrainian language, East Slavic language group, Old Russian language, Indo-European language, Bielaruskaja mova, Ukrainska mova
