Abstract
This publication presents the results of a study on text similarity between Belarusian and Ukrainian, utilizing a matrix-based analysis method grounded in edit distance. A distinctive feature of this approach is the absence of language-specific vocabulary rules, highlighting the algorithm’s linguistic universality in similarity analysis. The analyzed texts were sourced from excerpts of online encyclopedias, translated using AI-powered online translation services provided by well-known companies. The primary objective of this study is to determine whether it is possible to compare texts written in these languages without prior translation into a common language. Additionally, it aims to assess whether a method that does not belong to the large language model (LLM) family or the broader category of AI-based approaches can effectively compare languages within the same linguistic group. Furthermore, the study provides insights into the degree of similarity between Belarusian and Ukrainian, investigating the extent to which speakers of one language might partially understand the other.
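The idea of a language-agnostic, edit-distance-based comparison can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the length-normalized word score, and the "best match per row" aggregation are all assumptions made for illustration. It only shows how a word-by-word matrix of Levenshtein-based similarities can be built without any language-specific vocabulary rules.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def word_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical words, 0.0 for fully different."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(text_a: str, text_b: str) -> List[List[float]]:
    """Rows are words of text_a, columns are words of text_b (hypothetical layout)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return [[word_similarity(wa, wb) for wb in words_b] for wa in words_a]


def text_similarity(text_a: str, text_b: str) -> float:
    """Assumed aggregation: average of the best match found for each word of text_a."""
    matrix = similarity_matrix(text_a, text_b)
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)


if __name__ == "__main__":
    belarusian = "беларуская мова"  # illustrative two-word sample
    ukrainian = "білоруська мова"   # illustrative two-word sample
    for row in similarity_matrix(belarusian, ukrainian):
        print([round(value, 2) for value in row])
    print(round(text_similarity(belarusian, ukrainian), 2))
```

Because the score depends only on character edits, the same code runs unchanged on any pair of texts that share a script, which is the property the abstract refers to as linguistic universality.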
Keywords:
text-mining, anti-plagiarism, text similarity analysis, Levenshtein edit distance, matrix-based text analysis, Belarusian language, Ukrainian language, East Slavic language group, Old Russian language, Indo-European language, Bielaruskaja mova, Ukrainska mova
Additional Online Resources
A1. A full excerpt from the text is available at https://antyplagius.n-dms.com/tests/Spanish-Romanian/Espania-Spanish-wikipedia-google-translate.txt.
A2. A full excerpt from the text is available at https://antyplagius.n-dms.com/tests/Spanish-Romanian/Espania-Romanian-wikipedia-google-translate.txt.
A3. N-DMS, Belarusian and Ukrainian – an analysis of similarities. Antyplagiat N-DMS Antyplagius [in Polish], YouTube, 06.01.2022, https://youtu.be/d6o3QAQDWPk.
A4. N-DMS ANTYPLAGIUS, Belarus – Wikipedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Belarus-Wikipedia.pdf.
A5. Wikipedia, Belarus, https://be.wikipedia.org/wiki/%D0%91%D0%B5%D0%BB%D0%B0%D1%80%D1%83%D1%81%D1%8C.
A6. N-DMS ANTYPLAGIUS, Belarusian-Belarus, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Belarusian-Belarus.txt.
A7. N-DMS ANTYPLAGIUS, Ukrainian-Belarus, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Ukrainian-Belarus.txt.
A8. N-DMS ANTYPLAGIUS, EN Fragment – Ukraine – Britannica Online Encyclopedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/EN%20FRAGMENT%20-%20Ukraine%20.–%20Britannica%20Online%20Encyclopedia.txt.
A9. N-DMS ANTYPLAGIUS, Ukraine – Britannica Online Encyclopedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/Ukraine%20.–%20Britannica%20Online%20Encyclopedia.pdf.
A10. Ukraine, Britannica, https://www.britannica.com/place/Ukraine.
A11. N-DMS ANTYPLAGIUS, Chapter “Plant and animal life” in English from the encyclopaedia, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/ENG%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A12. N-DMS ANTYPLAGIUS, Belarusian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/BY%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A13. N-DMS ANTYPLAGIUS, Ukrainian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/UA%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A14. N-DMS ANTYPLAGIUS, Bulgarian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/BUL%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A15. N-DMS ANTYPLAGIUS, Serbian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/SR%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A16. N-DMS ANTYPLAGIUS, Macedonian, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/MK%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A17. N-DMS ANTYPLAGIUS, Kazakh, https://antyplagius.n-dms.com/tests/Belarusian-Ukrainian/CYR/KZ%20-%20Brit%20-%20Animals%20-%20google%20translate.txt.
A18. N-DMS, Antyplagius vs chatGPT-4 – Review of the film Troy [in Polish], YouTube, 10.04.2023, https://youtu.be/. ejk1xTPDDQ.
A19. N-DMS, Antyplagius vs chatGPT (part 1), YouTube, 27.04.2023, https://youtu.be/PxrVB9AwcR0.
A20. N-DMS, Are the Spanish and Romanian languages similar to each other? Test using the Antyplagiat N-DMS Antyplagius [in Polish], YouTube, 22.01.2022, https://youtu.be/JhfdwbyIsFc.