Similarity detection based on document matrix model and edit distance algorithm

  • Artur Niewiarowski Cracow University of Technology

Abstract

This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.

Keywords

plagiarism detection, plagiarism system, edit distance, Levenshtein distance, similarity measure, text mining, information retrieval,

References

[1] S. Albitar, S. Fournier, B. Espinasse. An effective TF/IDF-based text-to-text semantic similarity measure for text classification. In: Proceedings of the Web information systems engineering – WISE 2014. Part I, pp. 105–114, 15th International Conference, Thessaloniki, Greece, October 12–14, 2014.
[2] J. Dawson. Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46, 1974.
[3] A. Dziob, M. Piasecki. Implementation of the verb model in plWordNet 4.0. In: Proceedings of the 9th Global Wordnet Conference, pp. 114–123, Singapore, 8–12 January 2018.
[4] R. Gabrys, E. Yaakobi, O. Milenkovic. Codes in the Damerau distance for DNA storage. In: 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 2644–2648, 2016, https://doi.org/10.1109/ISIT.2016.7541778.
[5] R. Krovetz. Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, Pittsburgh, USA, pp. 191–202, 1993.
[6] V.I. Levenshtein. Binary codes for correcting dropouts, inserts, and symbol substitutions [in Russian: Dvoiqnye kody s ispravleniem vypadeni$i, vstavok i zameweni$i simvolov]. Reports of the Academy of Sciences of the USSR, 163(4): 845–848, 1965.
[7] J. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2): 22–31, 1968.
[8] M.M. Maulana, R. Arifudin, A. Alamsyah. Autocomplete and spell checking Levenshtein distance algorithm to getting Text Suggest Error Data Searching in Library. Scientific Journal of Informatics, 5(1): 67–75, 2018.
[9] M. Maziarz, M. Piasecki, E. Rudnicka. plWordnet – the proces of making a thesaurus [in Polish: Słowosiec – polski wordnet. Proces tworzenia tezaurusa]. Polonica, 34: 79–98, 2014.
[10] A. Niewiarowski. Short text similarity algorithm based on the edit distance and thesaurus [in Polish: Algorytm podobienstwa krótkich fragmentów tekstów oparty na odległosci edycyjnej i słowniku wyrazów bliskoznacznych]. Czasopismo Techniczne. Nauki Podstawowe/Technical Transactions/Fundamental Sciences. 1: 159–173, 2016.
[11] M. Piasecki, T. Walkowiak, M. Eder. Open stylometric system WebSty: integrated language processing, analysis and visualisation. Computational Methods in Science and Technology, 24(1): 43–58, 2018.
[12] M.F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3): 130–137, 1980.
[13] T.A. Runkler, J.C. Bezdek. Web mining with relational clustering: International Journal of Approximate Reasoning, 32(2–3): 217–236, 2003.
[14] R.S. Sandeep, Ch. Vinay, M.S. Hemant. Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies, 4(2): 265–269, 2013.
[15] V. Sandulescu, M. Ester. Detecting Singleton review spammers using semantic similarity. In: Proceeding WWW’15 Companion Proceedings of the 24th International Conference on World Wide Web, pp. 971–976, May 18–22, 2015, Florence, Italy.
[16] B. Wahiba, K. Abdessalem. A new stemmer to improve information retrieval. International Journal of Network Security & Its Applications, 5(4): 143–154, 2013.
[17] D. Weiss. A survey of freely available Polish stemmers and evaluation of their applicability in information retrieval. In: 2nd Language and Technology Conference, pp. 216–221, Poznan, Poland, 2005.
[18] D. Weiss. Stempelator: A hybrid stemmer for the Polish language. Research Report RA-002/05. Institute of Computing Science, Poznan University of Technology, Poland, 2005.
Published
Dec 31, 2019
How to Cite
NIEWIAROWSKI, Artur. Similarity detection based on document matrix model and edit distance algorithm. Computer Assisted Methods in Engineering and Science, [S.l.], v. 26, n. 3–4, p. 163–175, dec. 2019. ISSN 2956-5839. Available at: <https://cames.ippt.pan.pl/index.php/cames/article/view/277>. Date accessed: 19 apr. 2024. doi: http://dx.doi.org/10.24423/cames.277.
Section
Articles