Similarity detection based on document matrix model and edit distance algorithm

Artur Niewiarowski

doi:10.24423/cames.277

Authors

Artur Niewiarowski Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Kraków, Poland

Abstract

This paper presents a new algorithm with an objective of analyzing the similarity measure between two text documents. Specifically, the main idea of the implemented method is based on the structure of the so-called “edit distance matrix” (similarity matrix). Elements of this matrix are filled with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. Additionally, the proposed algorithm is fast, precise, and may be implemented for analyzing very large documents (e.g., books, diploma works, newspapers, etc.). Moreover, it seems to be versatile for the most common European languages such as Polish, English, German, French and Russian. The presented tool is intended for all employees and students of the university to detect the level of similarity regarding analyzed documents. Results obtained in the paper were confirmed in the tests shown in the article.

Keywords:

plagiarism detection, plagiarism system, edit distance, Levenshtein distance, similarity measure, text mining, information retrieval

References

[1] S. Albitar, S. Fournier, B. Espinasse. An effective TF/IDF-based text-to-text semantic similarity measure for text classification. In: Proceedings of the Web information systems engineering – WISE 2014. Part I, pp. 105–114, 15th International Conference, Thessaloniki, Greece, October 12–14, 2014.

[2] J. Dawson. Suffix removal and word conflation. ALLC Bulletin, 2(3): 33–46, 1974.

[3] A. Dziob, M. Piasecki. Implementation of the verb model in plWordNet 4.0. In: Proceedings of the 9th Global Wordnet Conference, pp. 114–123, Singapore, 8–12 January 2018.

[4] R. Gabrys, E. Yaakobi, O. Milenkovic. Codes in the Damerau distance for DNA storage. In: 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 2644–2648, 2016, https://doi.org/10.1109/ISIT.2016.7541778

[5] R. Krovetz. Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, Pittsburgh, USA, pp. 191–202, 1993.

[6] V.I. Levenshtein. Binary codes for correcting dropouts, inserts, and symbol substitutions [in Russian]. Reports of the Academy of Sciences of the USSR, 163(4): 845–848, 1965.

[7] J. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1–2): 22–31, 1968.

[8] M.M. Maulana, R. Arifudin, A. Alamsyah. Autocomplete and spell checking Levenshtein distance algorithm to getting Text Suggest Error Data Searching in Library. Scientific Journal of Informatics, 5(1): 67–75, 2018.

[9] M. Maziarz, M. Piasecki, E. Rudnicka. plWordnet – the proces of making a thesaurus [in Polish: Słowosieć – polski wordnet. Proces tworzenia tezaurusa]. Polonica, 34: 79–98, 2014.

[10] A. Niewiarowski. Short text similarity algorithm based on the edit distance and thesaurus [in Polish: Algorytm podobieństwa krótkich fragmentów tekstów oparty na odległości edycyjnej i słowniku wyrazów bliskoznacznych]. Czasopismo Techniczne. Nauki Podstawowe/Technical Transactions/Fundamental Sciences. 1: 159–173, 2016.

[11] M. Piasecki, T. Walkowiak, M. Eder. Open stylometric system WebSty: integrated language processing, analysis and visualisation. Computational Methods in Science and Technology, 24(1): 43–58, 2018.

[12] M.F. Porter. An algorithm for suffix stripping. Program: electronic library and information systems, 14(3): 130–137, 1980.

[13] T.A. Runkler, J.C. Bezdek. Web mining with relational clustering: International Journal of Approximate Reasoning, 32(2–3): 217–236, 2003.

[14] R.S. Sandeep, Ch. Vinay, M.S. Hemant. Strength and accuracy analysis of affix removal stemming algorithms. International Journal of Computer Science and Information Technologies, 4(2): 265–269, 2013.

[15] V. Sandulescu, M. Ester. Detecting Singleton review spammers using semantic similarity. In: Proceeding WWW’15 Companion Proceedings of the 24th International Conference on World Wide Web, pp. 971–976, May 18–22, 2015, Florence, Italy.

[16] B. Wahiba, K. Abdessalem. A new stemmer to improve information retrieval. International Journal of Network Security & Its Applications, 5(4): 143–154, 2013.

[17] D. Weiss. A survey of freely available Polish stemmers and evaluation of their applicability in information retrieval. In: 2nd Language and Technology Conference, pp. 216–221, Poznań, Poland, 2005.

[18] D. Weiss. Stempelator: A hybrid stemmer for the Polish language. Research Report RA-002/05. Institute of Computing Science, Poznań University of Technology, Poland, 2005.

Online first
Accepted manuscripts
2026, Vol 33
	No 2	No 1
2025, Vol 32
	No 1	No 2	No 3	No 4
2024, Vol 31
	No 1	No 2	No 3	No 4
2023, Vol 30
	No 1	No 2	No 3	No 4
2022, Vol 29
	No 1-2		No 3	No 4
2021, Vol 28
	No 1	No 2	No 3	No 4
2020, Vol 27
	No 1	No 2-3		No 4
2019, Vol 26
	No 1	No 2	No 3-4
2018, Vol 25
	No 1	No 2-3		No 4
2017, Vol 24
	No 1	No 2	No 3	No 4
2016, Vol 23
	No 1	No 2-3		No 4
2015, Vol 22
	No 1	No 2	No 3	No 4
2014, Vol 21
	No 1	No 2	No 3-4
2013, Vol 20
	No 1	No 2	No 3	No 4
2012, Vol 19
	No 1	No 2	No 3	No 4
2011, Vol 18
	No 1-2		No 3	No 4
2010, Vol 17
	No 1	No 2/3/4
2009, Vol 16
	No 1	No 2	No 3-4
2008, Vol 15
	No 1	No 2	No 3-4
2007, Vol 14
	No 1	No 2	No 3	No 4
2006, Vol 13
	No 1	No 2	No 3	No 4
2005, Vol 12
	No 1	No 2-3		No 4
2004, Vol 11
	No 1	No 2-3		No 4
2003, Vol 10
	No 1	No 2	No 3	No 4
2002, Vol 9
	No 1	No 2	No 3	No 4
2001, Vol 8
	No 1	No 2-3		No 4
2000, Vol 7
	No 1	No 2	No 3	No 4
1999, Vol 6
	No 1	No 2	No 3-4
1998, Vol 5
	No 1	No 2	No 3	No 4
1997, Vol 4
	No 1	No 2	No 3-4
1996, Vol 3
	No 1	No 2	No 3	No 4
1995, Vol 2
	No 1	No 2	No 3	No 4
1994, Vol 1
	No 1-2		No 3-4

Similarity detection based on document matrix model and edit distance algorithm

Downloads

Authors

Abstract

Keywords:

References

Other articles by the same author(s)

cover

ippt-pan

Issue

Pages

Section

DOI

Received

Accepted

Published

License

How to Cite

Principal Contact

Address

Support Contact