Similarity detection based on document matrix model and edit distance algorithm

  • Artur Niewiarowski Cracow University of Technology

Abstract

This paper presents a new algorithm for measuring the similarity between two text documents. The main idea of the method is based on the structure of the so-called “edit distance matrix” (similarity matrix). The elements of this matrix are computed with a formula based on Levenshtein distances between sequences of sentences. The Levenshtein distance algorithm (LDA) is used as a replacement for various implementations of stemming or lemmatization methods. The proposed algorithm is fast and precise, and it can be applied to very large documents (e.g., books, diploma works, newspapers). Moreover, it appears to be versatile across the most common European languages, such as Polish, English, German, French and Russian. The presented tool is intended for university employees and students who need to determine the level of similarity between analyzed documents. The results are confirmed by the tests presented in the article.
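The abstract does not state the formula used to fill the matrix, so the following Python fragment is only a minimal sketch of the general idea, not the author's method. It assumes a character-level Levenshtein distance normalized by the length of the longer sentence; the function names `levenshtein`, `similarity` and `similarity_matrix` are hypothetical and introduced here for illustration.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def similarity_matrix(sentences_a: List[str], sentences_b: List[str]) -> List[List[float]]:
    """Rows follow the sentences of document A, columns those of document B."""
    return [
        [similarity(s.lower(), t.lower()) for t in sentences_b]
        for s in sentences_a
    ]
```

Under this assumed normalization, inflected forms of a word remain close to each other without any stemming step: "detect" and "detected" differ by two insertions over a length of eight, which gives a similarity of 0.75.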

Keywords

plagiarism detection, plagiarism system, edit distance, Levenshtein distance, similarity measure, text mining, information retrieval
Published
Jan 9, 2020
How to Cite
NIEWIAROWSKI, Artur. Similarity detection based on document matrix model and edit distance algorithm. Computer Assisted Methods in Engineering and Science, [S.l.], v. 26, n. 3–4, p. 163–175, jan. 2020. ISSN 2299-3649. Available at: <https://cames.ippt.pan.pl/index.php/cames/article/view/277>. Date accessed: 19 jan. 2020.
Section
Articles