copied_document_checker
It finds out copied documents among multiple documents in a folder. Description
[NOTE] This code can only accept file extentions of '.doc', '.docx'(ms word files), '.pdf'
pip install copied-document-checker Installation
numpy, pandas, matplotlib, scikit-learn, pdfminer.six, docx, comtypes Dependencies
Quick Start
import os import copied_document_checker from copied_document_checker import copied_doc_checker
# path of the directory(folder) that contains the document files that you want to inspect. example_path = os.path.dirname(copied_document_checker.__file__) + '/students_homeworks_example' # you can put your directory print('\n# example_path: ', example_path, end='\n\n')
# run checker = copied_doc_checker.CopiedDocumentChecker(example_path) checker.run(n_top_likely=15) # number of documents that are the most likely to be copied.
Document parsing: n-gram parsing, Bag Of Words (BOW) Based Algorithms/Knowledge
Measuring similarity: euclidean distance (modified by giving additional penalties for the matched word counts)