copied-document-checker

Finds documents that are likely to have been copied from one another.


Keywords
copied, documents, plagiarism, plagiarize
License
MIT
Install
pip install copied-document-checker==1.2

Documentation

copied_document_checker

Description

It detects documents that are likely to be copies of one another among the documents in a folder.
[NOTE] This package only accepts files with the extensions '.doc', '.docx' (MS Word files), and '.pdf'.
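
Because only these three extensions are accepted, it can help to check a folder before running the checker. The sketch below is not part of the package: the helper name split_by_support is hypothetical, and the case-insensitive match on the extension is an assumption about how files should be treated.

```python
from pathlib import Path

# Extensions named in the note above.
SUPPORTED = {".doc", ".docx", ".pdf"}


def split_by_support(folder):
    """Return (supported, skipped) file names found directly in `folder`.

    Hypothetical helper, not part of copied_document_checker.
    Assumption: extensions are compared case-insensitively.
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    supported = sorted(p.name for p in files if p.suffix.lower() in SUPPORTED)
    skipped = sorted(p.name for p in files if p.suffix.lower() not in SUPPORTED)
    return supported, skipped
```

Calling split_by_support(example_path) before the checker shows which files would fall outside the accepted formats, so they can be converted or removed first.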

Installation

pip install copied-document-checker

Dependencies

numpy, pandas, matplotlib, scikit-learn, pdfminer.six, docx, comtypes

Quick Start

import os
import copied_document_checker
from copied_document_checker import copied_doc_checker 

# Path of the directory (folder) that contains the document files you want to inspect.
example_path = os.path.dirname(copied_document_checker.__file__) + '/students_homeworks_example'  # you can use your own directory
print('\n# example_path: ', example_path, end='\n\n')

# Run the checker.
checker = copied_doc_checker.CopiedDocumentChecker(example_path)
checker.run(n_top_likely=15)  # number of documents that are the most likely to be copied

Underlying Algorithms/Knowledge

Document parsing: n-gram parsing, Bag Of Words (BOW)
Measuring similarity: Euclidean distance (modified by adding penalties for the matched word counts)
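
To make the two ideas above concrete, here is a minimal sketch using only the standard library. It builds a bag of words over word n-grams and ranks document pairs by plain Euclidean distance. It is not the package's implementation: the function names, the sample texts, the choice of 1- and 2-grams, and the tokenization are all assumptions, and the package's additional penalty for matched word counts is not reproduced because its formula is not given here.

```python
import math
import re
from collections import Counter
from itertools import combinations


def ngram_counts(text, n_max=2):
    """Bag of words over word n-grams of length 1..n_max (assumed range)."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def euclidean(a, b):
    """Plain Euclidean distance between two sparse count vectors."""
    # Counter returns 0 for missing keys, so absent n-grams count as zero.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a.keys() | b.keys()))


# Made-up sample texts standing in for parsed documents.
docs = {
    "a.docx": "the quick brown fox jumps over the lazy dog",
    "b.docx": "the quick brown fox jumps over the lazy cat",
    "c.pdf": "an entirely different essay about rainy weather",
}
bags = {name: ngram_counts(text) for name, text in docs.items()}

# Smaller distance means more similar, so the first pair is the top suspect.
ranked = sorted(
    combinations(bags, 2),
    key=lambda pair: euclidean(bags[pair[0]], bags[pair[1]]),
)
most_similar = ranked[0]
print(most_similar)
```

Here the pair with the smallest distance is the one most likely to be copied, which is the role the n_top_likely argument plays in the Quick Start, although the package's actual ranking may differ because of its penalty term.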