The growing and widespread use of data-processing tools has made it easy for many scientists to recklessly copy and paste the work of other researchers, from single words up to entire paragraphs or pages, without even naming the real source. Fortunately, the same technology that lets people exploit others’ papers without breaking a sweat also provides the scientific world with a considerable variety of instruments for identifying those who have committed plagiarism.
Plagiarism Detection Methods
Plagiarism can be detected by two different strategies:
- Manual: this is obviously impracticable on a large scale, as the memory effort required is excessively onerous, particularly when the number of works to analyse is large;
- Computer-assisted: this is a rapid and effective approach. A number of schemes can be pursued, differing in the type of plagiarism they aim to detect and in the kind of comparison they perform. A general overview of the possible in silico approaches is provided in Table 2.
Table 2. Plagiarism Detection Methods
- Local Similarity Assessment
- Global Similarity Assessment
- Fingerprinting
- Term Occurrence Analysis
- Substring Matching
- Bag of Words Analysis
- Citation-Based Plagiarism Detection
- Citation Pattern Analysis
- Stylometry
Essentially, there are two main types of anti-plagiarism software: local similarity assessment, which compares single text fragments one at a time, and global similarity assessment, which considers larger parts of the text simultaneously. Within these, several strategies can be used to detect the fraud:
Fingerprinting: the text is divided (digested) into multiple fragments, which are compared against a database where books, articles and essays have been stored using a similar digestion criterion. The program then identifies any identical fragments. This approach is the gold standard, and the majority of today’s plagiarism search engines rely on this strategy.
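The digestion step described above can be illustrated with a minimal sketch. It assumes fragments of `k` consecutive words hashed with SHA-1; the fragment length, the hash and the function names `fingerprint` and `shared_fraction` are illustrative assumptions, not the scheme used by any particular search engine.

```python
import hashlib
import re


def fingerprint(text: str, k: int = 5) -> set[str]:
    """Return the set of hashed k-word fragments ("digests") of a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    fragments = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Assumption: a truncated SHA-1 digest stands in for the stored fingerprint.
    return {hashlib.sha1(f.encode("utf-8")).hexdigest()[:16] for f in fragments}


def shared_fraction(suspect: str, source: str, k: int = 5) -> float:
    """Fraction of the suspect's fingerprints that also occur in the source."""
    suspect_fp = fingerprint(suspect, k)
    if not suspect_fp:
        return 0.0
    return len(suspect_fp & fingerprint(source, k)) / len(suspect_fp)


source = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "as we know the quick brown fox jumps over the lazy dog every day"
print(shared_fraction(copied, source))
```

In a real system the source fingerprints would be precomputed and stored in a database, so that only the suspect document needs to be digested at query time.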
Substring matching: this technique is rather similar to fingerprinting. However, it requires longer fragments and more complicated algorithms. The computational cost of this method is therefore significant, which puts it beyond the reach of the general user.
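As a rough illustration of substring matching, the sketch below finds the longest run of consecutive words shared by two texts. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for the more elaborate index structures a production system would likely need; the helper name `longest_shared_run` is hypothetical.

```python
from difflib import SequenceMatcher


def longest_shared_run(suspect: str, source: str) -> str:
    """Return the longest run of consecutive words found in both texts."""
    a = suspect.lower().split()
    b = source.lower().split()
    # autojunk=False disables difflib's heuristic of ignoring very frequent items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


source = "the quick brown fox jumps over the lazy dog"
suspect = "yesterday the quick brown fox jumps happily"
print(longest_shared_run(suspect, source))
```

Comparing every suspect document against every stored source in this pairwise way is what makes the approach expensive at scale.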
Bag of words: an algorithm encodes the vocabulary of a document as a probability distribution, essentially a sort of histogram. The histograms from different texts are then compared and possible similarities are identified.
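A minimal sketch of the histogram comparison follows. It assumes raw word counts as the histogram and cosine similarity as the comparison measure; both are common choices, but the text does not say which measure real tools use.

```python
import math
import re
from collections import Counter


def histogram(text: str) -> Counter:
    """Count how often each word occurs, ignoring case, order and punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = histogram("the cat sat on the mat")
shuffled = histogram("on the mat the cat sat")
unrelated = histogram("quantum fields vibrate silently")
print(cosine_similarity(original, shuffled))
print(cosine_similarity(original, unrelated))
```

Because word order is discarded, a text whose sentences have merely been rearranged still scores as highly similar, while texts with no words in common score zero.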
Citation-based analysis: this method does not consider the text, but only the citations it contains. For each paper, it builds a tree diagram of the citations, recursively identifying the sequence of cited papers. The diagrams from different articles are then compared and similarities are revealed. This approach is pioneering, as similarities in citations could reveal a possible plagiarism of ideas. Unfortunately, citation-based analysis is still in its embryonic phase and its efficiency is very low. Nonetheless, its future applications and improvements may be considerable.
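One simplified way to compare citation patterns is sketched below. Instead of the full citation tree, it assumes each paper is reduced to the ordered list of works it cites and looks for the longest sequence of citations appearing in the same order in both papers. The citation keys are invented for illustration.

```python
def longest_common_citation_sequence(a: list[str], b: list[str]) -> list[str]:
    """Longest sequence of citations that appears in the same order in both lists."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the length of the best sequence for a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the sequence itself.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


# Hypothetical citation keys, in order of appearance in each paper.
paper_a = ["smith2001", "lee2005", "wong2010", "diaz2012"]
paper_b = ["lee2005", "patel2008", "wong2010", "diaz2012", "smith2001"]
print(longest_common_citation_sequence(paper_a, paper_b))
```

A long shared sequence between two papers with little textual overlap would be a reason for a human reviewer to look more closely, not proof of plagiarism in itself.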
Stylometry: this method uses a statistical approach to express the author’s style as a set of parameters. The parameters (stylometric models) filed in a database are then compared against those extracted from the text, and any similarities are pinpointed. It is especially useful when the paper is particularly long (in the order of thousands of pages).
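To make the idea of stylistic parameters concrete, the sketch below extracts three simple features and measures the distance between two profiles. The choice of features (average sentence length, average word length, vocabulary richness) and the unweighted Euclidean distance are assumptions made for illustration; real stylometric models use many more parameters and would normally scale them.

```python
import math
import re


def style_profile(text: str) -> dict[str, float]:
    """Express a text's style as a few numeric parameters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Share of distinct words: a crude measure of vocabulary richness.
        "type_token_ratio": len(set(words)) / len(words),
    }


def profile_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Euclidean distance between two profiles (smaller means more similar)."""
    return math.sqrt(sum((p[key] - q[key]) ** 2 for key in p))


profile = style_profile("I came. I saw. I conquered.")
print(profile)
```

With so few words the numbers are unstable, which is consistent with the remark above that the method works best on very long texts.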
Trans-language: this method searches for text translated from other languages. The technology is at an early stage, but its applications should be taken into consideration in the very near future.
Despite their theoretical efficacy, anti-plagiarism software products have strong practical limitations.
False negatives: at present, it is almost impossible for these programs to detect a text which has been rewritten, translated, paraphrased or expressed with synonyms (non-“word-for-word” plagiarism). John Barrie, co-creator of Turnitin, one of the most widely used plagiarism search engines in the world, stated that a rewritten text would not be detected by his program if at least 1 in every 3 words has been purposely changed.
False positives: proverbial expressions or customary terminology may frequently be flagged as fraudulent, even though common sense accepts them as non-fraudulent.
To sum up, plagiarism search engines are an invaluable tool for detecting crass copying in a simple and effective way, and crass copying fortunately represents the majority of cases. By contrast, at their present state of development they are ineffective against more refined types of fraud, as depicted in Figure 14. Therefore, wise human interpretation and supervision remain of critical importance.
To make things even harder, some technical tricks allow users to further elude anti-plagiarism products. Text files in the most common formats, such as PDF or Microsoft Word®, can be re-encoded by advanced users so that the visual rendering is unchanged while the file becomes impenetrable to most anti-plagiarism tools. The most effective tricks are:
- Alteration of the character map of the text font;
- The use of glyphs: these are font characters that are rendered on screen with the same appearance as the letters of the alphabet, but are encoded by a different code set. In other words, the same letters are associated with character codes that differ from the usual maps, such as ASCII or Unicode (for example in its UTF-8 encoding). Hence, when the text is compared with strings composed of normal alphabet characters, the glyph text is found to be totally different, although it looks exactly the same, as shown in Figure 15;
- Text conversion into Bézier curves, that is, curves which trace the shape of each letter, so that the text is stored as graphics rather than as character codes.
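The effect of the glyph trick can be reproduced in a simplified form at the Unicode level. The sketch below is only an analogue: it swaps the Latin letter "a" for the visually identical Cyrillic "а" rather than remapping a font inside a PDF, and the helper `non_latin_letters` is a hypothetical check, not a feature of any named tool.

```python
import unicodedata

original = "plagiarism"
# Replace Latin "a" (U+0061) with the look-alike Cyrillic "а" (U+0430).
disguised = original.replace("a", "\u0430")

# The two strings render alike but no longer compare equal.
print(original == disguised)


def non_latin_letters(text: str) -> list[tuple[str, str]]:
    """List letters whose Unicode name does not start with LATIN."""
    return [
        (ch, unicodedata.name(ch))
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


print(non_latin_letters(disguised))
```

A check of this kind only makes sense for text expected to be in a Latin-script language, and it does nothing against text stored as curves, where no character codes exist at all.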
Such techniques can be overcome most simply by running Optical Character Recognition (OCR) software before the analysis with plagiarism search engines. A recent investigation by Dr. James Heather of the University of Surrey suggests that every university should routinely apply this method in order to detect and rule out a higher number of misleading papers.
The most widely used anti-plagiarism products are PlagTracker, Viper, Scan my Essays, AntiPlagiarism, DupliChecker, PaperRater, Plagiarisma.net, Plagiarism Checker, Plagium, SeeSources and Plagiarism Detector. They are freely accessible, and it is strongly recommended to always run the analysis on multiple websites, so that the number of false negatives is minimized.
This essay was donated by a student on 4.09.2018 in exchange for a free plagiarism scan.