I have been working on an implementation of a plagiarism detection engine based on the academic paper behind MOSS (Measure Of Software Similarity).
Link to MOSS
In designing a noise filter for a language like C/C++/Java, I have some decisions to make.
Are keywords relevant for detecting plagiarism, or should they be removed?
Source files in the same language are bound to share the same set of keywords. The paper does not discuss how to deal with them.
How to deal with identifiers?
Replacing all identifiers with a single character 'V', making matches independent of variable names, seems to make sense.
What to do with package imports and library includes?
Whitespace, comments, and punctuation should definitely be stripped.
I worry that after all these operations, the source file will be just a bunch of 'V's and other garbled text.
What operations should the noise filter perform?
Any insights or opinions on the best way to deal with noise?
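To make this concrete, here is a minimal sketch of the filter I have in mind, in Java. The keyword list is just an illustrative subset, and collapsing numeric literals to 'N' is my own addition:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Noise-filter sketch: drop comments, whitespace and punctuation,
    // keep keywords, collapse identifiers to 'V' and numbers to 'N'.
    // The keyword list is only an illustrative subset.
    public class NoiseFilter {
        private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
                "for", "while", "if", "else", "return", "int", "char", "void"));

        // Matches comments, identifiers/keywords, and numeric literals;
        // whitespace and punctuation are simply never matched, i.e. stripped.
        private static final Pattern TOKEN = Pattern.compile(
                "//[^\n]*|/\\*.*?\\*/|[A-Za-z_][A-Za-z0-9_]*|\\d+", Pattern.DOTALL);

        public static String filter(String source) {
            StringBuilder out = new StringBuilder();
            Matcher m = TOKEN.matcher(source);
            while (m.find()) {
                String tok = m.group();
                if (tok.startsWith("/")) continue;   // comment: drop it
                if (Character.isDigit(tok.charAt(0))) out.append("N ");
                else out.append(KEYWORDS.contains(tok) ? tok + " " : "V ");
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(filter("for (i=0; i < 12345; i++) { array[i] = 54321; }"));
            // prints: for V N V N V V V N
        }
    }

As the printed output shows, stripping punctuation really does reduce the fragment to little more than a run of V's and N's, which is exactly my concern above.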
For single functions: compile them, and compare the resulting assembler code or objects.
For a whole program: do the above for all the functions, then build a fuzzy search to find the fragments in a database of known functions and fragments.
So basically, you need to build a compiler which emits a canonicalised representation of its input, similar to P-code, but preferably human-readable.
Some fragments are more characteristic than others. The fragment
    for (i=0; i < 12345; i++) {
        array[i] = 54321;
    }
will probably occur in some form in every program. It is 100% functionally identical to
    j=0;
    while ( j < 12345) {
        foobar[j++] = 54321;
    }
and a compiler would probably produce identical code for both.
There can be differences in variable names, numerical constants, address constants, anything. But the "skeleton" of keywords (-> {comparisons, loops, expressions, assignments, function calls}) will be the same. So: don't drop the keywords; they are the scaffolding of a program.
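To illustrate, here is a hand-written sketch (not real compiler output) of the kind of canonical, P-code-like skeleton both fragments could reduce to; the opcode names are invented for this example:

    import java.util.List;

    // Hand-written canonical skeletons for the two fragments above, in an
    // invented P-code-like notation ('V' = any variable, 'N' = any numeric
    // constant). A canonising front-end could emit this for both loops.
    public class SkeletonDemo {
        public static void main(String[] args) {
            List<String> forLoop = List.of(
                "ASSIGN V, N",     // i = 0
                "LABEL top",
                "CMP V, N",        // i < 12345
                "JUMP_GE end",
                "STORE V[V], N",   // array[i] = 54321
                "INC V",           // i++
                "JUMP top",
                "LABEL end");
            List<String> whileLoop = List.of(
                "ASSIGN V, N",     // j = 0
                "LABEL top",
                "CMP V, N",        // j < 12345
                "JUMP_GE end",
                "STORE V[V], N",   // foobar[j++] = 54321: the store...
                "INC V",           // ...then the increment from j++
                "JUMP top",
                "LABEL end");
            System.out.println(forLoop.equals(whileLoop)); // true: same skeleton
        }
    }

The names and constants differ in the source, but the canonical sequence, and therefore any fingerprint computed from it, is the same.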
There is quite a lot to find on Google if you search for "text fingerprint shingle". A shingle is an x-word sequence (x=7 in many research projects). You build the set of all shingles by moving through the text word by word.
You then build a hash over each shingle and compare the thousands of shingles in a text. It's pretty simple. There are a few things, like special hash functions, that you for sure haven't heard of outside this context.
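A minimal sketch of the idea, assuming x = 7 as above and using Java's built-in String.hashCode() purely to keep the example short; real systems use specialised hash functions and typically keep only a sample of the hashes:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // Word shingling sketch: hash every x-word window of a text and
    // compare two texts by the Jaccard similarity of their hash sets.
    public class ShingleDemo {
        static Set<Integer> shingleHashes(String text, int x) {
            String[] words = text.toLowerCase().split("\\W+");
            Set<Integer> hashes = new HashSet<>();
            for (int i = 0; i + x <= words.length; i++) {
                String shingle = String.join(" ", Arrays.copyOfRange(words, i, i + x));
                hashes.add(shingle.hashCode()); // a real system would use a stronger hash
            }
            return hashes;
        }

        static double jaccard(Set<Integer> a, Set<Integer> b) {
            Set<Integer> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<Integer> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            String a = "the quick brown fox jumps over the lazy dog near the river bank";
            String b = "the quick brown fox jumps over the lazy dog near the old mill";
            System.out.printf("similarity = %.2f%n",
                    jaccard(shingleHashes(a, 7), shingleHashes(b, 7)));
        }
    }

For the two near-identical sentences above this prints a similarity of about 0.56; unrelated texts land near 0.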
Start by reading, for example, the papers below; it's not really rocket science, but not trivial either.
"Text Origin Detection in an Efficient Way" Besnik Fetahu, Andreas Frische
http://resources.mpi-inf.mpg.de/d5/teaching/ws10_11/hir/reports/BesnikFetahu.pdf
"Algorithms for duplicate documents", Andrei Broder
http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf