Similar code detector

Posted 2019-03-11 03:33

Question:

I'm searching for a tool that can compare source code for similarity.

We have a very trivial system right now that produces a huge number of false positives, and the real positives easily get buried among them.

My requirements are:

  • reasonably small number of false positives
  • good detection rate (yes, these two work against each other)
  • ideally with a more complex output than just a single value
  • usable for C (C99) and C++ (C++03 and optimally C++11)
  • still maintained
  • usable for comparing two source files against each other
  • usable in non-interactive mode

EDIT:

To avoid confusion, the following two code snippets are equivalent and should be detected as such:

for (int i = 0; i < 10; i++) { bla; }

int i = 0; while (i < 10) { bla; i++; }

The same here:

int x = 10; y = x + 5;

int a = 10; y = a + 5;

Answer 1:

I've used MOSS in the past: http://theory.stanford.edu/~aiken/moss/ to detect plagiarized code. Since it works on a semantic level, it will detect the situations you presented above. The tool is language-aware, so comments are not considered in the analysis, and it goes a long way in detecting code that has been modified through simple search-and-replace of variable and/or function names.
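As a rough illustration of why renaming variables alone does not defeat a token-based comparison, here is a minimal sketch in C++. It is not MOSS's actual algorithm; the function name `normalize_tokens` and the tiny keyword list are assumptions made up for this example. It replaces every non-keyword identifier with the placeholder `ID`, so the `x`/`a` pair from the question produces the same token sequence.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: split C-like source into tokens and replace every
// identifier that is not in a small keyword list with the placeholder "ID".
// Integer literals and single punctuation characters are kept as they are.
std::vector<std::string> normalize_tokens(const std::string& src) {
    // Deliberately incomplete keyword list, just enough for the example.
    static const std::set<std::string> keywords = {
        "int", "for", "while", "if", "else", "return"};
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace never becomes a token
        } else if (std::isalpha(c) || c == '_') {
            std::size_t j = i;
            while (j < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[j])) || src[j] == '_'))
                ++j;
            std::string word = src.substr(i, j - i);
            out.push_back(keywords.count(word) ? word : "ID");
            i = j;
        } else if (std::isdigit(c)) {
            std::size_t j = i;
            while (j < src.size() && std::isdigit(static_cast<unsigned char>(src[j])))
                ++j;
            out.push_back(src.substr(i, j - i));
            i = j;
        } else {
            out.push_back(std::string(1, src[i]));  // one punctuation character
            ++i;
        }
    }
    return out;
}
```

Calling `normalize_tokens("int x = 10; y = x + 5;")` and `normalize_tokens("int a = 10; y = a + 5;")` gives equal vectors. Note the limits of this sketch: because all identifiers collapse to one placeholder it will also produce false positives, and it does nothing for the `for`/`while` pair, which needs a comparison that understands control flow.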

Note: I used the tool a few years ago when I taught computer science in grad school, and it worked wonderfully in detecting code that had been yanked from the internet. Here is a well-documented account of a similar application: http://fie2012.org/sites/fie2012.org/history/fie99/papers/1110.pdf

If you google "measure software similarity", you should find a few more useful hits: http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/detectiontools_sourcecode.html



Answer 2:

Your problem, in computer science terminology, may be stated as source code plagiarism detection. A good start would be to read this article on Dr. Dobb's: Detecting Source-Code Plagiarism. It lists algorithms for detecting plagiarism in source code.

Note: What you have asked for is indeed a tough computing problem :)



Answer 3:

Maybe the Copy/Paste Detector (CPD) from PMD?



Answer 4:

You could try duplo. It finds lines that are common to several files. It has some ability to ignore whitespace changes, but it doesn't detect code with renamed variables, so it is more of a cleanup aid than a help in detecting plagiarism.
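To show what line-based matching can and cannot catch, here is a minimal sketch in C++. It is not duplo's implementation; the helper names `strip_whitespace`, `normalized_lines` and `common_line_count` are made up for this example, and removing all whitespace is a cruder normalization than a real tool would use.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: remove all whitespace from a line so that
// formatting changes are ignored.
std::string strip_whitespace(const std::string& line) {
    std::string out;
    for (unsigned char c : line)
        if (!std::isspace(c)) out.push_back(static_cast<char>(c));
    return out;
}

// Collect the distinct non-empty lines of a source text, whitespace removed.
std::set<std::string> normalized_lines(const std::string& text) {
    std::set<std::string> lines;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::string s = strip_whitespace(line);
        if (!s.empty()) lines.insert(s);
    }
    return lines;
}

// Count how many distinct normalized lines two texts have in common.
std::size_t common_line_count(const std::string& a, const std::string& b) {
    const std::set<std::string> la = normalized_lines(a);
    const std::set<std::string> lb = normalized_lines(b);
    std::size_t n = 0;
    for (const std::string& s : la) n += lb.count(s);
    return n;
}
```

With this sketch, `"int x = 10;\ny = x + 5;\n"` and a reformatted copy such as `"int  x=10;\n  y = x+5;\n"` share both lines, while the renamed version `"int a = 10;\ny = a + 5;\n"` shares none, which is the weakness described above.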



Answer 5:

I have started using JPlag (https://github.com/jplag/jplag) to check code similarity and compare students' work in Java and text files. It works well for detecting identical code structure and variable substitution.