What duplication detection threshold do you use?

Posted 2019-04-10 21:12

Question:

We all agree that duplication is evil and should be avoided (the Don't Repeat Yourself principle). To ensure that, static analysis tools should be used, such as Simian (multi-language) or Clone Detective (a Visual Studio add-in).

I just read Ayende's post about Kobe, where he says:

8.5% of Kobe is copy & pasted code. And that is with the sensitivity dialed high; if we set the threshold to 3, which is what I commonly do, it goes up to 12.5%.

I think that 3 is a very low threshold. In my company we offer code quality analysis as a service; our default threshold for duplication is set to 20, and there are still a lot of duplications. I can't imagine setting it to 3: it would be impossible for our customers to even think about fixing them.
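For concreteness, this is roughly what dialing that threshold up or down looks like with a tool like Simian, where the threshold is the minimum number of duplicated lines to report. The flag syntax below is from memory and should be treated as an assumption; check the documentation for your version.

    # Illustrative invocations only; '-threshold' is assumed to mean the
    # minimum number of duplicated lines Simian will report as a match.
    java -jar simian.jar -threshold=20 "src/**/*.cs"   # the lenient default described above
    java -jar simian.jar -threshold=3  "src/**/*.cs"   # the strict setting Ayende uses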

I understand Ayende's opinion about Kobe: it's an official sample and is marketed as “intended to guide you with the planning, architecting, and implementing of Web 2.0 applications and services.” so the expectation of quality is high.

But for your own projects, what minimum threshold do you use for duplication?

Related question: How fanatically do you eliminate Code Duplication?

Answer 1:

Three is a good rule of thumb, but it depends. Refactoring to eliminate duplication often involves trading conceptual simplicity of the codebase and API for a smaller codebase that is more maintainable once someone does understand it. I generally evaluate things in this light.

At one extreme, if fixing the duplication makes the code more readable and adds little or nothing to the conceptual complexity of the code, then any duplication is unacceptable. An example of this would be whenever the duplicated code factors out neatly into a simple referentially transparent function that does something that's easy to explain and name.
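As a minimal sketch of that easy case (the class and method names are made up for illustration), the duplicated arithmetic below becomes a small pure helper with an obvious name, so even two occurrences are worth folding together:

    // Hypothetical example: the duplicated discount arithmetic factors out
    // into one referentially transparent helper that is easy to explain and name.
    public final class PriceMath {

        // Pure function: same inputs always give the same output, no side effects.
        static double discounted(double price, double discountPercent) {
            return price - price * (discountPercent / 100.0);
        }

        // Before refactoring, both call sites repeated the expression inline.
        static double invoiceTotal(double listPrice) {
            return discounted(listPrice, 10.0);
        }

        static double quoteTotal(double listPrice) {
            return discounted(listPrice, 15.0);
        }
    }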

When a more complex, heavyweight solution, such as metaprogramming or OO design patterns, is necessary, I may allow 4 or 5 instances, especially if the duplicated snippet is small. In those cases I feel the conceptual complexity of the solution makes the cure worse than the disease until there are really a lot of instances.

In the most extreme case, where the codebase I'm working with is a rapidly evolving prototype and I don't know enough about the direction the project may take to draw abstraction lines that are both reasonably simple and reasonably future-proof, I just give up. In a codebase like this, I think it's better to focus on expediency and getting things done than on good design, even if the same piece of code is duplicated 20 times. The parts of the prototype that are creating all that duplication are often the ones that will be discarded relatively soon anyway, and once you know which parts of the prototype will be kept, you can always refactor them. Without the additional constraints created by the parts that will be discarded, refactoring is often easier at this stage.



Answer 2:

I don't know what a good 'metric' for it is, but what I usually strive for is:

  • if you have the same code in two places, and
  • the code is the same by intent (rather than merely coincidentally the same)

then refactor to get rid of the duplication. All duplication is bad. I'll rarely let code be in two places, and at three, it definitely has to go.
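A small sketch of that distinction, with made-up domain rules, might look like this; the point is that only duplication by intent should be merged:

    // Hypothetical illustration of "same by intent" vs. "coincidentally the same".
    final class DuplicationIntent {

        // Coincidentally the same: both limits happen to be 20 today, but the
        // shipping rule and the password rule can change independently, so
        // merging them would couple unrelated concepts.
        static boolean shippingWeightOk(int kilograms)  { return kilograms <= 20; }
        static boolean passwordLengthOk(int characters) { return characters <= 20; }

        // Same by intent: every call site means the single business rule
        // "orders above 100 units need approval", so the check lives in
        // exactly one place instead of being repeated inline.
        static boolean needsApproval(int units) {
            return units > 100;
        }
    }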



Answer 3:

I'm personally pretty fanatical about it. I try to design my projects to avoid code duplication. My goal is to get the threshold into the low single digits; if I can't get there, it means my design isn't good enough and I need to go back to the drawing board or refactor.



Answer 4:

It depends on the programming language. (The "Clone Detective" author seems to recognize this: "programming language constraints" is one of the boxes in his first presentation.)

In a Lisp program, any duplicate expression is easily subject to refactoring -- I guess you'd call that a threshold of 2. Everything is composed of expressions, and there are macros to translate expressions, so there's rarely an excuse for duplicating anything even once. (About the only thing I can think of that would be hard to extract is something like LOOP clauses, and in fact many people advocate avoiding LOOP for exactly that reason.)

In many other languages, programs consist of classes which have methods which have statements, so it's harder to just pull out an expression and use it in two different files. Often it means changing the structure of the thing as you extract it. Often there's also a requirement of type safety, which can be limiting (unless you want to write a whole lot of reflection code all the time to escape it, but if you do, you shouldn't be using a static language). If I made my current statically-typed program perfectly DRY, it would be neither shorter nor easier to maintain.
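To make that concrete, here is a hedged sketch (all names invented) of the ceremony involved: sharing one duplicated expression across two unrelated classes in a statically-typed language like Java usually means inventing a new home for it and a type-safe signature, rather than simply lifting the expression as you would with a Lisp macro.

    import java.util.List;

    // Hypothetical sketch: the shared expression now needs its own class,
    // a method signature, and a generic bound before two unrelated files
    // can reuse it.
    final class Stats {
        static <T extends Comparable<T>> T maxOrNull(List<T> values) {
            T best = null;
            for (T value : values) {
                if (best == null || value.compareTo(best) > 0) {
                    best = value;
                }
            }
            return best;
        }
    }

    class InvoiceReport {
        Integer largestLineCount(List<Integer> lineCounts) { return Stats.maxOrNull(lineCounts); }
    }

    class SensorLog {
        Double peakReading(List<Double> readings) { return Stats.maxOrNull(readings); }
    }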

I guess the upshot of this is that what we really want is "easy to maintain". Sometimes, in some languages, that means "just put a comment here that says what you copied and why". DRY is a good indicator of maintainable code. If you're repeating a lot, that's not a good sign. But chasing any single statistic also tends to be bad -- otherwise, we'd solve all our problems by simply optimizing for that.



Answer 5:

I think we need to combine:

  • Number of lines that are duplicated
  • Number of times the duplicate has been copied
  • How “close” the duplicates are to each other
    e.g. if they are in different products that just happen to be in the same source code control system, that is very different from if they are in the same method.
  • Time since any of the methods that contain the duplicated code were last changed.

So as to get a good trade-off between the cost and the benefit of removing the duplicates.
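One rough way to combine those factors, purely as an illustrative sketch (the weights and names below are invented, not from the original answer), is to turn them into a single priority score and fix the highest-scoring duplicates first:

    import java.time.Duration;

    // Hypothetical scoring sketch: rank duplicates by the likely benefit of
    // removing them, using the four factors listed above.
    final class DuplicateScore {

        enum Proximity { SAME_METHOD, SAME_CLASS, SAME_MODULE, DIFFERENT_PRODUCT }

        // The weights are made-up illustrations; a real team would calibrate them.
        static double priority(int duplicatedLines, int copies,
                               Proximity proximity, Duration sinceLastChange) {
            double size = duplicatedLines * (copies - 1);            // code that could disappear
            double closeness;                                        // closer copies are cheaper to merge
            switch (proximity) {
                case SAME_METHOD: closeness = 1.0; break;
                case SAME_CLASS:  closeness = 0.8; break;
                case SAME_MODULE: closeness = 0.5; break;
                default:          closeness = 0.1; break;
            }
            double churn = sinceLastChange.toDays() < 30 ? 1.5 : 0.5; // recently touched code will change again
            return size * closeness * churn;
        }
    }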



Answer 6:

Our CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by the language's syntax. It supports Java, C#, COBOL, C++, PHP and many other languages.

It accepts a number of parameters to define "what is a clone?", including:

  • Similarity threshold, controlling how similar two blocks of code must be to be declared clones (typically 95% is good)
  • Minimum clone size, in lines (3 tends to be a good choice)
  • Number of parameters, i.e. distinct changes to the text (5 tends to be a good choice)

With these settings, it tends to find 10-15% redundant code in virtually everything it processes.
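As a rough illustration of how those three knobs interact (this is not CloneDR's actual API, just a hedged sketch of the decision they control):

    // Illustrative sketch only, not CloneDR's API: a candidate pair of code
    // blocks is reported as a clone when it clears all three thresholds.
    final class CloneFilter {

        static boolean isReportableClone(double similarity,    // 0.0 .. 1.0 between the two blocks
                                         int lines,             // size of the candidate block in lines
                                         int parameterCount) {  // distinct textual differences
            double similarityThreshold = 0.95; // "typically 95% is good"
            int    minimumLines        = 3;    // minimum clone size
            int    maximumParameters   = 5;    // allowed near-miss differences

            return similarity >= similarityThreshold
                    && lines >= minimumLines
                    && parameterCount <= maximumParameters;
        }
    }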