Finding dead code in large python project [closed]

Posted 2019-01-31 11:39

Question:

I've seen "How can you find unused functions in Python code?", but that's really old and doesn't really answer my question.

I have a large python project with multiple libraries that are shared by multiple entry point scripts. This project has been accreting for many years with many authors, so there's a whole lot of dead code. You know the drill.

I know that finding all dead code is undecidable. All I need is a tool that will find all functions that are not called anywhere. We're not doing anything fancy with calling functions based on the string of the function name, so I'm not worried about anything pathological...

I just installed pylint, but it appears to be file-based and doesn't pay much attention to inter-file dependencies, or even function dependencies.

Clearly, I could grep for def in all of the files, get all of the function names from that, and do a grep for each of those function names. I'm just hoping that there's something a little smarter than that out there already.

ETA: Please note that I don't expect or want something perfect. I know my halting-problem proof just as well as anyone (no really, I taught theory of computation; I know when I'm looking at something that is recursively enumerable). Anything that tries to approximate it by actually running the code is going to take way too long. I just want something that syntactically goes through the code and says "This function is definitely used. This function MIGHT be used, and this function is definitely NOT used, no one else even seems to know it exists!" And the first two categories aren't important.

Answer 1:

You might want to try out vulture. It can't catch everything due to Python's dynamic nature, but it catches quite a bit without needing a full test suite like coverage.py and others need to work.



Answer 2:

Try running Ned Batchelder's coverage.py.

Coverage.py is a tool for measuring code coverage of Python programs. It monitors your program, noting which parts of the code have been executed, then analyzes the source to identify code that could have been executed but was not.
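
To make the idea concrete, here is a minimal sketch of what "noting which parts of the code have been executed" means at function granularity. It uses the standard library's sys.settrace rather than coverage.py itself (coverage.py works at line level and is far more thorough), and the names record_calls, used, unused and main are made up for illustration.

```python
import sys


def record_calls(entry_point):
    """Run entry_point() and return the names of all Python functions entered."""
    called = set()

    def tracer(frame, event, arg):
        # A "call" event fires each time a new Python frame is entered.
        if event == "call":
            called.add(frame.f_code.co_name)
        return None  # no per-line tracing needed for this sketch

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        entry_point()
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return called


# Hypothetical toy program: `unused` is never reached from `main`.
def used():
    return 1


def unused():
    return 2


def main():
    return used()


called = record_calls(main)
never_ran = {"used", "unused", "main"} - called
print(sorted(never_ran))
```

Note that this only tells you what one particular run did not reach, so the answer is only as good as the entry points or tests you drive it with.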



Answer 3:

It is very hard to determine which functions and methods are called without executing the code, even if the code doesn't do any fancy stuff. Plain function invocations are rather easy to detect, but method calls are really hard. Just a simple example:

class A(object):
    def f(self):
        pass

class B(A):
    def f(self):
        pass

a = []
a.append(A())
a.append(B())
a[1].f()

Nothing fancy going on here, but any script that tries to determine whether A.f() or B.f() is called will have a rather hard time doing so without actually executing the code.

While the above code doesn't do anything useful, it certainly uses patterns that appear in real code -- namely putting instances in containers. Real code will usually do even more complex things -- pickling and unpickling, hierarchical data structures, conditionals.
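
To illustrate what a purely syntactic tool gets to see for the last lines of that example, here is a small sketch using the standard ast module. The variable names source and method_calls are just for illustration. For a[1].f() the parser reports an attribute named f on a subscript expression, with nothing that says whether the object is an A or a B.

```python
import ast

source = """
a = []
a.append(A())
a.append(B())
a[1].f()
"""

method_calls = []
for node in ast.walk(ast.parse(source)):
    # Method-style calls show up as Call nodes whose func is an Attribute.
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        receiver_kind = type(node.func.value).__name__
        method_calls.append((node.lineno, node.func.attr, receiver_kind))

for lineno, attr, receiver_kind in method_calls:
    print(f"line {lineno}: .{attr}() called on a {receiver_kind} expression")
```

All the syntax tree gives you is the method name, so a static tool can at best say "some f is called somewhere".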

As stated before, just detecting plain function invocations of the form

function(...)

or

module.function(...)

will be rather easy. You can use the ast module to parse your source files. You will need to record all imports, and the names used to import other modules. You will also need to track top-level function definitions and the calls inside these functions. This will give you a dependency graph, and you can use NetworkX to detect the connected components of this graph.

While this might sound rather complex, it can probably be done with less than 100 lines of code. Unfortunately, almost all major Python projects use classes and methods, so it will be of little help.
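
A rough sketch of that approach, under some loud assumptions: it handles a single source string, only calls of the plain form function(...), and it skips import tracking entirely. Instead of NetworkX it does a small breadth-first search from the module-level code, which is enough to find what is unreachable. The function names build_call_graph and unreachable_functions are made up for this example.

```python
import ast
from collections import deque


def build_call_graph(source):
    """Return ({top-level function: names it calls}, names called at module level)."""
    graph = {}
    module_level_calls = set()
    for stmt in ast.parse(source).body:
        calls = {
            node.func.id
            for node in ast.walk(stmt)
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
            graph[stmt.name] = calls
        else:
            module_level_calls |= calls
    return graph, module_level_calls


def unreachable_functions(source):
    """List top-level functions that no module-level code reaches, even indirectly."""
    graph, roots = build_call_graph(source)
    seen = set()
    queue = deque(name for name in roots if name in graph)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        queue.extend(callee for callee in graph[name] if callee in graph)
    return sorted(set(graph) - seen)


example = '''
def helper():
    return 1

def main():
    return helper()

def orphan():
    return stale()

def stale():
    return 2

if __name__ == "__main__":
    main()
'''

dead = unreachable_functions(example)
print(dead)
```

Walking the graph from the entry code also catches functions like stale here, which is called, but only from another dead function.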



Answer 4:

Here's the solution I'm using at least tentatively:

grep 'def ' *.py > defs
# ...
# edit defs so that it just contains the function names
# ...
for f in `cat defs`; do
    # write the function name, then the number of lines that mention it
    echo "$f" >> defCounts
    cat *.py | grep -c "$f" >> defCounts
    echo >> defCounts
done

Then I look at the individual functions that have very few references (fewer than 3, say).

It's ugly, and it only gives me approximate answers, but I think it's good enough for a start. What are your thoughts?
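
For what it's worth, the same counting idea can be sketched in Python, which removes the manual "edit defs" step. This is only an approximation of the shell version: it counts whole-word occurrences rather than matching lines, and the in-memory sources dict is a hypothetical stand-in for reading the real *.py files. The name reference_counts is made up.

```python
import re
from collections import Counter


def reference_counts(sources):
    """sources maps filename -> source text; returns Counter of name -> occurrences."""
    names = set()
    for text in sources.values():
        # Pull the function name out of every `def name(` line.
        names.update(re.findall(r"^\s*def\s+(\w+)\s*\(", text, flags=re.MULTILINE))
    counts = Counter()
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        counts[name] = sum(len(pattern.findall(text)) for text in sources.values())
    return counts


# Hypothetical two-file project, kept in memory for the example.
sources = {
    "lib.py": "def used():\n    return 1\n\ndef lonely():\n    return 2\n",
    "app.py": "from lib import used\nprint(used())\n",
}

counts = reference_counts(sources)
# A count of 1 means the only occurrence is the definition itself.
suspects = sorted(name for name, n in counts.items() if n == 1)
print(suspects)
```

Like the grep loop, this matches on text alone, so a name that also appears in a comment or string is counted as a reference.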



Answer 5:

With the following line you can list all function definitions that are obviously not used as an attribute, a function call, a decorator or a return value. So it is approximately what you are looking for. It is not perfect and it is slow, but I never got any false positives. (On Linux you may have to replace ack with ack-grep.)

for f in $(ack --python --ignore-dir tests -h --noheading "def ([^_][^(]*).*\):\s*$" --output '$1' | sort| uniq); do c=$(ack --python -ch "^\s*(|[^#].*)(@|return\s+|\S*\.|.*=\s*|)"'(?<!def\s)'"$f\b"); [ $c == 0 ] && (echo -n "$f: "; ack --python --noheading "$f\b"); done


Answer 6:

If your code is covered by a lot of tests (which is quite useful in any case), run them with a code-coverage plugin and you can then see the unused code :)



Answer 7:

IMO that could be achieved pretty quickly with a simple pylint plugin that:

  • remembers each analysed function / method (/ class?) in a set S1
  • tracks each called function / method (/ class?) in a set S2
  • displays S1 - S2 in a report

Then you would have to run pylint on your whole code base to get something that makes sense. Of course, as said, the result would need to be checked, as inference failures and the like may introduce false positives. Anyway, that would probably greatly reduce the number of greps to be done.
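
The S1 - S2 idea can be tried out without writing a plugin. The following is a sketch using the standard ast module, not pylint's plugin API, so it has no type inference at all: a method counts as used as soon as any attribute with the same name is read anywhere. The function name defined_and_used and the Report class are invented for the example.

```python
import ast


def defined_and_used(source):
    """Return (S1, S2): names that are defined, and names that are read."""
    defined = set()  # S1: every function, method and class definition
    used = set()     # S2: every plain name or attribute name that is loaded
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            used.add(node.id)
        elif isinstance(node, ast.Attribute) and isinstance(node.ctx, ast.Load):
            used.add(node.attr)
    return defined, used


example = '''
class Report:
    def render(self):
        return self._header()

    def _header(self):
        return "h"

    def export(self):
        return "e"

print(Report().render())
'''

s1, s2 = defined_and_used(example)
report = sorted(s1 - s2)
print(report)
```

Matching by bare name errs on the side of calling things used, so anything that lands in the report is a fairly strong candidate, but it still needs a manual check.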

I don't have much time to do it myself yet, but anyone who tries would find help on the python-projects@logilab.org mailing list.