I'm experimenting with file lookups in Python on a large hard disk. I've been looking at os.walk and glob. I usually use os.walk because I find it much neater, and it seems to be quicker (for directories of a typical size).
Has anyone got experience with both and could say which is more efficient? As I say, glob seems to be slower, but you can use wildcards etc., whereas with os.walk you have to filter the results yourself. Here is an example that looks up core dumps.
import os
import re

core = re.compile(r"core\.\d*")
for root, dirs, files in os.walk("/path/to/dir/"):
    for file in files:
        if core.search(file):
            path = os.path.join(root, file)
            print("Deleting: " + path)
            os.remove(path)
Or
import os
from glob import iglob

for file in iglob("/path/to/dir/core.*"):
    print("Deleting: " + file)
    os.remove(file)
Don't waste time on optimization before measuring/profiling. Focus on making your code simple and easy to maintain.
For example, in your code you precompile the RE, which gives you little or no speed boost, because the re module keeps an internal cache (re._cache) of compiled REs. Note that an optimization made several years ago can make code run slower than the "non-optimized" version. This applies especially to modern JIT-based languages.
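One way to check this rather than guess is to time both variants with timeit. The sketch below is only an illustration: the sample file names and the helper names (NAMES, with_precompiled, with_module_call) are made up, and the numbers will vary by machine and Python version.

```python
import re
import timeit

# Made-up sample of file names: 500 core dumps and 500 other files.
NAMES = ["core.%d" % i for i in range(500)] + ["notes.txt"] * 500

CORE = re.compile(r"core\.\d*")


def with_precompiled(names):
    # Uses the pattern object compiled once above.
    return [n for n in names if CORE.search(n)]


def with_module_call(names):
    # Passes the pattern string each time; the re module looks it up
    # in its internal cache of compiled patterns.
    return [n for n in names if re.search(r"core\.\d*", n)]


if __name__ == "__main__":
    for func in (with_precompiled, with_module_call):
        seconds = timeit.timeit(lambda: func(NAMES), number=200)
        print("%s: %.4f s" % (func.__name__, seconds))
```

Both functions return the same list, so whichever reads better can be kept unless the measurement shows a difference that matters.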
I think even with glob you would still have to use os.walk, unless you know exactly how deep your subdirectory tree is. By the way, the glob documentation says:
"*, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.listdir() and fnmatch.fnmatch() functions"
I would simply go with a
You can use os.walk and still use glob-style matching.
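A minimal sketch of that combination, assuming fnmatch.filter is applied to the file names that os.walk yields for each directory. The function name find_matching is hypothetical, chosen only for this example.

```python
import fnmatch
import os


def find_matching(top, pattern):
    """Yield paths under `top` whose file name matches the glob-style `pattern`."""
    for root, dirs, files in os.walk(top):
        # fnmatch.filter applies the shell-style pattern to the bare names,
        # so the match does not depend on how deep `root` is.
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)
```

Calling find_matching("/path/to/dir/", "core.*") would then yield every matching file at any depth, which a single non-recursive glob pattern does not do.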
Not sure about speed, but obviously since os.walk is recursive, they do different things.
I did some research on a small cache of web pages in 1000 dirs. The task was to count the total number of files in the dirs. The output is:
As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task. The source:
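A minimal sketch of this kind of comparison (not the original benchmark) could look like the following. It assumes the top directory contains only subdirectories, each holding plain, non-hidden files; the function names are made up for this example.

```python
import glob
import os
import time


def count_with_listdir(top):
    # One os.listdir call per subdirectory; looks exactly one level down.
    return sum(len(os.listdir(os.path.join(top, d))) for d in os.listdir(top))


def count_with_glob(top):
    # The pattern "*/*" matches entries exactly one level below top
    # (names starting with a dot are skipped by glob).
    return len(glob.glob(os.path.join(top, "*", "*")))


def count_with_walk(top):
    # os.walk descends the whole tree, however deep it is.
    return sum(len(files) for root, dirs, files in os.walk(top))


def compare(top, repeat=5):
    """Print the file count and elapsed time for each approach."""
    for func in (count_with_listdir, count_with_glob, count_with_walk):
        start = time.time()
        for _ in range(repeat):
            total = func(top)
        print("%s: %d files, %.4f s" % (func.__name__, total, time.time() - start))
```

Running compare on your own directory tree gives figures for your disk and Python version. Note that the three functions only agree on the count when the layout matches the assumptions above.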