I need to walk through a folder with approximately ten thousand files. My old vbscript is very slow in handling this. Since I've started using Ruby and Python since then, I made a benchmark between the three scripting languages to see which would be the best fit for this job.
The results of the tests below on a subset of 4500 files on a shared network are
Python: 106 seconds
Ruby: 5 seconds
Vbscript: 124 seconds
That Vbscript would be slowest was no surprise but I can't explain the difference between Ruby and Python. Is my test for Python not optimal? Is there a faster way to do this in Python?
The test for thumbs.db is just for the test, in reality there are more tests to do.
I needed something that checks every file on the path and doesn't produce too much output to not disturb the timing. The results are a bit different each run but not by much.
#python2.7.0
import os
def recurse(path):
for (path, dirs, files) in os.walk(path):
for file in files:
if file.lower() == "thumbs.db":
print (path+'/'+file)
if __name__ == '__main__':
import timeit
path = '//server/share/folder/'
print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))
'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename="thumbs.db"
sub recurse(folder)
for each file in folder.Files
if lCase(file.name) = myLCfilename then
wscript.echo file
end if
next
for each subfolder in folder.SubFolders
call Recurse(subfolder)
next
end Sub
set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start
#ruby1.9.3
require 'benchmark'
def recursive(path, bench)
bench.report(path) do
Dir["#{path}/**/**"].each{|file| puts file if File.basename(file).downcase == "thumbs.db"}
end
end
path = '//server/share/folder/'
Benchmark.bm {|bench| recursive(path, bench)}
EDIT: since i suspected the print caused a delay i tested the scripts with printing all 4500 files and also printing none, the difference remains, R:5 P:107 in the first case and R:4.5 P:107 in the latter
EDIT2: based on the answers and comments here a Python version that in some cases could run faster by skipping folders
import os
def recurse(path):
for (path, dirs, files) in os.walk(path):
for file in files:
if file.lower() == "thumbs.db":
print (path+'/'+file)
def recurse2(path):
for (path, dirs, files) in os.walk(path):
for dir in dirs:
if dir in ('comics'):
dirs.remove(dir)
for file in files:
if file.lower() == "thumbs.db":
print (path+'/'+file)
if __name__ == '__main__':
import timeit
path = 'f:/'
print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))
#6.20102692
print(timeit.timeit('recurse2("'+path+'")', setup="from __main__ import recurse2", number=1))
#2.73848228
#ruby 5.7