I'm trying to deal with many files in Python. I first need to get a list of all the files in a single directory. At the moment, I'm using:
os.listdir(dir)
However, this isn't feasible, since the directory I'm searching has upwards of 81,000 files in it, totalling almost 5 gigabytes.
What's the best way of stepping through each file one by one, without Windows deciding that the Python process is not responding and killing it? Because that tends to happen.
It's being run on a 32-bit Windows XP machine, so clearly it can't address more than 4 GB of RAM.
Any ideas from anyone on how to solve this problem?
You may want to try using the scandir module:

scandir is a module which provides a generator version of os.listdir() that also exposes the extra file information the operating system returns when you iterate a directory. scandir also provides a much faster version of os.walk(), because it can use the extra file information exposed by the scandir() function.
There's an accepted PEP proposing to merge it into the Python standard library, so it seems to have some traction.
Simple usage example from their docs:
import os

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name
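For your case (stepping through plain files rather than subdirectories), a minimal sketch along the same lines might look like this, assuming Python 3.5+ where scandir has been merged into os as os.scandir; the directory path below is just a placeholder:

import os

def iter_files(path):
    """Yield (name, size) for each regular file in path, one at a time."""
    for entry in os.scandir(path):
        # On Windows, is_file() and stat() can generally reuse the data the
        # OS already returned while listing, avoiding an extra per-file call.
        if entry.is_file():
            yield entry.name, entry.stat().st_size

# Hypothetical directory; only one entry is held in memory at a time.
for name, size in iter_files(r'C:\big_directory'):
    pass  # process each file here

Because iter_files is a generator, the 81,000 filenames are never collected into a single list, which should keep the process responsive.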
You could use glob.iglob to avoid reading the entire list of filenames into memory. This returns a generator object, allowing you to step through the filenames in your directory one by one:
import glob

files = glob.iglob(pathname + '/*')  # pathname is the directory to search
for f in files:
    pass  # do something with f
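If the files share an extension, a sketch of the same idea with a narrower pattern (the directory and pattern here are placeholders, not from the question) would be:

import glob
import os

directory = r'C:\big_directory'  # placeholder path

# iglob yields one matching name at a time instead of building a list,
# so memory use stays flat even with tens of thousands of files.
for f in glob.iglob(os.path.join(directory, '*.log')):
    size = os.path.getsize(f)
    # process each file here before moving on to the next one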