How to use glob to read only a limited set of files?
I have JSON files named with numbers from 50 to 20000 (e.g. 50.json, 51.json, 52.json ... 19999.json, 20000.json) within the same directory. I want to read only the files numbered from 15000 to 18000.
To do so I'm using glob, as shown below, but it returns an empty list every time I try to filter by number. I've tried my best to follow the documentation (https://docs.python.org/2/library/glob.html), but I'm not sure what I'm doing wrong.
>>> directory = "/Users/Chris/Dropbox"
>>> read_files = glob.glob(directory+"/[15000-18000].*")
>>> print read_files
[]
Also, what if I wanted files with any number greater than 18000?
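You are using the glob syntax incorrectly; a [..] sequence matches a single character, not a range of numbers. The following glob uses one character class per digit and matches the files numbered 15000 through 18999 (to stop at exactly 18000, combine 1[5-7][0-9][0-9][0-9].* with 18000.*):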
'1[5-8][0-9][0-9][0-9].*'
Under the covers, glob uses fnmatch, which translates the pattern to a regular expression. Your pattern translates to:
>>> import fnmatch
>>> fnmatch.translate('[15000-18000].*')
'[15000-18000]\\..*\\Z(?ms)'
which matches exactly one character before the . (a 0, 1, 5 or 8) and nothing else.
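The per-character behaviour can be checked without touching the file system by running both patterns through fnmatch.filter. This is only a small sketch; the sample names are made up for illustration and stand in for the real directory listing.

```python
import fnmatch

# Hypothetical sample of file names (not taken from the real directory).
names = ["0.json", "5.json", "50.json", "15000.json", "17999.json",
         "18000.json", "18999.json", "19000.json"]

# The original pattern: one character out of {0, 1, 5, 8}, then a dot.
wrong = fnmatch.filter(names, "[15000-18000].*")

# One character class per digit position.
right = fnmatch.filter(names, "1[5-8][0-9][0-9][0-9].*")

print(wrong)  # ['0.json', '5.json']
print(right)  # ['15000.json', '17999.json', '18000.json', '18999.json']
```

Note that 18999.json is matched by the second pattern, which is why a glob alone cannot stop cleanly at 18000.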
glob patterns are quite limited, and matching numeric ranges with them is not easy; you'd have to create separate globs for each range, for example glob('1[8-9][0-9][0-9][0-9].*') + glob('2[0-9][0-9][0-9][0-9].*'), etc.
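For the follow-up about numbers greater than 18000, the separate-globs approach might look roughly like the sketch below. The helper name glob_many and the ABOVE_18000 list are made up for this example, and the patterns assume the .json extension and the upper bound of 20000 mentioned in the question.

```python
import glob
import os

def glob_many(directory, patterns):
    """Return the paths matched by any of the glob patterns.

    The patterns are assumed not to overlap, so no de-duplication is done.
    """
    matched = []
    for pattern in patterns:
        matched.extend(glob.glob(os.path.join(directory, pattern)))
    return matched

# Together these cover 18001 through 20000, one pattern per sub-range.
ABOVE_18000 = [
    "1800[1-9].json",          # 18001-18009
    "180[1-9][0-9].json",      # 18010-18099
    "18[1-9][0-9][0-9].json",  # 18100-18999
    "19[0-9][0-9][0-9].json",  # 19000-19999
    "20000.json",              # 20000
]
```

Calling glob_many("/Users/Chris/Dropbox", ABOVE_18000) would then return the matching paths in no particular order. Five patterns for one simple bound shows how quickly this gets unwieldy.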
Do your own filtering instead:
import os

directory = "/Users/Chris/Dropbox"
for filename in os.listdir(directory):
    basename, ext = os.path.splitext(filename)
    if ext != '.json':
        continue
    try:
        number = int(basename)
    except ValueError:
        continue  # not numeric
    if 15000 <= number <= 18000:  # use number > 18000 for the follow-up question
        # process file
        filename = os.path.join(directory, filename)
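The same loop can be wrapped in a small function so that both questions (15000 to 18000, and anything above 18000) use one code path. This is a sketch under the question's assumptions; the function name numbered_files and its parameters are made up here, not part of any library.

```python
import os

def numbered_files(directory, low, high=None, ext=".json"):
    """Return paths of files named <number><ext> with low <= number <= high.

    high=None means there is no upper bound. Results are sorted numerically.
    """
    selected = []
    for filename in os.listdir(directory):
        basename, file_ext = os.path.splitext(filename)
        if file_ext != ext:
            continue
        try:
            number = int(basename)
        except ValueError:
            continue  # int() cannot parse the name, so skip it
        if number >= low and (high is None or number <= high):
            selected.append((number, os.path.join(directory, filename)))
    # Sorting the (number, path) pairs orders the files by number.
    return [path for number, path in sorted(selected)]
```

With this, numbered_files(directory, 15000, 18000) covers the original range and numbered_files(directory, 18001) covers everything greater than 18000.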
Although it hardly counts as beautiful code, you could implement your own filtering as follows:
import os
import re

directory = "/Users/Chris/Dropbox"
all_files = os.listdir(directory)
read_files = [this_file for this_file in all_files
              if int(re.findall(r'\d+', this_file)[-1]) > 18000]
print(read_files)
The crucial line iterates over each file name in the directory (for this_file in all_files), pulls out the list of digit runs in that name (re.findall(r'\d+', this_file)), and includes the name in read_files if the last of these runs, as an integer, is greater than 18000.
This raises an IndexError for any file name that contains no digits (re.findall returns an empty list, so [-1] fails), so user beware.
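A more defensive variant of the same idea skips names without digits instead of failing on them. The helper names last_number and names_above are made up for this sketch.

```python
import re

def last_number(filename):
    """Return the last run of digits in filename as an int, or None if there is none."""
    runs = re.findall(r"\d+", filename)
    return int(runs[-1]) if runs else None

def names_above(filenames, threshold):
    """Keep the names whose last digit run is greater than threshold."""
    kept = []
    for name in filenames:
        number = last_number(name)
        if number is not None and number > threshold:
            kept.append(name)
    return kept
```

For example, names_above(os.listdir(directory), 18000) would replace the list comprehension. It still looks at the last digit run anywhere in the name, so a digit in the extension (as in song.mp3) counts as the file's number.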
Edit: I see the previous answer has been edited to include what looks like a much better thought-out way to do this.