edit: see the bottom for my eventual solution
I have a directory of ~12,700 text files.
They have names like this:
1 - Re/ Report Novenator public call for bury - by Lizbett on Thu, 10 Sep 2009.txt
Where the leading number increments with each file (e.g. the last file in the directory begins with "12,700 - ").
Unfortunately, the files are not timesorted, and I need them to be. Luckily, I have a separate CSV file where the ID numbers are mapped: e.g. the 1 in the example above should really be 25 (since there are 24 messages before it), 2 should really be 8, 3 should be 1, and so forth, like so:
OLD_FILEID TIMESORT_FILEID
21 0
23 1
24 2
25 3
I don't need to change anything in the file title except for this single leading number which I need to swap with its associated value. In my head, the way this would work is to open a file name, check the digits which appear before the dash, look them up in the CSV, replace them with the associated value, and then save the file with the adjusted title and go on to the next file.
What would be the best way to go about doing something like this? I'm a Python newbie but have played around enough to feel comfortable following most directions or suggestions. Thanks :)
e: following the instructions below as best I could I did this, which doesn't work, but I'm not sure why:
import os
import csv
import sys
#open and store the csv file
with open('timesortmap.csv','rb') as csvfile:
    timeReader = csv.reader(csvfile, delimiter = ',', quotechar='"')
#get the list of files
for filename in os.listdir('DiggOutput-TIMESORT/'):
    oldID = filename.split(' - ')[0]
    newFilename = filename.replace(oldID, timeReader[oldID],1)
    os.rename(oldID, newFilename)
The error I get is:
TypeError: '_csv.reader' object is not subscriptable
I am not using DictReader, but that's because when I use csv.reader and print the rows, it looks like this:
['12740', '12738']
['12742', '12739']
['12738', '12740']
['12737', '12741']
['12739', '12742']
And when I use DictReader it looks like this:
{'FILEID-TS': '12738', 'FILEID-OLD': '12740'}
{'FILEID-TS': '12739', 'FILEID-OLD': '12742'}
{'FILEID-TS': '12740', 'FILEID-OLD': '12738'}
{'FILEID-TS': '12741', 'FILEID-OLD': '12737'}
{'FILEID-TS': '12742', 'FILEID-OLD': '12739'}
And I get this error in terminal:
File "TimeSorter.py", line 16, in <module>
newFilename = filename.replace(oldID, timeReader[oldID],1)
AttributeError: DictReader instance has no attribute '__getitem__'
Here's what I ended up working out with friends, should anyone come looking for this:
This should really be very simple to do in Python just using the csv and os modules.
Python has a built-in dictionary type called dict that could be used to store the contents of the csv file in memory while you are processing. Basically, you would need to read the csv file using the csv module and convert each row into a dictionary entry, probably using the OLD_FILEID field as the key and the TIMESORT_FILEID field as the value.
You can then use os.listdir() to get the list of files and use a loop to get each file name in turn. (If you need to filter the list of file names to exclude some files, take a look at the glob module.) Inside your loop, you just need to extract the number associated with the file, for example by splitting the file name on " - " and keeping the first piece as file_number.
Then call os.rename(), passing in the old file name and the new file name. The new file name can be built with the string replace() method, using file_mapping[file_number] as the replacement and a count of 1, where file_mapping is the dictionary created from the csv file. This will replace only the first occurrence of file_number with the number from your mapping file.
Edit
As TheodrosZelleke points out, there is the potential to overwrite an existing file by literally following what I laid out above. Several possible strategies:
Use os.rename() to move the renamed versions of the files into a different directory (e.g. a subdirectory of the current directory or, even better, a temporary directory created using tempfile.mkdtemp()). Once all the files have been renamed, use os.rename() to move the files from the temporary directory back to the current directory.
Give the renamed files a temporary extension such as .tmp, assuming that the extension chosen will not cause other conflicts. Once all the renames are done, use a second loop to rename the files to exclude the .tmp extension.
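To tie the steps above together, here is a minimal Python 3 sketch of the dictionary lookup combined with the temporary-extension strategy. The function names (load_mapping, plan_renames, rename_all) are made up for illustration, and the sketch assumes the CSV has the old ID in the first column and the time-sorted ID in the second, that every new ID is unique, and that no file in the directory already ends in .tmp.

```python
import csv
import os


def load_mapping(csv_path):
    """Read a two-column CSV into a dict of old ID -> time-sorted ID."""
    file_mapping = {}
    with open(csv_path, newline="") as csvfile:
        for row in csv.reader(csvfile):
            # Skip blank lines and a non-numeric header row, if present.
            if len(row) < 2 or not row[0].strip().isdigit():
                continue
            file_mapping[row[0].strip()] = row[1].strip()
    return file_mapping


def plan_renames(filenames, file_mapping, sep=" - "):
    """Return (old_name, new_name) pairs for files whose leading ID is mapped."""
    plan = []
    for name in filenames:
        file_number = name.split(sep, 1)[0]
        if sep in name and file_number in file_mapping:
            # Replace only the first occurrence, i.e. the leading number.
            new_name = name.replace(file_number, file_mapping[file_number], 1)
            plan.append((name, new_name))
    return plan


def rename_all(directory, file_mapping, tmp_suffix=".tmp"):
    """Rename in two passes so no file overwrites one not yet renamed."""
    plan = plan_renames(os.listdir(directory), file_mapping)
    # Pass 1: move each file to its new name plus a temporary suffix.
    for old_name, new_name in plan:
        os.rename(os.path.join(directory, old_name),
                  os.path.join(directory, new_name + tmp_suffix))
    # Pass 2: strip the temporary suffix again.
    for _, new_name in plan:
        os.rename(os.path.join(directory, new_name + tmp_suffix),
                  os.path.join(directory, new_name))
    return plan
```

With the names from the question, the call would look something like rename_all('DiggOutput-TIMESORT/', load_mapping('timesortmap.csv')). Trying it on a copy of the directory first is probably wise, since a rename that goes wrong is awkward to undo.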