I have written a script which works, but I'm guessing it isn't the most efficient. What I need to do is the following:
- Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
- The files contain data such as ID, name, status, etc.
- Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv

# Open the old and new member lists, plus the output file for the results.
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

# Read both files completely into lists.
old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []
x = len(new)
i = 0
num = 0
# Write out any row from the new file that is not present in the old file.
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files into lists and then checking each entire row from the new file against every row in the old list.
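One thing I have experimented with, but not fully tested, is keeping the same whole-row comparison while storing the old rows in a set of tuples, so that checking whether a row already exists is a hash lookup instead of a scan of the whole list:

import csv

# Same logic as above, but the old rows are stored in a set of tuples so
# each membership test is a hash lookup rather than a scan of the list.
with open('old.csv', 'rb') as f:
    old_rows = set(tuple(row) for row in csv.reader(f))

with open('new.csv', 'rb') as f_in, open('NewAndUpdated.csv', 'wb') as f_out:
    fNewUpdate = csv.writer(f_out)
    for row in csv.reader(f_in):
        if tuple(row) not in old_rows:
            fNewUpdate.writerow(row)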
My question is: for what I am trying to do, is there a faster, more efficient way to process the two files and create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.
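For reference, another idea I have been toying with, but have not benchmarked, is to build a dictionary from the old file keyed on the unique ID (assuming the ID is the first column of each row, as in the sample below) and then stream the new file against it. That would also let me tell brand-new records apart from updated ones later if I need to:

import csv

# Sketch only: assumes the unique ID is the first column of each row.
# Build a lookup of the old file keyed by ID, then stream the new file
# and write a row whenever its ID is missing or any field has changed.
with open('old.csv', 'rb') as f:
    old_by_id = {row[0]: row for row in csv.reader(f)}

with open('new.csv', 'rb') as f_in, open('NewAndUpdated.csv', 'wb') as f_out:
    fNewUpdate = csv.writer(f_out)
    for row in csv.reader(f_in):
        if old_by_id.get(row[0]) != row:
            fNewUpdate.writerow(row)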
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A