Optimize Python file comparison script

Posted 2019-07-29 21:49

Question:

I have written a script that works, but I'm guessing it isn't the most efficient. What I need to do is the following:

  • Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
  • The files contain data such as ID, name, status, and so on.
  • Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.

Here is the code I have written so far:

import csv

# Open the inputs and the output in text mode; in Python 3, csv
# files should be opened with newline=''.
fileAin = open('old.csv', 'r', newline='')
fOld = csv.reader(fileAin)

fileBin = open('new.csv', 'r', newline='')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv', 'w', newline='')
fNewUpdate = csv.writer(fileCout)

# Read both files fully into memory as lists of rows.
old = [row for row in fOld]
new = [row for row in fNew]

# For every row in the new file, scan the entire 'old' list to
# decide whether it is new or changed, and write it out if so.
for row in new:
    if row not in old:
        fNewUpdate.writerow(row)

fileAin.close()
fileBin.close()
fileCout.close()

In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records, and it takes hours to complete. I'm guessing the problem lies in reading both files into lists and comparing each whole row of data at a time.

My question is: for what I am trying to do, is there a faster, more efficient way to process the two files and create the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.

Thanks in advance for any help.

UPDATE to include sample row of data:

123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A

Answer 1:

How about something like this? One of the biggest inefficiencies in your code is checking whether each new row is in old: because old is a list, every membership test has to scan the entire list, so the total work grows with the product of the two file sizes. A dictionary lookup is effectively constant time on average, so keying both files by the unique ID is much, much faster.

import csv

fileAin = open('old.csv', 'r', newline='')
fOld = csv.reader(fileAin)

fileBin = open('new.csv', 'r', newline='')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv', 'w', newline='')
fNewUpdate = csv.writer(fileCout)

# Map each unique ID (the first column) to the rest of its row.
old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

# Keep rows whose ID is missing from the old file, or whose
# remaining fields have changed. Both checks are dictionary
# lookups, so each row is handled in constant time on average.
output = {}
for row_id in new:
    if row_id not in old or old[row_id] != new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])

fileCout.close()
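
As a side note, the intermediate output dictionary isn't strictly necessary: each qualifying row could be written out with writerow() as soon as it is found. Collecting the results first just keeps the comparison step and the writing step separate.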


Answer 2:

difflib in the standard library is quite efficient at this kind of sequence comparison: http://docs.python.org/library/difflib.html
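
For example, a minimal sketch of how difflib might be applied here, treating each file as a sequence of text lines. Note the assumptions: this compares lines positionally rather than matching records by ID, so it only behaves like the ID-based approach if the two files keep a comparable row order.

import difflib

# Read both files as plain lines of text.
with open('old.csv') as f_old, open('new.csv') as f_new:
    old_lines = f_old.readlines()
    new_lines = f_new.readlines()

# Keep only the lines that were added or changed in the new file.
with open('NewAndUpdated.csv', 'w') as f_out:
    for line in difflib.unified_diff(old_lines, new_lines):
        # Added/changed lines start with '+'; skip the '+++' file header.
        if line.startswith('+') and not line.startswith('+++'):
            f_out.write(line[1:])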



Answer 3:

Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:

http://en.wikipedia.org/wiki/Merge_sort
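
A minimal sketch of that merge-style pass is below. It assumes both files have already been sorted by the ID in the first column and that the IDs compare consistently as strings (e.g. fixed width; otherwise sort and compare them as integers). The diff_sorted helper and the file names are just for illustration. The advantage is that neither file has to fit in memory: both are streamed in a single pass.

import csv

def diff_sorted(old_path, new_path, out_path):
    with open(old_path, newline='') as f_old, \
         open(new_path, newline='') as f_new, \
         open(out_path, 'w', newline='') as f_out:
        old_rows = csv.reader(f_old)
        new_rows = csv.reader(f_new)
        writer = csv.writer(f_out)

        old_row = next(old_rows, None)
        for new_row in new_rows:
            # Advance through the old file until its ID catches up
            # with the current new ID.
            while old_row is not None and old_row[0] < new_row[0]:
                old_row = next(old_rows, None)
            # Write the row if its ID is absent from the old file,
            # or present but with different field values.
            if old_row is None or old_row != new_row:
                writer.writerow(new_row)

# Example usage, assuming both inputs were sorted beforehand:
# diff_sorted('old_sorted.csv', 'new_sorted.csv', 'NewAndUpdated.csv')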