Optimize Python file comparison script

Posted 2019-07-29 21:20

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:

  • Compare two CSV files that contain user information. It's essentially a member list where one file is a more updated version of the other.
  • The files contain data such as ID, name, status, etc.
  • Write to a third CSV file ONLY the records in the new file that either don't exist in the older file or contain updated information. Each record has a unique ID that lets me determine whether it is new or previously existed.

Here is the code I have written so far:

import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)

fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

old = []
new = []

for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)

output = []

x = len(new)
i = 0
num = 0

while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])

    num += 1
    i += 1

fileAin.close()
fileBin.close()
fileCout.close()

In terms of functionality, this script works. However, I'm trying to run it on files that contain hundreds of thousands of records, and it takes hours to complete. I'm guessing the problem lies with reading both files into lists and then comparing every row of the new file against the entire old list.

My question is: for what I am trying to do, is there a faster, more efficient way to process the two files and produce the third file containing only new and updated records? I don't really have a target time; I mostly want to understand whether there are better ways in Python to process these files.

Thanks in advance for any help.

UPDATE to include a sample row of data:

123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A

3 Answers
我欲成王，谁敢阻挡
#2 · 2019-07-29 22:23

How about something like this? The biggest inefficiency in your code is the new[num] not in old check: because old is a list, every check scans the whole list, so the total work grows quadratically with the number of records. A dictionary keyed on the unique ID makes each lookup O(1) on average, which brings the comparison down to roughly linear time.

import csv

fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)

fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)

fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)

# Key each file's rows by the unique ID in the first column;
# the remaining columns become the value.
old = {row[0]: row[1:] for row in fOld}
new = {row[0]: row[1:] for row in fNew}
fileAin.close()
fileBin.close()

output = {}

# Keep a row if its ID is brand new, or if the ID exists but the data changed.
for row_id in new:
    if row_id not in old or old[row_id] != new[row_id]:
        output[row_id] = new[row_id]

for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])


fileCout.close()
淡お忘
#3 · 2019-07-29 22:24
Animai°情兽
#4 · 2019-07-29 22:24

Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:

http://en.wikipedia.org/wiki/Merge_sort
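
For example, here is a minimal sketch of that merge step. It assumes both files have already been sorted by the ID column (the old_sorted.csv / new_sorted.csv names are placeholders), that IDs are unique within each file and fixed-width so string comparison matches numeric order, and it keeps the Python 2 style file handling used elsewhere in this thread:

import csv

def next_or_none(reader):
    # Return the next row, or None once the reader is exhausted.
    try:
        return next(reader)
    except StopIteration:
        return None

# Assumes both files were pre-sorted by ID (e.g. with an external sort).
fileAin = open('old_sorted.csv', 'rb')
fileBin = open('new_sorted.csv', 'rb')
fileCout = open('NewAndUpdated.csv', 'wb')

fOld = csv.reader(fileAin)
fNew = csv.reader(fileBin)
fNewUpdate = csv.writer(fileCout)

rowOld = next_or_none(fOld)
rowNew = next_or_none(fNew)

while rowNew is not None:
    if rowOld is None or rowNew[0] < rowOld[0]:
        # ID only exists in the new file: a brand-new record.
        fNewUpdate.writerow(rowNew)
        rowNew = next_or_none(fNew)
    elif rowNew[0] > rowOld[0]:
        # ID only exists in the old file: record was removed, skip it.
        rowOld = next_or_none(fOld)
    else:
        # Same ID in both files: write only if the data changed.
        if rowNew != rowOld:
            fNewUpdate.writerow(rowNew)
        rowOld = next_or_none(fOld)
        rowNew = next_or_none(fNew)

fileAin.close()
fileBin.close()
fileCout.close()

Because both files are streamed one row at a time, memory use stays constant no matter how large they are; the cost moves into the up-front sort.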
