Comparing multiple csv files and finding matches

2019-09-09 09:56发布

问题:

I have two folders with csv files. A group of 'master' files and a group of 'unmatched' files. Within the master files (~25 files, about 50,000 lines in total), there are unique ids. Each row of the unmatched files (~250 files, about 700,000 lines in total) should have an id in the row that matches a single id in one of the master files. Within each of the unmatched files, all id's should match with a single master file. Further, all ids in the unmatched should fall within a single master.

Unfortunately, the columns aren't always consistent, and the id field may appear in row[2] or row[155]. (I'm using python for this) I was originally using set.intersection and finding matched instances where the length > 5 (there are missing values marked with a '.' or just a blank that I wanted to avoid.) but quickly learned the runtime is far too long. Generally speaking, I need to match the 'unmatched' files with its 'master' file, and I'd like to have the column index from the 'unmatched' file with the id used. So if the unmatched file unmatched_a has ids that mostly fall under master_d, and the matching column in unmatched_a is column 35, it would return a line:

unmatched_a, master_d, 35

Apologies if this is unclear -- I'd be happy to try and clarify if need be. First post on stackoverflow. I can post the code I've so far, but I don't think it would be useful because the issue is with my method of comparing multiple (relatively) large csv files. I saw a lot of posts comparing two csv files or files where the index_id is known, but nothing with multiple files and multiple files with potential matches.

回答1:

You'll have to start by reading all the master files into memory -- this is unavoidable, since the matching ids could be anywhere in the master files.

Then, for each unmatched file, you can read the first record and find its id (giving you the id column), then find the master file containing that id (giving you the matched master file). According to your description, once you've matched the first record, all the rest of the ids will be in the same file, so you're done.

Read the ids into a set -- checking membership is O(1). Put each set into a dict keyed to the name of the master_file. Iterating over the dict of masters is O(n). So this is O(nm) for the number of master files and the number of unmatched files.

import csv

def read_master_file(master_file):
    with open(master_file, "r") as file:
        reader = csv.reader(file)
        ids = set(line[0] for line in file) # I assumed the id is the first value in each row in the master files. Depending on the file format you will want to change this.
    return ids

def load_master_files(file_list):
    return {file: read_master_file(file) for file in file_list}

def check_unmatched_file(unmatched_file, master_dict):
    with open(unmatched_file, "r") as file:
        reader = csv.reader(file)
        record = next(reader)
    for id_column in [2, 155]: # if you can identify an id by semantics, rather than by attempting to match it against the masters, you can reduce running time by 25% by finding the id before this step
        id = record[id_column]
        for master in master_dict:
            if id in master_dict[master]:
                return unmatched_file, master, id
    return None # id not in any master. Feel free to return or raise whatever works best for you