Comparing multiple CSV files and finding matches

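I have two folders of CSV files: a set of "master" files and a set of "unmatched" files. Within the master files (~25 files, about 50,000 lines in total) there are unique IDs. Each line of the unmatched files (~250 files, about 700,000 lines in total) should have an ID that matches a single ID in one of the master files. Within each unmatched file, all of the IDs should match a single master file. In other words, all IDs in an unmatched file should fall within one master.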

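Unfortunately, the columns are not always consistent, and the id field might appear in row[2] or row[155]. (I'm using Python for this.) I originally used set.intersection and looked for matches where the length was > 5 (missing values are marked with "" or are just a blank, which I wanted to avoid), but I quickly learned that the run time was far too long. Generally speaking, I need to match each "unmatched" file with its "master" file, and I'd like to get the index of the column in the "unmatched" file that holds the ID. So if the unmatched file unmatched_a has IDs that mostly fall under master_d, and the matching column in unmatched_a is column 35, it would return a line: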

unmatched_a, master_d, 35

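Apologies if this is unclear; I'm happy to clarify if needed. This is my first post on Stack Overflow. I can post the code I have so far, but I don't think it would be useful, because the problem is my approach to comparing multiple (relatively) large CSV files. I've seen a lot of posts that compare two CSV files, or files where the index_id is known, but nothing that deals with multiple files and multiple possible matching files.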

You have to start by reading all of the master files into memory. This is unavoidable, since a matching ID could be anywhere in the master files.

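Then, for each unmatched file, you can read the first record and find its ID (which gives you the id column), then find the master file that contains that ID (which gives you the matching master file). According to your description, once you have matched the first record, all of the remaining IDs will be in the same master file, so you're done.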

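Read the IDs into a set; checking membership is O(1). Put each set into a dictionary keyed by the name of the master file. Iterating over the dictionary of masters is O(n). So, with n master files and m unmatched files, this is O(nm) overall.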
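
To make the data structure concrete, here is a minimal sketch of the dictionary-of-sets lookup described above. The file names and IDs are made up for illustration, and find_master is a hypothetical helper name, not part of any library.

```python
# Hypothetical data: each master file name maps to the set of IDs it contains.
masters = {
    "master_a.csv": {"A0001", "A0002", "A0003"},
    "master_d.csv": {"D0001", "D0002"},
}

def find_master(candidate_id, masters):
    """Return the name of the master whose ID set contains candidate_id, else None."""
    for name, ids in masters.items():  # O(n) over the master files
        if candidate_id in ids:        # O(1) average-case set membership
            return name
    return None

print(find_master("D0002", masters))  # master_d.csv
print(find_master("", masters))       # None
```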

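import csv

def read_master_file(master_file):
    # newline="" is what the csv module recommends for files passed to csv.reader
    with open(master_file, "r", newline="") as file:
        reader = csv.reader(file)
        # Iterate over the parsed rows (reader), not the raw file object.
        # I assumed the id is the first value in each row in the master files.
        # Depending on the file format you will want to change this index.
        ids = set(row[0] for row in reader if row)  # blank rows are skipped
    return ids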

def load_master_files(file_list):
    return {file: read_master_file(file) for file in file_list}

def check_unmatched_file(unmatched_file, master_dict):
    with open(unmatched_file, "r", newline="") as file:
        reader = csv.reader(file)
        record = next(reader, None)  # None if the file is empty
    if record is None:
        return None
    # If you can identify an id by semantics, rather than by attempting to match
    # it against the masters, you can reduce running time by 25% by finding the
    # id before this step.
    for id_column in [2, 155]:
        if id_column >= len(record):
            continue  # this row is too short to have this column
        candidate = record[id_column]
        for master, ids in master_dict.items():
            if candidate in ids:
                # report the column index where the id was found
                return unmatched_file, master, id_column
    return None # id not in any master. Feel free to return or raise whatever works best for you
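
The question mentions that missing values occur and that the id column is not known in advance, so relying on the first record alone may be fragile. The following is a sketch of one possible variation, not the approach above: it inverts the masters into a single id-to-master dictionary and takes a majority vote over the first few rows and all columns. The names build_id_index and guess_master_and_column, and the sample_rows and min_len parameters, are hypothetical; min_len=6 mirrors the "length > 5" filter from the question, and IDs are assumed to be unique across masters.

```python
import csv
from collections import Counter
from itertools import islice

def build_id_index(master_dict):
    """Invert {master_name: set_of_ids} into {id: master_name}.

    Assumes each ID appears in only one master, as the question states.
    """
    return {id_: name for name, ids in master_dict.items() for id_ in ids}

def guess_master_and_column(unmatched_file, id_index, sample_rows=20, min_len=6):
    """Vote over the first sample_rows rows; return (file, master, column) or None."""
    votes = Counter()
    with open(unmatched_file, "r", newline="") as f:
        for row in islice(csv.reader(f), sample_rows):
            for column, value in enumerate(row):
                # Skip blanks and short missing-value markers, then look the
                # value up in the combined index (O(1) on average).
                if len(value) >= min_len and value in id_index:
                    votes[(id_index[value], column)] += 1
    if not votes:
        return None  # no sampled value matched any master
    (master, column), _count = votes.most_common(1)[0]
    return unmatched_file, master, column
```

With a combined index, each lookup no longer loops over the masters, at the cost of holding one larger dictionary in memory. Each returned tuple can be written out with csv.writer to produce lines of the form unmatched_a, master_d, 35.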