Comparing the first columns in two csv files using python and printing matches

Posted 2019-10-21 09:04

I have two CSV files, each containing n-grams, that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

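That is a three-word phrase followed by a frequency count, followed by a relative frequency.

I would like to write a script that finds the n-grams that occur in both CSV files, divides their relative frequencies, and prints the result to a new CSV file. It should find a match whenever the three-word phrase matches a three-word phrase in the other file, and then divide the relative frequency of the phrase in the first CSV file by the relative frequency of the same phrase in the second CSV file. Then I would like to print the phrase and the quotient of the two relative frequencies to a new CSV file.

Below is as far as I have gotten. My script compares lines, but it only finds a match when the entire line (including the frequency and the relative frequency) matches exactly. I realize that is because I am finding the intersection of two sets of whole rows, but I have no idea how to do this differently. Please forgive me; I am new to coding. Any help that gets me a little closer would be a big help.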

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)

Answer 1:

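This version does not dump res into a new file (that part is tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Use a dict instead of a set to do the matching and the mapping together.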

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
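
The step this answer leaves out, dumping res to a file, could look roughly like the sketch below. It is written for Python 3, not the Python 2 used in the question; the helper name write_ratios and the sample values in res are made up for illustration, and io.StringIO stands in for a real output file.

```python
import csv
import io

# Hypothetical stand-in for the `res` dict built above (phrase -> ratio).
res = {"and since that": 2.0, "the state face": 3.0}

def write_ratios(ratios, out):
    """Write one `phrase,ratio` row per entry to the open text file `out`."""
    writer = csv.writer(out, lineterminator="\n")
    for phrase, ratio in sorted(ratios.items()):
        writer.writerow([phrase, ratio])

# io.StringIO stands in for a real file here. With a real file in Python 3,
# open it as open("matchedngrams.csv", "w", newline="").
buf = io.StringIO()
write_ratios(res, buf)
print(buf.getvalue())
```

Sorting the items is only there to make the output order predictable; drop it if the order does not matter.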


Answer 2:

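You can store the relative frequencies from the first file in a dictionary, then iterate over the second file and, whenever its first column matches a phrase seen in the first file, write the result straight to the output file: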

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))


Answer 3:

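"My script compares lines, but it only finds a match when the entire line (including the frequency and the relative frequency) matches exactly. I realize that is because I am finding the intersection of two sets of whole rows, but I have no idea how to do this differently."

This is exactly what dictionaries are for: when you have a separate key and value (or when only part of the value is the key). So: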

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

Now, you can't use set methods directly on dictionaries. Python 3 provides some help here, but you are using 2.7. So you have to write it out explicitly:

matches = {key for key in a_dict if key in b_dict}

Or:

matches = set(a_dict) & set(b_dict)
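
For comparison, here is a small sketch of the Python 3 help mentioned above: in Python 3, dict.keys() returns a view that supports set operators directly. The two dicts are made-up sample data in the same shape as a_dict and b_dict.

```python
# Python 3 only: keys() returns a set-like view, so & works without set().
a_dict = {
    "and since that": ["and since that", "6", "4.306458032651349e-08"],
    "the state face": ["the state face", "3", "2.153229016325675e-08"],
}
b_dict = {
    "the state face": ["the state face", "1", "7.1774300544189156e-09"],
    "some silly ngram": ["some silly ngram", "99", "1.235492312e-09"],
}

matches = a_dict.keys() & b_dict.keys()
print(matches)  # {'the state face'}
```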

But you don't really need the set; all you want to do is iterate over the matches. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])

As a side note, you don't need to build the lists in the first place just to turn them into sets or dicts. Build the sets or dicts directly:

a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know about comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now pick ONE of these; the reader can only be iterated once
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
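
One caveat about the snippet above: a csv.reader is an iterator, so only the first of those three lines would see any rows if all of them ran in sequence. A small Python 3 sketch of that behaviour, with io.StringIO and made-up rows standing in for ngrams.csv:

```python
import csv
import io

data = "and since that,6,4.3e-8\nthe state face,3,2.1e-8\n"
reader = csv.reader(io.StringIO(data))

a_list = list(reader)    # consumes every row
leftover = list(reader)  # the reader is now exhausted, so this is empty

# To get several structures from one pass, derive them from the list.
a_set = {tuple(row) for row in a_list}
a_dict = {row[0]: row for row in a_list}
print(len(a_list), leftover, sorted(a_dict))
```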


Answer 4:

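Avoid storing small numbers as they are, because they run into underflow problems (see "What are arithmetic underflow and overflow in C?"). Dividing one small number by another will give you even more underflow problems, so preprocess your relative frequencies like this: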

>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08

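Now on to reading the csv. Since you are only manipulating the relative frequencies, your 2nd column doesn't matter, so let's skip it and save the first column (i.e. the phrase) as the key and the third column (i.e. the relative frequency) as the value: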

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))
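
The two reading loops above are identical apart from the file name, so they could be folded into one helper. The following is only a sketch in Python 3: the name load_log_relfreqs is hypothetical, io.StringIO stands in for an opened file, and skipinitialspace=True is used because the dummy data has a blank after each comma.

```python
import csv
import io
import math

def load_log_relfreqs(lines):
    """Map each phrase to the natural log of its relative frequency."""
    table = {}
    # skipinitialspace drops the blank that follows each comma in the dummy data.
    for phrase, _count, rel in csv.reader(lines, skipinitialspace=True):
        table[phrase] = math.log(float(rel))
    return table

sample = io.StringIO(
    "and since that, 3, 2.1532290163256747e-08\n"
    "the state face, 1, 7.1774300544189156e-09\n"
)
ngramdict = load_log_relfreqs(sample)
print(sorted(ngramdict))
```

With real files, the call would be something like load_log_relfreqs(open('ngrams-1.csv', newline='')) inside a with block.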

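Now for the tricky part: dividing the relative frequency of each phrase in ngramdict2 by the relative frequency of the same phrase in ngramdict1, i.e.: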

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1

Since we kept the relative frequencies in logarithmic units, we don't have to divide; we simply subtract, i.e.

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
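
The subtraction works because log(b / a) = log(b) - log(a). A quick numeric check of that identity, using the two relative frequencies of "drinks while strutting" from the dummy data (the variable names are only for illustration):

```python
import math

rel1 = 1.435486010883783160220299732E-8  # "drinks while strutting" in file 1
rel2 = 7.1774300544189156e-09             # the same phrase in file 2

log_ratio = math.log(rel2) - math.log(rel1)  # subtraction in log space
direct_ratio = rel2 / rel1                   # plain division
recovered = math.exp(log_ratio)              # back from log space

# The two routes agree up to floating-point rounding.
print(log_ratio, direct_ratio, recovered)
```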

And to get the phrases that occur in both, you won't need to check the phrases one by one. Simply cast dictionary.keys() into a set and then do set1.intersection(set2); see https://docs.python.org/2/tutorial/datastructures.html

phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

print overlap_phrases

[OUT]:

set(['drinks while strutting', 'the state face', 'and since that'])

Now let's print it out with the relative frequencies:

with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

ngramcombined.csv looks like this:

drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
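
These values are log ratios (file 2 over file 1), not the plain quotient the question asks for. If the plain quotient is wanted, math.exp turns them back; the sketch below (Python 3) reads the three rows shown above from an io.StringIO stand-in. Negating the log value before calling math.exp would give the opposite direction, file 1 over file 2.

```python
import csv
import io
import math

# The three rows shown above, standing in for ngramcombined.csv.
combined = io.StringIO(
    "drinks while strutting,-0.69314718056\n"
    "the state face,-1.09861228867\n"
    "and since that,-0.69314718056\n"
)

# exp(log ratio) recovers the plain ratio (file 2 / file 1).
ratios = {
    phrase: math.exp(float(log_ratio))
    for phrase, log_ratio in csv.reader(combined)
}
for phrase, ratio in ratios.items():
    print(phrase, round(ratio, 4))
```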

Here's the full code:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))


# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

If you like SUPER UNREADABLE but short code (in number of lines):

import csv, math
# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])])+ '\n')


Source: Comparing the first columns in two csv files using python and printing matches