I have two CSV files containing n-grams, which look like this:
drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8
That is a three-word phrase followed by a frequency count, followed by a relative frequency figure.

I want to write a script that finds the n-grams that appear in both CSV files, divides their relative frequencies, and prints the result to a new CSV file. It should count a match whenever a three-word phrase in one file matches a three-word phrase in the other file; it should then divide the relative frequency of that phrase in the second CSV file by the relative frequency of the same phrase in the first CSV file. Finally, it should print the phrase and the quotient of the two relative frequencies to a new CSV file.

Below is what I've got so far. My script compares lines, but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that's because I'm finding the intersection between the two complete sets, but I don't know how to do this differently. Please forgive me; I'm new to coding. Any help you can give to get me a little bit closer would be much appreciated.
import csv
import io

alist, blist = [], []
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)
first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))
matches = set(first_set).intersection(secnd_set)
c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)
print matches
print len(matches)
Without dumping res to a new file (tedious). The idea is that the first element is the phrase and the other two are the frequencies. Use a dict instead of a set to do the matching and map them together.
import csv
import io

alist, blist = [], []
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]: e[1:] for e in alist}
s_dict = {e[0]: e[1:] for e in blist}
res = {}
for k, v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1]) / float(s_dict[k][1])
print(res)
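To finish the job and actually dump `res` to a new file, feed it to `csv.writer` row by row. A minimal, self-contained sketch (the toy `res` dict here stands in for the one built by the loop above; swap the `StringIO` for a real `open(...)` call):

```python
import csv
import io

# toy stand-in for the `res` dict built by the matching loop above
res = {"drinks while strutting": 0.5, "and since that": 2.0}

# io.StringIO keeps the example self-contained;
# use open("matchedngrams.csv", "w") for a real file
out = io.StringIO()
writer = csv.writer(out)
for phrase, ratio in res.items():
    writer.writerow([phrase, ratio])

print(out.getvalue())
```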
You can load the first file into a dictionary storing the relative frequencies, then iterate over the second file and, whenever the first column matches one seen in the first file, write the result directly to the output file:
import csv

tmp = {}
# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)
    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here;
                # 2nd file's relative frequency divided by the 1st file's,
                # as asked in the question:
                writer.writerow((txt, 0, float(rel) / tmp[txt]))
My script compares lines, but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that's because I'm finding the intersection between the two complete sets, but I don't know how to do this differently.
This is what dictionaries are for: when you have a separate key and value (or when only part of the value is the key). So:
a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}
Now, you can't use the set methods directly on dictionaries. Python 3 offers some help here, but you're using 2.7. So you have to write it explicitly:
matches = {key for key in a_dict if key in b_dict}
Or:
matches = set(a_dict) & set(b_dict)
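(For reference, the help Python 3 provides: there, `dict.keys()` returns a set-like view, so the set operators work on dictionaries directly. A small sketch with toy dicts, Python 3 only; the 2.7 spelling would be `a_dict.viewkeys() & b_dict.viewkeys()`:)

```python
# toy dicts keyed by phrase, shaped like the a_dict/b_dict above
a_dict = {"a b c": ["a b c", "4", "1e-8"], "d e f": ["d e f", "2", "2e-8"]}
b_dict = {"a b c": ["a b c", "3", "3e-8"], "x y z": ["x y z", "1", "1e-9"]}

# in Python 3, .keys() views support the set operators & | - directly
matches = a_dict.keys() & b_dict.keys()
print(matches)
```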
But you don't really need a set; all you want to do is iterate over the matches. So:
for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
As a side note, you don't really need to build the lists in the first place just to turn them into sets or dicts. Just build the sets or dicts directly:
a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row
Also, if you know about comprehensions, all three of these versions are crying out to be converted:
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any of these
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
Avoid storing small numbers as they are; you'll run into underflow problems (see What is arithmetic underflow and overflow in C?), and dividing one tiny number by another compounds the underflow problem. So preprocess the relative frequencies by taking their logarithms, like this:
>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08
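The payoff of working in the log domain is that division becomes subtraction, and the two routes agree. A quick self-contained check, using the numbers from the example files:

```python
import math

rel1 = 1.435486010883783160220299732e-8   # from the first file
rel2 = 7.1774300544189156e-09             # from the second file

# dividing in the linear domain vs. subtracting in the log domain
ratio_direct = rel2 / rel1
ratio_logged = math.exp(math.log(rel2) - math.log(rel1))

print(ratio_direct, ratio_logged)
```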
Now to reading the csv files. Since you're only manipulating the relative frequency, the 2nd column doesn't matter, so let's skip it and save the first column (i.e. the phrase) as the key and the third column (i.e. the relative frequency) as the value:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))
Now comes the tricky part, where you want to divide the relative frequency of a phrase from ngramdict2 by the relative frequency of the same phrase from ngramdict1, i.e.:
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1
Since we kept the relative frequencies in logarithmic units, we don't divide; we simply subtract, i.e.:
if phrase_from_ngramdict1 == phrase_from_ngramdict2:
    logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
And to get the phrases that occur in both, you don't need to check them one by one; just cast dictionary.keys() to a set and then do set1.intersection(set2); see https://docs.python.org/2/tutorial/datastructures.html
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)
print overlap_phrases
[OUT]:
set(['drinks while strutting', 'the state face', 'and since that'])
Now let's print it out together with the relative frequencies:
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
The ngramcombined.csv looks like this:
drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
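Note that the numbers in the output are log-ratios; if you ever need the plain ratio back, exponentiate. For instance, for the first row (sketch):

```python
import math

combined = -0.69314718056   # log(relfreq2) - log(relfreq1) from the first row above
ratio = math.exp(combined)  # recovers relfreq2 / relfreq1
print(ratio)
```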
Here's the full code:
import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)]) + '\n')
If you like SUPER UNREADABLE but short code (in no. of lines):
import csv, math

# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'
ngramdict1 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]: math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])]) + '\n')
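One robustness caveat about the manual `",".join(...)` in both versions: if a phrase ever contains a comma itself, the output line breaks. `csv.writer` quotes such fields automatically. A self-contained sketch with toy dicts, including a hypothetical comma-bearing phrase (swap the `StringIO` for a real file handle):

```python
import csv
import io
import math

# toy dicts of phrase -> log(relative frequency), shaped like those built above
ngramdict1 = {"drinks while strutting": math.log(1.435486010883783e-08),
              "hello, world": math.log(2e-08)}
ngramdict2 = {"drinks while strutting": math.log(7.1774300544189156e-09),
              "hello, world": math.log(1e-08)}

out = io.StringIO()  # use open('ngramcombined.csv', 'w') for a real file
writer = csv.writer(out)
for p in sorted(set(ngramdict1) & set(ngramdict2)):
    writer.writerow([p, ngramdict2[p] - ngramdict1[p]])

print(out.getvalue())
```

The comma-bearing phrase comes out as `"hello, world",...`, which a CSV reader will parse back into the original fields.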
Source: Comparing the first columns in two csv files using python and printing matches