Comparing two columns of a csv and outputting stri

Posted 2019-01-24 19:57

Question:

I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the two strings in each row. Then I want to write those ratios to another file.

The csv may look like this:

Column 1|Column 2 
tomato|tomatoe 
potato|potatao 
apple|appel 

I want the output file to show for each row, how similar the string in Column 1 is to Column 2. I am using difflib to output the ratio score.

This is the code I have so far:

import csv
import difflib

f = open('test.csv')

csf_f = csv.reader(f)

row_a = []
row_b = []

for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])

a = row_a
b = row_b

def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()

match_ratio = similar(a, b)

match_list = []
for row in match_ratio:
    match_list.append(row)

with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)

f.close()

I get the error:

Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable

I feel like I am not reading the column lists in correctly, or not passing them to the SequenceMatcher function properly.

Answer 1:

Here is another way to get this done, using pandas.

Suppose your CSV data looks like this:

Column 1,Column 2 
tomato,tomatoe 
potato,potatao 
apple,appel

CODE

import pandas as pd
import difflib as diff
# Read the CSV
df = pd.read_csv('datac.csv')
# Create a new column 'diff' holding the similarity ratio of the first two columns
# (.iloc selects by position, so the header names do not matter)
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
# Save the dataframe to CSV; other formats such as Excel or HTML are also available
df.to_csv('outdata.csv', index=False)

Result

Column 1,Column 2 ,diff
tomato,tomatoe ,0.923076923077
potato,potatao ,0.923076923077
apple,appel ,0.8


Answer 2:

The for loop you're setting up expects match_ratio to be an iterable such as a list, and judging by the error you're getting, that's not what you have. It looks like you're missing the first argument for difflib.SequenceMatcher, which should probably be None. See 6.3.1 here: https://docs.python.org/3/library/difflib.html

Without that first argument, your first list is taken as the isjunk parameter and your second list as sequence a, while b stays at its default of an empty string, so I think ratio() is simply returning 0.0. Even if you correct your SequenceMatcher call, you'll still be trying to iterate over the single float value that ratio returns. I think you need to call SequenceMatcher inside the loop, once for each pair of values you're comparing.

So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or if you'd prefer, since these can be passed as keyword arguments, you could do something like this: difflib.SequenceMatcher(a=a, b=b).
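As a rough illustration of the point above, here is a minimal sketch that contrasts the positional call from the question with a per-pair call. The lists col_a and col_b are hypothetical stand-ins for the two columns, filled with the sample values from the question.

```python
import difflib

# Hypothetical stand-ins for the two CSV columns (sample values from the question)
col_a = ["tomato", "potato", "apple"]
col_b = ["tomatoe", "potatao", "appel"]

# Positional misuse: col_a is bound to `isjunk`, col_b to `a`,
# and `b` keeps its default of "", so nothing can match.
wrong = difflib.SequenceMatcher(col_a, col_b).ratio()

# Per-pair comparison with isjunk=None: one ratio per row.
ratios = [difflib.SequenceMatcher(None, a, b).ratio()
          for a, b in zip(col_a, col_b)]

print(wrong)                          # 0.0
print([round(r, 3) for r in ratios])  # [0.923, 0.923, 0.8]
```

The second form yields a list, which is what the later for loop and writerows call need.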



Answer 3:

Your sample file looks like it contains markup tags. Assuming you are actually reading a CSV file, the error you are getting is because match_ratio is not an iterable datatype, it's a floating point number -- the return value of your function: similar(). In your code, the function call would have to be contained within a for loop to call it for each a, b string pair. Here's a working example I created that does away with the explicit for loops and uses a list comprehension instead:

import csv
from difflib import SequenceMatcher

path_in = 'csv1.csv'
path_out = 'csv2.csv'

# newline='' is what the csv module docs recommend for files passed to reader/writer
with open(path_in, 'r', newline='') as csv_file_in:
    csv_reader = csv.reader(csv_file_in)
    col_headers = next(csv_reader)  # consume the header row
    # One [col1, col2, ratio] entry per remaining row
    results = [[row[0],
                row[1],
                SequenceMatcher(None, row[0], row[1]).ratio()]
               for row in csv_reader]

with open(path_out, 'w', newline='') as csv_file_out:
    col_headers.append('Ratio')
    out_rows = [col_headers] + results
    writer = csv.writer(csv_file_out, delimiter=',')
    writer.writerows(out_rows)

In addition to the error you received you might also have run into a problem when instantiating the SequenceMatcher object -- its first parameter wasn't specified in your code. You can find more on list comprehensions and SequenceMatcher in the Python docs. Good luck in your future Python coding.



Answer 4:

You are most probably getting that error because row[0] or row[1] contains NaN values. Try forcing them to strings first with str(row[0]) and str(row[1]).
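A minimal sketch of that coercion follows, with safe_ratio as a hypothetical helper name. This mostly matters when the values come from something like pandas, where a missing cell is typically a float NaN. By default csv.reader yields strings already.

```python
import difflib

def safe_ratio(left, right):
    """Hypothetical helper: coerce both cells to str before comparing."""
    return difflib.SequenceMatcher(None, str(left), str(right)).ratio()

# Plain strings behave exactly as before.
print(safe_ratio("tomato", "tomatoe"))

# A float NaN is compared as the text "nan" instead of raising a TypeError.
print(safe_ratio(float("nan"), "apple"))
```

Note that a missing value is then scored as the literal text "nan", which may or may not be what you want. Filtering such rows out beforehand is another option.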



Answer 5:

You are getting the error because you are running SequenceMatcher on the lists of strings, rather than on the strings themselves. When you do this, you get back a single float value, rather than the list of ratio values I think you were expecting.

If I understand what you are trying to do, then you don't need to read in the rows first. You can simply find the diff ratio as you iterate through the rows.

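import csv
import difflib

match_list = []
with open('test.csv', newline='') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        # Score this row's pair; wrap the ratio in a list so writerows emits one row per value
        match_list.append([difflib.SequenceMatcher(a=row[0], b=row[1]).ratio()])

# newline='' avoids blank lines between rows on Windows
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)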
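If the file really is pipe-delimited, as the sample in the question suggests, the reader also needs delimiter='|'. Otherwise each line arrives as a single field and row[1] fails. The sketch below assumes that layout and uses an in-memory string in place of test.csv so it can run on its own. It also skips the header row and strips the trailing spaces seen in the sample.

```python
import csv
import difflib
import io

# In-memory stand-in for the question's pipe-delimited file (an assumption)
sample = "Column 1|Column 2\ntomato|tomatoe\npotato|potatao\napple|appel\n"

reader = csv.reader(io.StringIO(sample), delimiter="|")
header = next(reader)  # skip the header so it is not scored as data

rows_out = [header + ["Ratio"]]
for left, right in reader:
    # strip() guards against stray spaces around the values
    ratio = difflib.SequenceMatcher(None, left.strip(), right.strip()).ratio()
    rows_out.append([left, right, round(ratio, 4)])

# Write to an in-memory buffer; swap in open('output.csv', 'w', newline='') for a real file
out = io.StringIO()
csv.writer(out, delimiter="|", lineterminator="\n").writerows(rows_out)
print(out.getvalue())
```

Keeping both input columns next to the ratio makes the output easier to check by eye than a file of bare numbers.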