I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the strings in the two columns for each row. Then I want to write those ratios to another file.
The csv may look like this:
Column 1|Column 2
tomato|tomatoe
potato|potatao
apple|appel
I want the output file to show for each row, how similar the string in Column 1 is to Column 2. I am using difflib to output the ratio score.
This is the code I have so far:
import csv
import difflib
f = open('test.csv')
csf_f = csv.reader(f)
row_a = []
row_b = []
for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])
a = row_a
b = row_b
def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()
match_ratio = similar(a, b)
match_list = []
for row in match_ratio:
    match_list.append(row)
with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)
f.close()
I get the error:
Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable
I feel like I am not reading the columns into lists correctly before running them through the SequenceMatcher function.
Your sample file looks like it contains markup tags. Assuming you are actually reading a CSV file, the error occurs because match_ratio is not an iterable datatype; it is a floating point number, the return value of your function similar(). In your code, the function call would have to be inside a for loop so that it is called once for each a, b string pair. Here's a working example I created that does away with the explicit for loops and uses a list comprehension instead:
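A minimal sketch of a list-comprehension version, assuming Python 3, comma-separated data, and a header row as in the question. The io.StringIO objects are only stand-ins so the snippet runs on its own; with real files you would use open('test.csv', newline='') and open('output.csv', 'w', newline='') (text mode rather than 'wb').

```python
import csv
import difflib
import io

def similar(a, b):
    # isjunk is the first positional parameter, so pass None explicitly
    return difflib.SequenceMatcher(None, a, b).ratio()

# Stand-in for open('test.csv', newline=''); assumed comma-separated with a header
sample = io.StringIO(
    "Column 1,Column 2\n"
    "tomato,tomatoe\n"
    "potato,potatao\n"
    "apple,appel\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row

# One ratio per input row; each ratio is wrapped in a list because
# writerows() expects every row to be a sequence of fields
match_list = [[similar(row[0], row[1])] for row in reader]

out = io.StringIO()  # stand-in for open('output.csv', 'w', newline='')
csv.writer(out).writerows(match_list)
print(out.getvalue())
```

Wrapping each ratio in its own list matters: writerows() iterates over every row it is given, so a bare float would raise the same kind of "not iterable" error.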
In addition to the error you received, you might also have run into a problem when instantiating the SequenceMatcher object, because its first parameter (isjunk) wasn't specified in your code. You can find more on list comprehensions and SequenceMatcher in the Python docs. Good luck in your future Python coding.
You are getting the error because you are running SequenceMatcher on the lists of strings, rather than on the strings themselves. When you do this, you get back a single float value, rather than the list of ratio values I think you were expecting.
If I understand what you are trying to do, then you don't need to read in the rows first. You can simply find the diff ratio as you iterate through the rows.
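Something along these lines might work as a sketch of that idea. The function name write_ratios and its has_header flag are made up for illustration, and it assumes Python 3 and a comma-separated file:

```python
import csv
import difflib

def write_ratios(in_path, out_path, has_header=True):
    """Write column 1, column 2 and their similarity ratio for each input row.

    write_ratios and has_header are hypothetical names, not part of any library.
    """
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        if has_header:
            next(reader, None)  # drop the header so it is not scored as data
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or short lines
            # Compare the two strings of this row only
            ratio = difflib.SequenceMatcher(None, row[0], row[1]).ratio()
            writer.writerow([row[0], row[1], ratio])
```

Calling write_ratios('test.csv', 'output.csv') would then produce one output line per data row, with no intermediate lists to keep in sync.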
You are getting that error because the records row[0] or row[1] most probably contain NaN values. Try converting them to strings first with str(row[0]) and str(row[1]).
Here is another way to get this done using pandas:
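A rough sketch of what a pandas version could look like, assuming pandas is installed and the file has the "Column 1" and "Column 2" headers shown in the question. The io.StringIO object is a stand-in for the path 'test.csv' so the snippet runs without a file on disk.

```python
import difflib
import io

import pandas as pd

def similar(a, b):
    # str() guards against empty cells, which pandas reads as float NaN
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio()

# Stand-in for pd.read_csv('test.csv'); assumed comma-separated with a header
sample = io.StringIO(
    "Column 1,Column 2\n"
    "tomato,tomatoe\n"
    "potato,potatao\n"
    "apple,appel\n"
)
df = pd.read_csv(sample)

# Pair up the two columns row by row and score each pair
df["ratio"] = [similar(a, b) for a, b in zip(df["Column 1"], df["Column 2"])]

# With a real file: df.to_csv('output.csv', index=False)
csv_text = df.to_csv(index=False)
print(csv_text)
```

The new column name "ratio" is arbitrary; the output keeps both original columns next to their score.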
The for loop you're setting up here expects something like an array where you have match_ratio, and judging by the error you're getting, that's not what you have. It looks like you're missing the first argument for difflib.SequenceMatcher, which should probably be None. See 6.3.1 here: https://docs.python.org/3/library/difflib.html
Without that first argument specified, I think your first list is being taken as the isjunk argument, so ratio comes back as 0.0. Even if you correct your SequenceMatcher call, I think you'll still be trying to iterate on the single float value that ratio is returning. I think you need to call SequenceMatcher inside the loop for each set of values you're comparing.
So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or if you'd prefer, since these are named arguments, you could do something like this: difflib.SequenceMatcher(a=a, b=b).
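To make the difference between the call styles concrete, here is a small sketch using one pair of strings from the question. The variable names are just for illustration.

```python
import difflib

a, b = "tomato", "tomatoe"

# isjunk is the first parameter, so pass None explicitly...
positional = difflib.SequenceMatcher(None, a, b).ratio()

# ...or name the two sequences and let isjunk default to None
keyword = difflib.SequenceMatcher(a=a, b=b).ratio()

# Leaving out None shifts the arguments: "tomato" becomes isjunk and
# "tomatoe" is compared against the default empty string
shifted = difflib.SequenceMatcher(a, b).ratio()

print(positional, keyword, shifted)
```

The first two calls give the same ratio, while the shifted call silently scores the second string against nothing, which is why no error shows up at that point in the original code.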