I'd like to compute the similarity between two lists of various lengths.
eg:
listA = ['apple', 'orange', 'apple', 'apple', 'banana', 'orange'] # (length = 6)
listB = ['apple', 'orange', 'grapefruit', 'apple'] # (length = 4)
as you can see, a single item can appear multiple times in a list, and the lengths are of different sizes.
I've already thought of comparing the frequencies of each item, but that does not encompass the size of each list (a list that is simply twice another list should be similar, but not perfectly similar)
eg2:
listA = ['apple', 'apple', 'orange', 'orange']
listB = ['apple', 'orange']
similarity(listA, listB) # should NOT equal 1
So I basically want to encompass the size of the lists, and the distribution of items in the list.
Any ideas?
I believe what you are looking for is to counting the number of inversions in an array The question has your answer: Counting inversions in an array
Use
collections.Counter()
perhaps; those are multi-sets, or bags, in datatype parlance:Now you can compare these by entries or frequencies:
You can calculate their cosine similarity using:
Which gives:
The closer to 1 that value, the more similar the two lists are.
The cosine similarity is one score you can calculate. If you care about the length of the list, you can calculate another; if you keep that score between 0.0 and 1.0 as well you can multiply the two values for a final score between -1.0 and 1.0.
For example, to take relative lengths into account you could use:
and then combine into a function that takes the lists as inputs:
For your two example lists, that results in:
You can mix in other metrics as needed.
From a theoretical point of view : I recommend you look up cosine similarity http://en.wikipedia.org/wiki/Cosine_similarity
You may have to modify to fit your scheme, but the idea of cosine similarity is great.