I am using Python 2.7. I have routes, each composed of a list of nodes that connect to one another. The nodes are identified by string keys, but for readability I will use numbers here:
sample_route = [1, 2, 3, 4, 7]
# obviously over-simplified; real routes would be about 20-40 elements long
From each route I will create a set of tuple pairs of consecutive point connections using zip, which will end up like:
set([(1,2),(2,3),(3,4),(4,7)])
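A minimal sketch of that construction (in Python 2.7, zip returns a list, so this works directly):

def route_edges(route):
    # Pair each node with its successor to form the connection edges.
    return set(zip(route, route[1:]))

sample_route = [1, 2, 3, 4, 7]
print(route_edges(sample_route))
# set([(1, 2), (2, 3), (3, 4), (4, 7)])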
I will need a way to filter out routes that are very similar (differing by only one or two nodes) and keep the minimal route among those near-duplicates. My plan right now is to:

1. Start with the first (probably optimal) route.
2. Iterate through the rest of the routes, and use the following formula to calculate each one's similarity to the route from step 1:
matching = len(value1.symmetric_difference(value2))
# value1, value2 = the two compared edge sets; this equals
# len(value1.difference(value2)) + len(value2.difference(value1))
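A sketch of that scoring pass, assuming the edge sets live in a list; routes_as_sets is just a name I am making up here:

def similarity_score(value1, value2):
    # Count edges that appear in exactly one of the two routes;
    # 0 means identical edge sets, larger means less similar.
    return len(value1.symmetric_difference(value2))

baseline = routes_as_sets[0]  # the first (probably optimal) route
scores = [similarity_score(baseline, other) for other in routes_as_sets[1:]]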
The lower the number, the more similar the two routes. But what is a good way to group the routes based on their similarity to one another? The routes will all be different lengths. I have never taken a statistics course.
Example:
sets = [
    set([(1,2),(2,3),(3,4),(4,5),(5,10)]),
    set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
    set([(1,7),(7,3),(3,8),(8,7),(7,6),(6,5),(5,10)]),
    set([(1,2),(2,3),(3,4),(4,6),(6,5),(5,10)]),
    set([(1,9),(9,4),(4,5),(5,10)]),
    set([(1,9),(9,4),(4,6),(6,5),(5,10)])
]
In this example, the groupings might be something like [[1,2,4],[3],[5,6]], where routes 1, 2, and 4 are very similar, 5 and 6 are similar, and 3 is nowhere near any of the others. As examples, 1 compared to 2 would have a score of 3, and 3 compared to 6 would have a score of 8. This is the sort of data I am using (though these are easy-to-read simplifications).
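For illustration, a naive greedy pass along these lines does reproduce that grouping on the sample data; the threshold of 4 is just a guess on my part, and choosing it sensibly is exactly the kind of thing I do not know how to do:

def group_routes(edge_sets, threshold=4):
    # Greedy leader clustering: each route joins the first group whose
    # representative (the group's first member) is within the threshold,
    # otherwise it starts a new group. Groups hold 1-based indices.
    groups = []
    for i, current in enumerate(edge_sets, 1):
        for group in groups:
            leader = edge_sets[group[0] - 1]
            if len(current.symmetric_difference(leader)) <= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

print(group_routes(sets))
# [[1, 2, 4], [3], [5, 6]]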
There is a time benefit in this: if I can remove redundant routes, I will trim off a considerable amount of processing time.