Algorithm to select a best combination from two li

2019-06-10 01:55发布

问题:

I have a search result from two way flight. So, there are two lists that contain the departure flights and arrival flights such as:

  • The departure flights list has 20 flights.
  • The arrival flights list has 30 flights

So, I will have 600 (20*30) combination between departure flight and arrival flight. I will call the combination list is the result list

However, I just want to select a limitation from 600 combination. For instance, I will select the best of 100 flight combination. The criteria to combine the flights is the cheap price for departure and arrival flight.

To do that, I will sort the result list by the total price of departure and arrival flight. And I then pick up the first 100 elements from result list to get what I want.

But, if the departure flights list has 200 flights and arrival flights list has 300 flights, I will have the result list with 60.000 elements. For that reason, I will sort a list with 60.000 elements to find the best 100 elements.

So, there is any an algorithm to select the best combinations as my case.

Thank you so much.

回答1:

Not 100% clear from your question, but I understand that you are looking for a faster algorithm to find a certain number of best / cheapest combinations of departure and arrival flights.

You can do this much faster by sorting the lists of departure and arrival flights individually by cost and then using a heap for expanding the next-best combinations one-by-one until you have enough.

Here's the full algorithm -- in Python, but without using any special libraries, just standard data structures, so this should be easily transferable to any other language:

NUM_FLIGHTS, NUM_BEST = 1000, 100

# create test data: each entry corresponds to just the cost of one flight
from random import randint
dep = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)])
arr = sorted([randint(1, 100) for i in range(NUM_FLIGHTS)])
def is_compatible(i, j): # for checking constraints, e.g. timing of flights
    return True          # but for now, assume no constraints

# get best combination using sorted lists and heap
from heapq import heappush, heappop
heap = [(dep[0] + arr[0], 0, 0)]   # initial: best combination from dep and arr
result = []                        # the result list
visited = set()                    # make sure not to add combinations twice
while heap and len(result) < NUM_BEST:
    cost, i, j = heappop(heap)     # get next-best combination
    if (i, j) in visited: continue # did we see those before? skip
    visited.add((i, j))
    if is_compatible(i, j):        # if 'compatible', add to results
        result.append((cost, dep[i], arr[j]))
    # add 'adjacent' combinations to the heap
    if i < len(dep) - 1:           # next-best departure + same arrival
        heappush(heap, (dep[i+1] + arr[j], i+1, j))
    if j < len(arr) - 1:           # same departure + next-best arrival
        heappush(heap, (dep[i] + arr[j+1], i, j+1))
print result

# just for testing: compare to brute-force (get best from all combinations)
comb = [(d, a) for d in dep for a in arr]
best = sorted((d+a, d, a) for (d, a) in comb)[:NUM_BEST]
print best
print result == best # True -> same results as brute force (just faster)

This works roughly like this:

  • sort both the departure flights dep and the arrival flights arr by their cost
  • create a heap and put the best combination (best departure and best arrival) as well as the corresponding indices in their lists into the heap: (dep[0] + arr[0], 0, 0)
  • repeat until you have enough combinations or there are no more elements in the heap:
    • pop the best element from the heap (sorted by total cost)
    • if it satisfies the contraints, add it to the result set
    • make sure you do not add flights twice to the result set, using visited set
    • add the two 'adjacent' combinations to the heap, i.e. taking the same flight from dep and the next from arr, and the next from dep and the same from arr, i.e. (dep[i+1] + arr[j], i+1, j) and (dep[i] + arr[j+1], i, j+1)

Here's a very small worked example. The axes are (the costs of) the dep and arr flights, and the entries in the table are in the form n(c)m, where n is the iteration that entry was added to the heap (if it is at all), c is the cost, and m is the iteration it was added to the 'top 10' result list (if any).

dep\arr     1       3       4       6       7
   2      0(3)1   1(5)4   4(6)8   8(8)-     -
   2      1(3)2   2(5)6   6(6)9   9(8)-     -
   3      2(4)3   3(6)7   7(7)-     -       -
   4      3(5)5   5(7)-     -       -       -
   6      5(7)10    -       -       -       -

Result: (1,2), (1,2), (1,3), (3,2), (1,4), (3,2), (3,3), (2,4), (2,4), (1,6)

Note how the sums in both the columns and the rows of the matrix are always increasing, so the best results can always be found in a somewhat triangular area in the top-left. Now the idea is that if your currently best combination (the one that's first in the heap) is dep[i], arr[i], then there's no use in checking, e.g., combination dep[i+2], arr[i] before checking dep[i+1], arr[i], which must have a lower total cost, so you add dep[i+1], arr[i] (and likewise dep[i], arr[i+1]) to the heap, and repeat with popping the next element from the heap.


I compared the results of this algorithm to the results of your brute-force approach, and the resulting flights are the same, i.e. the algorithm works, and always yields the optimal result. Complexity should be O(n log(n)) for sorting the departure and arrival lists (n being the number of flights in those original lists), plus O(m log(m)) for the heap-loop (m iterations with log(m) work per iteration, m being the number of elements in the result list).

This finds the best 1,000 combinations of 100,000 departure and 100,000 arrival flights (for a total of 1,000,000,000,000 possible combinations) in less than one second.

Note that those numbers are for the case that you have no additional constraints, i.e. each departure flight can be combined with each arrival flight. If there are constraints, you can use the is_compatible function sketched in the above code to check those and to skip that pairing. This means, that for each incompatible pair with low total cost, the loop needs one additional iteration. This means that in the worst case, for example if there are no compatible pairs at all, or when the only compatible pairs are those with the highest total cost, the algorithm could in fact expand all the combination.

On average, though, this should not be the case, and the algorithm should perform rather quickly.



回答2:

I think the best solution would be using some SQL statements to do the Cartesian product. You can apply any kind of filters, based on the data itself, ordering, range selection, etc. Something like this:

SELECT d.time as dep_time, a.time as arr_time, d.price+a.price as total_price
FROM departures d, arrivals a
WHERE a.time > d.time + X
ORDER BY d.price+a.price
LIMIT 0,100

Actually X can be even 0, but arrival should happen AFTER the departure anyways.

Why I would choose SQL:

  • It's closest to the data itself, you don't have to query them
  • It's highly optimized, if you use indexes, I'm sure you can't beat its performance with your own code
  • It's simple and declarative :)