A faster way of generating combinations with a giv

2019-07-31 04:13发布

问题:

TL;DR: I want the exact behavior as filter ((== 4) . length) . subsequences. Just using subsequences also creates variable length of lists, which takes a lot of time to process. Since in the end only lists of length 4 are needed, I was thinking there must be a faster way.


I have a list of functions. The list has the type [Wor -> Wor]

The list looks something like this

[f1, f2, f3 .. fn]

What I want is a list of lists of n functions while preserving order like this

input : [f1, f2, f3 .. fn]

argument : 4 functions

output : A list of lists of 4 functions.

Expected output would be where if there's an f1 in the sublist, it'll always be at the head of the list.

If there's a f2 in the sublist and if the sublist doens't have f1, f2 would be at head. If fn is in the sublist, it'll be at last.

In general if there's a fx in the list, it never will be infront of f(x - 1) .

Basically preserving the main list's order when generating sublists.

It can be assumed that length of list will always be greater then given argument.

I'm just starting to learn Haskell so I haven't tried all that much but so far this is what I have tried is this:

Generation permutations with subsequences function and applying (filter (== 4) . length) on it seems to generate correct permutations -but it doesn't preserve order- (It preserves order, I was confusing it with my own function).

So what should I do?

Also if possible, is there a function or a combination of functions present in Hackage or Stackage which can do this? Because I would like to understand the source.

回答1:

You describe a nondeterministic take:

ndtake :: Int -> [a] -> [[a]]
ndtake 0 _      = [[]]
ndtake n []     = []
ndtake n (x:xs) = map (x:) (ndtake (n-1) xs) ++ ndtake n xs

Either we take an x, and have n-1 more to take from xs; or we don't take the x and have n more elements to take from xs.

Running:

> ndtake 3 [1..4]
[[1,2,3],[1,2,4],[1,3,4],[2,3,4]]

Update: you wanted efficiency. If we're sure the input list is finite, we can aim at stopping as soon as possible:

ndetake n xs = go (length xs) n xs
    where
    go spare n _  | n >  spare = []
    go spare n xs | n == spare = [xs]
    go spare 0 _      =  [[]]
    go spare n []     =  []
    go spare n (x:xs) =  map (x:) (go (spare-1) (n-1) xs) 
                            ++     go (spare-1)  n   xs

Trying it:

> length $ ndetake 443 [1..444]
444

The former version seems to be stuck on this input, but the latter one returns immediately.


But, it measures the length of the whole list, and needlessly so, as pointed out by @dfeuer in the comments. We can achieve the same improvement in efficiency while retaining a bit more laziness:

ndzetake :: Int -> [a] -> [[a]]
ndzetake n xs | n > 0 = 
    go n (length (take n xs) == n) (drop n xs) xs
    where
    go n b p ~(x:xs)
         | n == 0 = [[]]
         | not b  = []
         | null p = [(x:xs)]
         | otherwise = map (x:) (go (n-1) b p xs)
                          ++ go n b (tail p) xs

Now the last test also works instantly with this code as well.

There's still room for improvement here. Just as with the library function subsequences, the search space could be explored even more lazily. Right now we have

> take 9 $ ndzetake 3 [1..]
[[1,2,3],[1,2,4],[1,2,5],[1,2,6],[1,2,7],[1,2,8],[1,2,9],[1,2,10],[1,2,11]]

but it could be finding [2,3,4] before forcing the 5 out of the input list. Shall we leave it as an exercise?



回答2:

Here's the best I've been able to come up with. It answers the challenge Will Ness laid down to be as lazy as possible in the input. In particular, ndtake m ([1..n]++undefined) will produce as many entries as possible before throwing an exception. Furthermore, it strives to maximize sharing among the result lists (note the treatment of end in ndtakeEnding'). It avoids problems with badly balanced list appends using a difference list. This sequence-based version is considerably faster than any pure-list version I've come up with, but I haven't teased apart just why that is. I have the feeling it may be possible to do even better with a better understanding of just what's going on, but this seems to work pretty well.

Here's the general idea. Suppose we ask for ndtake 3 [1..5]. We first produce all the results ending in 3 (of which there is one). Then we produce all the results ending in 4. We do this by (essentially) calling ndtake 2 [1..3] and adding the 4 onto each result. We continue in this manner until we have no more elements.

import qualified Data.Sequence as S
import Data.Sequence (Seq, (|>))
import Data.Foldable (toList)

We will use the following simple utility function. It's almost the same as splitAtExactMay from the 'safe' package, but hopefully a bit easier to understand. For reasons I haven't investigated, letting this produce a result when its argument is negative leads to ndtake with a negative argument being equivalent to subsequences. If you want, you can easily change ndtake to do something else for negative arguments.

-- to return an empty list in the negative case.
splitAtMay :: Int -> [a] -> Maybe ([a], [a])
splitAtMay n xs
  | n <= 0 = Just ([], xs)
splitAtMay _ [] = Nothing
splitAtMay n (x : xs) = flip fmap (splitAtMay (n - 1) xs) $
  \(front, rear) -> (x : front, rear)

Now we really get started. ndtake is implemented using ndtakeEnding, which produces a sort of "difference list", allowing all the partial results to be concatenated cheaply.

ndtake :: Int -> [t] -> [[t]]
ndtake n xs = ndtakeEnding n xs []

ndtakeEnding :: Int -> [t] -> ([[t]] -> [[t]])
ndtakeEnding 0 _xs = ([]:)
ndtakeEnding n xs = case splitAtMay n xs of
    Nothing -> id -- Not enough elements
    Just (front, rear) ->
        (front :) . go rear (S.fromList front)
  where
    -- For each element, produce a list of all combinations
    -- *ending* with that element.
    go [] _front = id
    go (r : rs) front =
      ndtakeEnding' [r] (n - 1) front
        . go rs (front |> r)

ndtakeEnding doesn't call itself recursively. Rather, it calls ndtakeEnding' to calculate the combinations of the front part. ndtakeEnding' is very much like ndtakeEnding, but with a few differences:

  1. We use a Seq rather than a list to represent the input sequence. This lets us split and snoc cheaply, but I'm not yet sure why that seems to give amortized performance that is so much better in this case.
  2. We already know that the input sequence is long enough, so we don't need to check.
  3. We're passed a tail (end) to add to each result. This lets us share tails when possible. There are lots of opportunities for sharing tails, so this can be expected to be a substantial optimization.
  4. We use foldr rather than pattern matching. Doing this manually with pattern matching gives clearer code, but worse constant factors. That's because the :<|, and :|> patterns exported from Data.Sequence are non-trivial pattern synonyms that perform a bit of calculation, including amortized O(1) allocation, to build the tail or initial segment, whereas folds don't need to build those.

NB: this implementation of ndtakeEnding' works well for recent GHC and containers; it seems less efficient for earlier versions. That might be the work of Donnacha Kidney on foldr for Data.Sequence. In earlier versions, it might be more efficient to pattern match by hand, using viewl for versions that don't offer the pattern synonyms.

ndtakeEnding' :: [t] -> Int -> Seq t -> ([[t]] -> [[t]])
ndtakeEnding' end 0 _xs = (end:)
ndtakeEnding' end n xs = case S.splitAt n xs of
     (front, rear) ->
        ((toList front ++ end) :) . go rear front
  where
    go = foldr go' (const id) where
      go' r k !front = ndtakeEnding' (r : end) (n - 1) front . k (front |> r)
    -- With patterns, a bit less efficiently:
    -- go Empty _front = id
    -- go (r :<| rs) !front =
    --  ndtakeEnding' (r : end) (n - 1) front
    --    . go rs (front :|> r)