F#: How do i split up a sequence into a sequence o

2019-02-09 05:52发布

问题:

Background:

I have a sequence of contiguous, time-stamped data. The data-sequence has gaps in it where the data is not contiguous. I want create a method to split the sequence up into a sequence of sequences so that each subsequence contains contiguous data (split the input-sequence at the gaps).

Constraints:

  • The return value must be a sequence of sequences to ensure that elements are only produced as needed (cannot use list/array/cacheing)
  • The solution must NOT be O(n^2), probably ruling out a Seq.take - Seq.skip pattern (cf. Brian's post)
  • Bonus points for a functionally idiomatic approach (since I want to become more proficient at functional programming), but it's not a requirement.

Method signature

let groupContiguousDataPoints (timeBetweenContiguousDataPoints : TimeSpan) (dataPointsWithHoles : seq<DateTime * float>) : (seq<seq< DateTime * float >>)= ... 

On the face of it the problem looked trivial to me, but even employing Seq.pairwise, IEnumerator<_>, sequence comprehensions and yield statements, the solution eludes me. I am sure that this is because I still lack experience with combining F#-idioms, or possibly because there are some language-constructs that I have not yet been exposed to.

// Test data
let numbers = {1.0..1000.0}
let baseTime = DateTime.Now
let contiguousTimeStamps = seq { for n in numbers ->baseTime.AddMinutes(n)}

let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items

let timeBetweenContiguousValues = (new TimeSpan(0,1,0))

dataWithOccationalHoles |> groupContiguousDataPoints timeBetweenContiguousValues |> Seq.iteri (fun i sequence -> printfn "Group %d has %d data-points: Head: %f" i (Seq.length sequence) (snd(Seq.hd sequence)))

回答1:

I think this does what you want

dataWithOccationalHoles 
|> Seq.pairwise
|> Seq.map(fun ((time1,elem1),(time2,elem2)) -> if time2-time1 = timeBetweenContiguousValues then 0, ((time1,elem1),(time2,elem2)) else 1, ((time1,elem1),(time2,elem2)) )
|> Seq.scan(fun (indexres,(t1,e1),(t2,e2)) (index,((time1,elem1),(time2,elem2))) ->  (index+indexres,(time1,elem1),(time2,elem2))  ) (0,(baseTime,-1.0),(baseTime,-1.0))
|> Seq.map( fun (index,(time1,elem1),(time2,elem2)) -> index,(time2,elem2) )
|> Seq.filter( fun (_,(_,elem)) -> elem <> -1.0)
|> PSeq.groupBy(fst)
|> Seq.map(snd>>Seq.map(snd))

Thanks for asking this cool question



回答2:

I translated Alexey's Haskell to F#, but it's not pretty in F#, and still one element too eager.

I expect there is a better way, but I'll have to try again later.

let N = 20
let data =  // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }

let rec GroupBy comp (input:LazyList<'a>) : LazyList<LazyList<'a>> = 
    LazyList.delayed (fun () ->
    match input with
    | LazyList.Nil -> LazyList.cons (LazyList.empty()) (LazyList.empty())
    | LazyList.Cons(x,LazyList.Nil) -> 
        LazyList.cons (LazyList.cons x (LazyList.empty())) (LazyList.empty())
    | LazyList.Cons(x,(LazyList.Cons(y,_) as xs)) ->
        let groups = GroupBy comp xs
        if comp x y then
            LazyList.consf 
                (LazyList.consf x (fun () -> 
                    let (LazyList.Cons(firstGroup,_)) = groups
                    firstGroup)) 
                (fun () -> 
                    let (LazyList.Cons(_,otherGroups)) = groups
                    otherGroups)
        else
            LazyList.cons (LazyList.cons x (LazyList.empty())) groups)

let result = data |> LazyList.of_seq |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn "  %d" x


回答3:

You seem to want a function that has signature

(`a -> bool) -> seq<'a> -> seq<seq<'a>>

I.e. a function and a sequence, then break up the input sequence into a sequence of sequences based on the result of the function.

Caching the values into a collection that implements IEnumerable would likely be simplest (albeit not exactly purist, but avoiding iterating the input multiple times. It will lose much of the laziness of the input):

let groupBy (fun: 'a -> bool) (input: seq) =
  seq {
    let cache = ref (new System.Collections.Generic.List())
    for e in input do
      (!cache).Add(e)
      if not (fun e) then
        yield !cache
        cache := new System.Collections.Generic.List()
    if cache.Length > 0 then
     yield !cache
  }

An alternative implementation could pass cache collection (as seq<'a>) to the function so it can see multiple elements to chose the break points.



回答4:

A Haskell solution, because I don't know F# syntax well, but it should be easy enough to translate:

type TimeStamp = Integer -- ticks
type TimeSpan  = Integer -- difference between TimeStamps

groupContiguousDataPoints :: TimeSpan -> [(TimeStamp, a)] -> [[(TimeStamp, a)]]

There is a function groupBy :: (a -> a -> Bool) -> [a] -> [[a]] in the Prelude:

The group function takes a list and returns a list of lists such that the concatenation of the result is equal to the argument. Moreover, each sublist in the result contains only equal elements. For example,

group "Mississippi" = ["M","i","ss","i","ss","i","pp","i"]

It is a special case of groupBy, which allows the programmer to supply their own equality test.

It isn't quite what we want, because it compares each element in the list with the first element of the current group, and we need to compare consecutive elements. If we had such a function groupBy1, we could write groupContiguousDataPoints easily:

groupContiguousDataPoints maxTimeDiff list = groupBy1 (\(t1, _) (t2, _) -> t2 - t1 <= maxTimeDiff) list

So let's write it!

groupBy1 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy1 _    []            = [[]]
groupBy1 _    [x]           = [[x]]
groupBy1 comp (x : xs@(y : _))
  | comp x y  = (x : firstGroup) : otherGroups
  | otherwise = [x] : groups
  where groups@(firstGroup : otherGroups) = groupBy1 comp xs

UPDATE: it looks like F# doesn't let you pattern match on seq, so it isn't too easy to translate after all. However, this thread on HubFS shows a way to pattern match sequences by converting them to LazyList when needed.

UPDATE2: Haskell lists are lazy and generated as needed, so they correspond to F#'s LazyList (not to seq, because the generated data is cached (and garbage collected, of course, if you no longer hold a reference to it)).



回答5:

(EDIT: This suffers from a similar problem to Brian's solution, in that iterating the outer sequence without iterating over each inner sequence will mess things up badly!)

Here's a solution that nests sequence expressions. The imperitave nature of .NET's IEnumerable<T> is pretty apparent here, which makes it a bit harder to write idiomatic F# code for this problem, but hopefully it's still clear what's going on.

let groupBy cmp (sq:seq<_>) =
  let en = sq.GetEnumerator()
  let rec partitions (first:option<_>) =
    seq {
      match first with
      | Some first' ->             //'
        (* The following value is always overwritten; 
           it represents the first element of the next subsequence to output, if any *)
        let next = ref None

        (* This function generates a subsequence to output, 
           setting next appropriately as it goes *)
        let rec iter item = 
          seq {
            yield item
            if (en.MoveNext()) then
              let curr = en.Current
              if (cmp item curr) then
                yield! iter curr
              else // consumed one too many - pass it on as the start of the next sequence
                next := Some curr
            else
              next := None
          }
        yield iter first' (* ' generate the first sequence *)
        yield! partitions !next (* recursively generate all remaining sequences *)
      | None -> () // return an empty sequence if there are no more values
    }
  let first = if en.MoveNext() then Some en.Current else None
  partitions first

let groupContiguousDataPoints (time:TimeSpan) : (seq<DateTime*_> -> _) = 
  groupBy (fun (t,_) (t',_) -> t' - t <= time)


回答6:

Okay, trying again. Achieving the optimal amount of laziness turns out to be a bit difficult in F#... On the bright side, this is somewhat more functional than my last attempt, in that it doesn't use any ref cells.

let groupBy cmp (sq:seq<_>) =
  let en = sq.GetEnumerator()
  let next() = if en.MoveNext() then Some en.Current else None
  (* this function returns a pair containing the first sequence and a lazy option indicating the first element in the next sequence (if any) *)
  let rec seqStartingWith start =
    match next() with
    | Some y when cmp start y ->
        let rest_next = lazy seqStartingWith y // delay evaluation until forced - stores the rest of this sequence and the start of the next one as a pair
        seq { yield start; yield! fst (Lazy.force rest_next) }, 
          lazy Lazy.force (snd (Lazy.force rest_next))
    | next -> seq { yield start }, lazy next
  let rec iter start =
    seq {
      match (Lazy.force start) with
      | None -> ()
      | Some start -> 
          let (first,next) = seqStartingWith start
          yield first
          yield! iter next
    }
  Seq.cache (iter (lazy next()))


回答7:

Below is some code that does what I think you want. It is not idiomatic F#.

(It may be similar to Brian's answer, though I can't tell because I'm not familiar with the LazyList semantics.)

But it doesn't exactly match your test specification: Seq.length enumerates its entire input. Your "test code" calls Seq.length and then calls Seq.hd. That will generate an enumerator twice, and since there is no caching, things get messed up. I'm not sure if there is any clean way to allow multiple enumerators without caching. Frankly, seq<seq<'a>> may not be the best data structure for this problem.

Anyway, here's the code:

type State<'a> = Unstarted | InnerOkay of 'a | NeedNewInner of 'a | Finished

// f() = true means the neighbors should be kept together
// f() = false means they should be split
let split_up (f : 'a -> 'a -> bool) (input : seq<'a>) =
    // simple unfold that assumes f captured a mutable variable
    let iter f = Seq.unfold (fun _ -> 
        match f() with
        | Some(x) -> Some(x,())
        | None -> None) ()

    seq {
        let state = ref (Unstarted)
        use ie = input.GetEnumerator()

        let innerMoveNext() = 
            match !state with
            | Unstarted -> 
                if ie.MoveNext()
                then let cur = ie.Current
                     state := InnerOkay(cur); Some(cur)
                else state := Finished; None 
            | InnerOkay(last) ->
                if ie.MoveNext()
                then let cur = ie.Current
                     if f last cur
                     then state := InnerOkay(cur); Some(cur)
                     else state := NeedNewInner(cur); None
                else state := Finished; None
            | NeedNewInner(last) -> state := InnerOkay(last); Some(last)
            | Finished -> None 

        let outerMoveNext() =
            match !state with
            | Unstarted | NeedNewInner(_) -> Some(iter innerMoveNext)
            | InnerOkay(_) -> failwith "Move to next inner seq when current is active: undefined behavior."
            | Finished -> None

        yield! iter outerMoveNext }


open System

let groupContigs (contigTime : TimeSpan) (holey : seq<DateTime * int>) =
    split_up (fun (t1,_) (t2,_) -> (t2 - t1) <= contigTime) holey


// Test data
let numbers = {1 .. 15}
let contiguousTimeStamps = 
    let baseTime = DateTime.Now
    seq { for n in numbers -> baseTime.AddMinutes(float n)}

let holeyData = 
    Seq.zip contiguousTimeStamps numbers 
        |> Seq.filter (fun (dateTime, num) -> num % 7 <> 0)

let grouped_data = groupContigs (new TimeSpan(0,1,0)) holeyData


printfn "Consuming..."
for group in grouped_data do
    printfn "about to do a group"
    for x in group do
        printfn "  %A" x


回答8:

Ok, here's an answer I'm not unhappy with.

(EDIT: I am unhappy - it's wrong! No time to try to fix right now though.)

It uses a bit of imperative state, but it is not too difficult to follow (provided you recall that '!' is the F# dereference operator, and not 'not'). It is as lazy as possible, and takes a seq as input and returns a seq of seqs as output.

let N = 20
let data =  // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }
let rec GroupBy comp (input:seq<_>) = seq {
    let doneWithThisGroup = ref false
    let areMore = ref true
    use e = input.GetEnumerator()
    let Next() = areMore := e.MoveNext(); !areMore
    // deal with length 0 or 1, seed 'prev'
    if not(e.MoveNext()) then () else
    let prev = ref e.Current
    while !areMore do
        yield seq {
            while not(!doneWithThisGroup) do
                if Next() then
                    let next = e.Current 
                    doneWithThisGroup := not(comp !prev next)
                    yield !prev 
                    prev := next
                else
                    // end of list, yield final value
                    yield !prev
                    doneWithThisGroup := true } 
        doneWithThisGroup := false }
let result = data |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn "  %d" x


标签: f# sequences