I've got an interresting problem: Given an IEnumerable<string>
, is it possible to yield a sequence of IEnumerable<IEnumerable<string>>
that groups identical adjacent strings in one pass?
Let me explain.
1. Basic illustrative sample :
Considering the following IEnumerable<string>
(pseudo representation):
{"a","b","b","b","c","c","d"}
How to get an IEnumerable<IEnumerable<string>>
that would yield something of the form:
{ // IEnumerable<IEnumerable<string>>
{"a"}, // IEnumerable<string>
{"b","b","b"}, // IEnumerable<string>
{"c","c"}, // IEnumerable<string>
{"d"} // IEnumerable<string>
}
The method prototype would be:
public IEnumerable<IEnumerable<string>> Group(IEnumerable<string> items)
{
// todo
}
But it could also be :
public void Group(IEnumerable<string> items, Action<IEnumerable<string>> action)
{
// todo
}
...where action
would be called for each subsequence.
2. More complicated sample
Ok, the first sample is very simple, and only aims to make the high level intent clear.
Now imagine we are dealing with IEnumerable<Anything>
, where Anything
is a type defined like this:
public class Anything
{
public string Key {get;set;}
public double Value {get;set;}
}
We now want to generate the subsequences based on the Key, (group every consecutive Anything
that have the same key) to later use them in order to calculate the total value by group:
public void Compute(IEnumerable<Anything> items)
{
Console.WriteLine(items.Sum(i=>i.Value));
}
// then somewhere, assuming the Group method
// that returns an IEnumerable<IEnumerable<Anything>> actually exists:
foreach(var subsequence in Group(allItems))
{
Compute(subsequence);
}
3. Important notes
- Only one iteration over the original sequence
- No intermediary collections allocations (we can assume millions of items in the original sequence, and millions consecutives items in each group)
- Keeping enumerators and defered execution behavior
- We can assume that resulting subsequences will be iterated only once, and will be iterated in order.
Is it possible, and how would you write it?
Way Better Solution That Meets All Requirements
OK, scrap my previous solution (I'll leave it below, just for reference). Here's a much better approach that occurred to me after making my initial post.
Write a new class that implements
IEnumerator<T>
and provides a few additional properties:IsValid
andPrevious
. This is all you really need to resolve the whole mess with having to maintain state inside an iterator block usingyield
.Here's how I did it (pretty trivial, as you can see):
(I called this a
ChipmunkEnumerator
because maintaining the previous value reminded me of how chipmunks have pouches in their cheeks where they keep nuts. Does it really matter? Stop making fun of me.)Now, utilizing this class in an extension method to provide exactly the behavior you want isn't so tough!
Notice that below I've defined
GroupConsecutive
to actually return anIEnumerable<IGrouping<TKey, T>>
for the simple reason that, if these are grouped by key anyway, it makes sense to return anIGrouping<TKey, T>
rather than just anIEnumerable<T>
. As it turns out, this will help us out later anyway...(To implement these methods, I wrote a simple
Grouping<TKey, T>
class that implementsIGrouping<TKey, T>
in the most straightforward way possible. I've omitted the code just so as to keep moving along...)OK, check it out. I think the code example below pretty well captures something resembling the more realistic scenario you described in your updated question.
Output:
Notice this also fixes the problem with my original answer of dealing with
IEnumerator<T>
objects that were value types. (With this approach, it doesn't matter.)There's still going to be a problem if you try calling
ToList
here, as you will find out if you try it. But considering you included deferred execution as a requirement, I doubt you would be doing that anyway. For aforeach
, it works.Original, Messy, and Somewhat Stupid Solution
Something tells me I'm going to get totally refuted for saying this, but...
Yes, it is possible (I think). See below for a damn messy solution I threw together. (Catches an exception to know when it's finished, so you know it's a great design!)
Now, Jon's point about there being a very real problem in the event that you try to do, for instance,
ToList
, and then access the values in the resulting list by index, is totally valid. But if your only intention here is to be able to loop over anIEnumerable<T>
using aforeach
-- and you're only doing this in your own code -- then, well, I think this could work for you.Anyway, here's a quick example of how it works:
Output:
And now for the (messy as crap) code:
Note that since this approach requires passing the actual enumerator around between a few different methods, it will not work if that enumerator is a value type, as calls to
MoveNext
in one method are only affecting a local copy.Is this what you are looking for?
This solution relies on object state because it's difficult to share state between two IEnumerable methods that use yield (no ref or out params).
Edit: Added extension method for cleaner usage. Fixed loop test logic so that "More" is evaluated first.
Edit: Dispose the enumerator when finished
Your second bullet is the problematic one. Here's why:
Here, it's trying to iterate over the fourth group and then the first group... that's clearly only going to work if all the groups are buffered or it can reread the sequence, neither of which is ideal.
I suspect you want a more "reactive" approach - I don't know offhand whether Reactive Extensions does what you want (the "consecutive" requirement is unusual) but you should basically provide some sort of action to be executed on each group... that way the method won't need to worry about having to return you something which could be used later on, after it's already finished reading.
Let me know if you'd like me to try to find a solution within Rx, or whether you would be happy with something like:
Here's a solution that I think satisfies your requirements, works with any type of data item, and is quite short and readable:
Notes:
TakeWhile
andSkipWhile
do it).SkipWhile
); it does iterate over the collection once more when you process the returned IEnumerables, but the partitioning itself iterates only once.while
condition to a test fornull
.If I am somehow mistaken, I 'd be especially interested in comments pointing out the mistakes!
Very Important Aside:
This solution will not allow you to enumerate the produced enumerables in any order other than the one it provides them in. However, I think the original poster has been pretty clear in comments that this is not a problem.