Create batches in linq

Posted 2018-12-31 08:51

Can someone suggest a way to create batches of a certain size in linq?

Ideally I want to be able to perform operations in chunks of some configurable amount.

Tags: c# linq
13 answers
梦寄多情
Reply #2 · 2018-12-31 09:09

All of the above perform terribly with large batches or low memory space. I had to write my own version that pipelines (notice there is no item accumulation anywhere):

public static class BatchLinq {
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size) {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size) {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

Edit: A known issue with this approach is that each batch must be enumerated, and enumerated fully, before moving on to the next batch. For example, this doesn't work:

//Select first item of every 100 items
Batch(list, 100).Select(b => b.First())
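
If partial enumeration of a batch is needed, one option is to give up pipelining inside each batch and buffer it into a list before yielding. The following is only a minimal sketch under that assumption, not the approach above: `BufferedBatchLinq` and `BatchBuffered` are hypothetical names, and memory use grows with the batch size.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BufferedBatchLinq
{
    // Hypothetical variant: each batch is copied into its own List<T>,
    // so a batch can be enumerated partially, repeatedly, or not at all.
    public static IEnumerable<IReadOnlyList<T>> BatchBuffered<T>(this IEnumerable<T> source, int size)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero.");

        return BatchBufferedIterator(source, size);
    }

    private static IEnumerable<IReadOnlyList<T>> BatchBufferedIterator<T>(IEnumerable<T> source, int size)
    {
        var bucket = new List<T>();
        foreach (var item in source)
        {
            bucket.Add(item);
            if (bucket.Count == size)
            {
                yield return bucket;
                bucket = new List<T>(); // start a fresh buffer; the yielded one is never reused
            }
        }

        // Emit the final, possibly shorter, batch.
        if (bucket.Count > 0)
            yield return bucket;
    }
}
```

With this variant, `Enumerable.Range(1, 6).BatchBuffered(3).Select(b => b.First())` yields 1 and 4. On .NET 6 or later, the built-in `Enumerable.Chunk` method also returns buffered batches (as arrays).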
骚的不知所云
Reply #3 · 2018-12-31 09:12
    static IEnumerable<IEnumerable<T>> TakeBatch<T>(IEnumerable<T> ts,int batchSize)
    {
        return from @group in ts.Select((x, i) => new { x, i }).ToLookup(xi => xi.i / batchSize)
               select @group.Select(xi => xi.x);
    }
初与友歌
Reply #4 · 2018-12-31 09:15

If you start with a sequence defined as an IEnumerable<T>, and you know that it can safely be enumerated multiple times (e.g. because it is an array or a list), you can just use this simple pattern to process the elements in batches:

while (sequence.Any())
{
    var batch = sequence.Take(10);
    sequence = sequence.Skip(10);

    // do whatever you need to do with each batch here
}
后来的你喜欢了谁
Reply #5 · 2018-12-31 09:15

So with a functional hat on, this appears trivial... but in C# there are some significant downsides.

You'd probably view this as an unfold of IEnumerable (google it and you'll probably end up in some Haskell docs, but there may be some F# material using unfold; if you know F#, squint at the Haskell docs and it will make sense).

Unfold is related to fold ("aggregate"), except that rather than iterating through the input IEnumerable, it iterates through the output data structures (it's a similar relationship to the one between IEnumerable and IObservable; in fact I think IObservable does implement an "unfold" called generate...).

Anyway, first you need an unfold method. I think this works:

    static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        var maybeNewSeedAndElement = f(seed);

        return maybeNewSeedAndElement.SelectMany(x => new[] { x.Item2 }.Concat(Unfold(f, x.Item1)));
    }

This is a bit obtuse because C# doesn't implement some of the things functional languages take for granted, but it basically takes a seed and then generates a "Maybe" answer containing the next element in the IEnumerable and the next seed (Maybe doesn't exist in C#, so we've used IEnumerable to fake it), and concatenates the rest of the answer (I can't vouch for the "O(n?)" complexity of this).

Once you've done that:

    static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }

It all looks quite clean: you take the first "n" elements as the "next" element of the output IEnumerable, and the "tail" is the rest of the unprocessed list.

If there is nothing in the head, you are done, so you return "Nothing" (faked as an empty IEnumerable); otherwise you return the head element and the tail to process.

You can probably do this using IObservable; there is probably a "Batch"-like method already there that you can use.

If the risk of stack overflows worries you (it probably should), then you should implement this in F# (there is probably some F# library, FSharpX perhaps, that already has it).

(I have only done some rudimentary tests of this, so there may be the odd bug in there.)
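
If staying in C# is preferred, the recursion in the unfold can probably be replaced by a loop. The sketch below is one possible way to do that, assuming the function passed in returns at most one (seed, element) pair, as the "Maybe" encoding above intends; `IterativeUnfoldSketch` is a hypothetical name. Like the version above, it re-enumerates the source for every batch through the growing `Skip` chain, so it only suits sequences that are safe and cheap to enumerate repeatedly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IterativeUnfoldSketch
{
    // Loop-based unfold: keeps asking f for the next (seed, element) pair
    // until f returns an empty sequence ("Nothing").
    public static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        var state = seed;
        while (true)
        {
            // Assumption: f yields zero or one pair; anything beyond the first is ignored.
            var next = f(state).Take(1).ToList();
            if (next.Count == 0)
                yield break;

            yield return next[0].Item2;
            state = next[0].Item1;
        }
    }

    public static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold<IEnumerable<T>, IEnumerable<T>>(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                // Empty when there is nothing left, otherwise exactly one pair.
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }
}
```

For example, `IterativeUnfoldSketch.Batch(Enumerable.Range(1, 7), 3)` produces three batches: the first two with three items each and the last with one.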

一个人的天荒地老
Reply #6 · 2018-12-31 09:16

I'm joining this very late, but I found something more interesting.

We can use Skip and Take here for better performance.

public static class MyExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items, int maxItems)
    {
        return items.Select((item, index) => new { item, index })
                    .GroupBy(x => x.index / maxItems)
                    .Select(g => g.Select(x => x.item));
    }

    public static IEnumerable<T> Batch2<T>(this IEnumerable<T> items, int skip, int take)
    {
        return items.Skip(skip).Take(take);
    }
}

Next I tested with 100000 records. The looping alone takes more time in the case of Batch.

Code of the console application:

static void Main(string[] args)
{
    List<string> Ids = GetData("First");
    List<string> Ids2 = GetData("tsriF");

    Stopwatch FirstWatch = new Stopwatch();
    FirstWatch.Start();
    foreach (var batch in Ids2.Batch(5000))
    {
        // Console.WriteLine("Batch Output:= " + string.Join(",", batch));
    }
    FirstWatch.Stop();
    Console.WriteLine("Done Processing time taken:= "+ FirstWatch.Elapsed.ToString());


    Stopwatch Second = new Stopwatch();

    Second.Start();
    int Length = Ids2.Count;
    int StartIndex = 0;
    int BatchSize = 5000;
    while (Length > 0)
    {
        var SecBatch = Ids2.Batch2(StartIndex, BatchSize);
        // Console.WriteLine("Second Batch Output:= " + string.Join(",", SecBatch));
        Length = Length - BatchSize;
        StartIndex += BatchSize;
    }

    Second.Stop();
    Console.WriteLine("Done Processing time taken Second:= " + Second.Elapsed.ToString());
    Console.ReadKey();
}

static List<string> GetData(string name)
{
    List<string> Data = new List<string>();
    for (int i = 0; i < 100000; i++)
    {
        Data.Add(string.Format("{0} {1}", name, i.ToString()));
    }

    return Data;
}

The times taken are as follows.

First: 00:00:00.0708, 00:00:00.0660

Second (the Take and Skip one): 00:00:00.0008, 00:00:00.0008
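
One thing worth checking when reading timings like these: `Skip` and `Take` are deferred, so a loop that only builds `items.Skip(...).Take(...)` without enumerating the result does not touch the data. The sketch below illustrates this with a counting source; `DeferredExecutionDemo`, `Counted` and `Run` are hypothetical names introduced for the illustration and are not part of the benchmark above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Hypothetical source that reports every item actually pulled from it.
    public static IEnumerable<int> Counted(int n, Action onPull)
    {
        for (int i = 0; i < n; i++)
        {
            onPull();
            yield return i;
        }
    }

    // Returns how many items were pulled before and after the batch is
    // materialized, plus the size of the materialized batch.
    public static (int Before, int After, int Count) Run()
    {
        int pulled = 0;
        var source = Counted(100, () => pulled++);

        // Only builds the query; nothing is read from the source yet.
        var batch = source.Skip(10).Take(5);
        int before = pulled;

        // Enumeration happens here.
        var materialized = batch.ToList();

        return (before, pulled, materialized.Count);
    }
}
```

Here `Run().Before` is 0 and `Run().Count` is 5, so a fair comparison would probably need to enumerate each batch (for example with `ToList()`) inside both timed loops.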

琉璃瓶的回忆
Reply #7 · 2018-12-31 09:17

Another way is to use the Rx Buffer operator:

//using System.Linq;
//using System.Reactive.Linq;
//using System.Reactive.Threading.Tasks;

var observableBatches = anEnumerable.ToObservable().Buffer(size);

var batches = aList.ToObservable().Buffer(size).ToList().ToTask().GetAwaiter().GetResult();