LINQ to Objects and improved perf with an Index?

2020-06-03 06:25发布

I am using LINQ to Objects and wonder if it is possible to improve the performance of my queries by making use of an index that I have. This is best explained with an example. Imagine a simple type...

public class Person
{
    public int Age;
    public string FirstName;
    public string LastName;
}

And a simple query I would make against it...

List<Person> people = new List<Person>();

// 'people' populated with 50,000 instances...

var x = from t in people
        where t.Age > 18 && t.Age < 21
        select t;

If I understand LINQ to Objects correctly then the implementation of the Where extension method will enumerate all 50,000 instances in the people collection in order to find the 100 that actually match. As it happens I already have an index of the people collection that is sorted by Age. Like this...

SortedList<int, Person> ageSorted = new SortedList<int, Person>();

Clearly it would make sense if I could get the Where to use the SortedList so that it no longer has to enumerate all 50,000 instances, instead finding the range of 100 matching entries and so saving time.

Is it possible to extend LINQ to Objects to enable my situation? Is it already possible but I am missing the technique?

4条回答
成全新的幸福
2楼-- · 2020-06-03 06:38

There's already a project which I believe does exactly that - i4o. I can't say I've used it myself, but it sounds like the kind of thing you want... you may need to juggle your existing code a bit, but it's certainly worth looking at.

If that doesn't help, you could at least write your own extension methods on SortedList<TKey, TValue>. You probably wouldn't be able to easily use your actual where clause, but you could use your own methods taking a minimum and a maximum value. You might also want to make them apply to IList<T> where you assert that you've already sorted the values appropriately (according to some comparer).

For example (completely untested):

public static IEnumerable<T> Between<T, TKey>(this IList<T> source,
                                              Func<T, TKey> projection,
                                              TKey minKeyInclusive,
                                              TKey maxKeyExclusive,
                                              IComparer<TKey> comparer)
{
    comparer = comparer ?? Comparer<TKey>.Default;

    // TODO: Find the index of the lower bound via a binary search :)
    // (It's too late for me to jot it down tonight :)
    int index = ...; // Find minimum index

    while (index < source.Count &&
           comparer.Compare(projection(source[index]), maxKeyExclusive) < 0)
    {
        yield return source[index];
        index++;
    }
}

(If you only have List<T> instead of IList<T>, you could use List<T>.BinarySearch, although you'd need to build a custom IComparer<T>.)

Also, have a look at SortedSet<T> in .NET 4.

查看更多
贪生不怕死
3楼-- · 2020-06-03 06:41

The LINQ query syntax actually uses any extension method named Where that matches the signature, so you can always write your own that handles your specific type the way you want.

    public static IEnumerable<Person> Where(this IEnumerable<Person> collection, Func<Person, bool> condition )
    {
        Console.WriteLine("My Custom 'Where' method called");
        return System.Linq.Enumerable.Where(collection, condition);
    }

...

        var x = from t in people
                where t.Age > 18 && t.Age < 21
                select t; //Will print "My Custom 'Where' method called"

Then you can apply any logic you want. I believe the normal method overload rules apply for determining which Where extension method would be called.

查看更多
放我归山
4楼-- · 2020-06-03 06:51

In a pre-sorted container, the efficiency is achieved by finding the first element quickly. Once you find the first element, just linearly retrieve the following elements until you find the end of your range.

Assuming your SortedList is sorted by Person.Age, you can find the first element of the range using SortedList.IndexOfKey, which is a binary search algorithm; therefore, this method is an O(log n) operation.

(I don't think you can change your code so the Enumerable.Where suddenly becomes more intelligent and finds the range start by using binary search.)

--- EDIT ---

Actually, what you really need is List.BinarySearch or Array.BinarySearch. The SortedList.IndexOfKey won't let you get the index of the closest match in case exact match does not exist. Or you can just implement the binary search yourself.

查看更多
够拽才男人
5楼-- · 2020-06-03 07:03

You're right that the query you wrote will enumerate the whole list as obviously LINQ can't assume anything about your data.

If you have a SortedList, you can exploit that using the SkipWhile/TakeWhile linq methods:

 var x = x.SkipWhile(kv => kv.Key <= 18).TakeWhile(kv => kv.Key < 21)

EDIT

@Davy8 is right of course that worst case this still has the same performance. See the other answers for a way to more quickly find the first value.

If you need to do this operation more than once for different age ranges then you can probably also speed it up by grouping on age:

var byAge = people.GroupBy(p => p.Age);

var x = from grp in byAge 
        where grp.Key > 18 && grp.Key < 21
        from person in grp
        select person;
查看更多
登录 后发表回答