I am using LINQ to Objects and wonder if it is possible to improve the performance of my queries by making use of an index that I have. This is best explained with an example. Imagine a simple type...
public class Person
{
public int Age;
public string FirstName;
public string LastName;
}
And a simple query I would make against it...
List<Person> people = new List<Person>();
// 'people' populated with 50,000 instances...
var x = from t in people
where t.Age > 18 && t.Age < 21
select t;
If I understand LINQ to Objects correctly then the implementation of the Where extension method will enumerate all 50,000 instances in the people collection in order to find the 100 that actually match. As it happens I already have an index of the people collection that is sorted by Age. Like this...
SortedList<int, Person> ageSorted = new SortedList<int, Person>();
Clearly it would make sense if I could get the Where to use the SortedList so that it no longer has to enumerate all 50,000 instances, instead finding the range of 100 matching entries and so saving time.
Is it possible to extend LINQ to Objects to enable my situation? Is it already possible but I am missing the technique?
There's already a project which I believe does exactly that - i4o. I can't say I've used it myself, but it sounds like the kind of thing you want... you may need to juggle your existing code a bit, but it's certainly worth looking at.
If that doesn't help, you could at least write your own extension methods on
SortedList<TKey, TValue>
. You probably wouldn't be able to easily use your actualwhere
clause, but you could use your own methods taking a minimum and a maximum value. You might also want to make them apply toIList<T>
where you assert that you've already sorted the values appropriately (according to some comparer).For example (completely untested):
(If you only have
List<T>
instead ofIList<T>
, you could useList<T>.BinarySearch
, although you'd need to build a customIComparer<T>
.)Also, have a look at
SortedSet<T>
in .NET 4.The LINQ query syntax actually uses any extension method named
Where
that matches the signature, so you can always write your own that handles your specific type the way you want....
Then you can apply any logic you want. I believe the normal method overload rules apply for determining which
Where
extension method would be called.In a pre-sorted container, the efficiency is achieved by finding the first element quickly. Once you find the first element, just linearly retrieve the following elements until you find the end of your range.
Assuming your
SortedList
is sorted byPerson.Age
, you can find the first element of the range usingSortedList.IndexOfKey
, which is a binary search algorithm; therefore, this method is an O(log n) operation.(I don't think you can change your code so the
Enumerable.Where
suddenly becomes more intelligent and finds the range start by using binary search.)--- EDIT ---
Actually, what you really need is
List.BinarySearch
orArray.BinarySearch
. TheSortedList.IndexOfKey
won't let you get the index of the closest match in case exact match does not exist. Or you can just implement the binary search yourself.You're right that the query you wrote will enumerate the whole list as obviously LINQ can't assume anything about your data.
If you have a SortedList, you can exploit that using the SkipWhile/TakeWhile linq methods:
EDIT
@Davy8 is right of course that worst case this still has the same performance. See the other answers for a way to more quickly find the first value.
If you need to do this operation more than once for different age ranges then you can probably also speed it up by grouping on age: