Let's say I'm using the LINQ array .Distinct() method.
The result is unordered.
Well, everything is "ordered" if you know the logic used to produce the result.
My question is about the result set. Will the resulting array be in the "first distinct" order or perhaps the "last distinct" order?
Can I never count on any order?
This is the old "remove duplicate strings" problem but I'm looking into the LINQ solution.
Assuming you mean LINQ to Objects, it basically keeps a set of all the results it's returned so far, and only yields the "current" item if it hasn't been yielded before. So the results are in the original order, with duplicates removed. Something like this (except with error checking etc):
    public static IEnumerable<T> Distinct<T>(this IEnumerable<T> source)
    {
        HashSet<T> set = new HashSet<T>();
        foreach (T item in source)
        {
            if (set.Add(item))
            {
                // New item, so yield it
                yield return item;
            }
        }
    }
This isn't guaranteed - but I can't imagine any more sensible implementation. This allows Distinct() to be as lazy as it can be - data is returned as soon as it can be, and only the minimum amount of data is buffered.
Relying on this would be a bad idea, but it can be instructive to know how the current implementation (apparently) works. In particular, you can easily observe that it starts returning data before exhausting the original sequence, simply by creating a source which logs when it produces data to be consumed by Distinct, and also logging when you receive data from Distinct.
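The logging experiment described above could look roughly like this. It is only a sketch: `LoggingSource`, `Run` and the class name are hypothetical names made up for the illustration, and the interleaving it shows is what the current LINQ to Objects implementation does, not something the documentation promises.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctLazinessDemo
{
    // Hypothetical helper: yields each value and records the moment it is produced.
    public static IEnumerable<string> LoggingSource(string[] values, List<string> log)
    {
        foreach (string value in values)
        {
            log.Add("produced " + value);
            yield return value;
        }
    }

    // Consumes Distinct() over the logging source and records each item received.
    public static List<string> Run()
    {
        var log = new List<string>();
        string[] values = { "a", "b", "a", "c" };
        foreach (string item in LoggingSource(values, log).Distinct())
        {
            log.Add("received " + item);
        }
        return log;
    }

    public static void Main()
    {
        foreach (string line in Run())
        {
            Console.WriteLine(line);
        }
    }
}
```

With LINQ to Objects as currently implemented, each "received" entry appears right after the matching "produced" entry (and the second "a" is produced but never received), which shows that Distinct streams its results instead of buffering the whole source first.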
The docs say:
"The result sequence is unordered."
You can never count on any order. It would be entirely permissible for LINQ to implement this using hash tables (and indeed, I believe it IS implemented that way in .NET 4).
The Distinct method doesn't officially guarantee an order as far as I know, although in practice the LINQ to Objects implementation returns the items in the order they first appear in the source enumerable.
If you use LINQ to SQL, for example, it is up to the database to decide what order it returns the results in, so you should not rely on this order even being consistent from one call to the next.
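If the order matters, one option is to make it explicit rather than depend on what Distinct happens to do. The sketch below assumes an in-memory sequence. `DistinctInFirstSeenOrder` is a hypothetical helper written for this example, not part of LINQ, and its first-occurrence order comes from its own loop. The second call shows the other route: sort the distinct items yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderedDistinctSketch
{
    // Hypothetical helper (not a LINQ method): the first-occurrence order
    // is guaranteed by this loop, not by any library behaviour.
    public static List<T> DistinctInFirstSeenOrder<T>(IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        var seen = new HashSet<T>();
        var result = new List<T>();
        foreach (T item in source)
        {
            // HashSet<T>.Add returns false when the item is already present.
            if (seen.Add(item))
            {
                result.Add(item);
            }
        }
        return result;
    }

    public static void Main()
    {
        string[] names = { "pear", "apple", "pear", "fig", "apple" };

        // First-occurrence order, enforced by our own code: pear, apple, fig
        Console.WriteLine(string.Join(", ", DistinctInFirstSeenOrder(names)));

        // Explicit sort applied after Distinct: apple, fig, pear
        Console.WriteLine(string.Join(", ",
            names.Distinct().OrderBy(n => n, StringComparer.Ordinal)));
    }
}
```

For a database-backed query, the equivalent would be to put an explicit ordering in the query itself, so that the order does not depend on how the provider evaluates Distinct.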
At a guess it's using a hash table to produce the set of distinct keys, and producing the output in order by the hashes.