Suppose I have the following code:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
");
Enumerable.Range (1, 100)
.Select (s => X.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.Aggregate (s, (t, u) => t * u)
)
.ToList ()
.ForEach (s => Console.WriteLine (s));
What is the .NET runtime actually doing here? Is it parsing and converting the attributes to integers each of the 100 times, or is it smart enough to figure out that it should cache the parsed values and not repeat the computation for each element in the range?
Moreover, how would I go about figuring out something like this myself?
Thanks in advance for your help.
It has been a while since I dug through this code but, IIRC, the way Select
works is to simply cache the Func
you supply it and run it on the source collection one at a time. So, for each element in the outer range, it will run the inner Select/Aggregate
sequence as if it were the first time. There isn't any built-in caching going on -- you would have to implement that yourself in the expressions.
If you wanted to figure this out yourself, you've got three basic options:
- Compile the code and use
ildasm
to view the IL; it's the most accurate but, especially with lambdas and closures, what you get from IL may look nothing like what you put into the C# compiler.
- Use something like dotPeek to decompile System.Linq.dll into C#; again, what you get out of these kinds of tools may only approximately resemble the original source code, but at least it will be C# (and dotPeek in particular does a pretty good job, and is free.)
- My personal preference - download the .NET 4.0 Reference Source and look for yourself; this is what it's for :) You have to just trust MS that the reference source matches the actual source used to produce the binaries, but I don't see any good reason to doubt them.
- As pointed out by @AllonGuralnek you can set breakpoints on specific lambda expressions within a single line; put your cursor somewhere inside the body of the lambda and press F9 and it will breakpoint just the lambda. (If you do it wrong, it will highlight the entire line in the breakpoint color; if you do it right, it will just highlight the lambda.)
LINQ and IEnumerable<T>
is pull based. This means that the predicates and actions that are part of the LINQ statement in general are not executed until values are pulled. Furthermore the predicates and actions will execute each time values are pulled (e.g. there is no secret caching going on).
Pulling from an IEnumerable<T>
is done by the foreach
statement which really is syntactic sugar for getting an enumerator by calling IEnumerable<T>.GetEnumerator()
and repeatedly calling IEnumerator<T>.MoveNext()
to pull the values.
LINQ operators like ToList()
, ToArray()
, ToDictionary()
and ToLookup()
wraps a foreach
statement so these methods will do a pull. The same can be said about operators like Aggregate()
, Count()
and First()
. These methods have in common that they produce a single result that has to be created by executing a foreach
statement.
Many LINQ operators produce a new IEnumerable<T>
sequence. When an element is pulled from the resulting sequence the operator pulls one or more elements from the source sequence. The Select()
operator is the most obvious example but other examples are SelectMany()
, Where()
, Concat()
, Union()
, Distinct()
, Skip()
and Take()
. These operators don't do any caching. When then N'th element is pulled from a Select()
it pulls the N´th element from the source sequence, applies the projection using the action supplied and returns it. Nothing secret going on here.
Other LINQ operators also produce new IEnumerable<T>
sequences but they are implemented by actually pulling the entire source sequence, doing their job and then producing a new sequence. These methods include Reverse()
, OrderBy()
and GroupBy()
. However, the pull done by the operator is only performed when the operator itself is pulled meaning that you still need a foreach
loop "at the end" of the LINQ statement before anything is executed. You could argue that these operators use a cache because they immediately pull the entire source sequence. However, this cache is built each time the operator is iterated so it is really an implementation detail and not something that will magically detect that you are applying the same OrderBy()
operation multiple times to the same sequence.
In your example the ToList()
will do a pull. The action in the outer Select
will execute 100 times. Each time this action is executed the Aggregate()
will do another pull that will parse the XML attributes. In total your code will call Int32.Parse()
200 times.
You can improve this by pulling the attributes once instead of on each iteration:
var X = XElement.Parse (@"
<ROOT>
<MUL v='2' />
<MUL v='3' />
</ROOT>
")
.Elements ()
.Select (t => Int32.Parse (t.Attribute ("v").Value))
.ToList ();
Enumerable.Range (1, 100)
.Select (s => x.Aggregate (s, (t, u) => t * u))
.ToList ()
.ForEach (s => Console.WriteLine (s));
Now Int32.Parse()
is only called 2 times. However, the cost is that a list of attribute values have to be allocated, stored and eventually garbage collected. (Not a big concern when the list contains two elements.)
Note that if you forget the first ToList()
that pulls the attributes the code will still run but with the exact same performance characteristics as the original code. No space is used to store the attributes but they are parsed on each iteration.