I wrote some screen-scraping code in PowerShell and was surprised that it took around 30 seconds to parse a few HTML tables. I stripped it down to try and figure out where all the time was being spent, and it seems to be in the getElementsByTagName
calls.
I've included a script below which on both my home desktop, my work desktop and my home slate, takes around 1-2 seconds for each iteration (full results pasted below). However, other people in the PowerShell community are reporting far shorter times (only several milliseconds for each iteration).
I'm struggling to find any way of narrowing down the problem, and there doesn't seem to be a pattern to the OS/PS/.NET/IE versions.
The desktop I'm currently running it on is a brand new Windows 8 install with only PS3 and .NET 4.5 installed (and all Windows Update patches). No Visual Studio. No PowerShell profile.
$url = "http://www.icy-veins.com/restoration-shaman-wow-pve-healing-gear-loot-best-in-slot"
$response = (iwr $url).ParsedHtml
# Loop through the h2 tags
$response.body.getElementsByTagName("h2") | foreach {
# Get the table that comes after the heading
$slotTable = $_.nextSibling
# Grab the rows from the table, skipping the first row (column headers)
measure-command { $rows = $slotTable.getElementsByTagName("tr") | select -Skip 1 } | select TotalMilliseconds
}
Results from my desktop (the work PC and slate give near identical results):
TotalMilliseconds
-----------------
1575.7633
2371.5566
1073.7552
2307.8844
1779.5518
1063.9977
1588.5112
1372.4927
1248.7245
1718.3555
3283.843
2931.1616
2557.8595
1230.5093
995.2934
However, some people in the Google+ PowerShell community reported results like this:
TotalMilliseconds
-----------------
76.9098
112.6745
56.6522
140.5845
84.9599
48.6669
79.9283
73.4511
94.0683
81.4443
147.809
139.2805
111.4078
56.3881
41.3386
I've tried both PowerShell ISE and a standard console, no difference. For the work being done, these times seem kinda excessive, and judging by the posts in the Google+ community, it can go quicker!