Parse local HTML file

Posted 2019-01-07 21:56

Question:

I can use PowerShell to parse an HTML page:

PS > $foo = Invoke-WebRequest http://example.com

PS > $foo.Links.Count
1

However, if I download the page

PS > Invoke-WebRequest -OutFile example.htm http://example.com

and then try to parse the downloaded page, it gives an unexpected result:

PS > $foo = Invoke-WebRequest file://$pwd/example.htm

PS > $foo.Links.Count
0

How can I parse the local downloaded page?

Answer 1:

It appears that Invoke-WebRequest loads file protocol URIs just fine, but fails to parse them even in PowerShell 4.0 (where it is officially supported).

An alternative that does not require setting up a website would be to load and parse HTML directly into MSHTML.

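# Create an MSHTML document object via COM
$html = New-Object -ComObject "HTMLFile";
# Read the whole file as a single string (-Raw requires PowerShell 3.0 or later)
$source = Get-Content -Path "file.html" -Raw;
# Feed the markup to the parser
$html.IHTMLDocument2_write($source);

# Number of links found in the document
$html.links.length;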

Note that when I tested this, a single

<meta http-equiv="X-UA-Compatible" content="IE=edge" />

header prevented my HTML from parsing and I have no idea why -- the document had similar XHTML-style headers and MSHTML had no issues with those.
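To go one step further than counting, the MSHTML approach above can be extended to list each link target. This is only a sketch: the file name example.htm is taken from the question, and the catch branch is a commonly reported workaround for machines where IHTMLDocument2_write is not exposed on the COM object, so treat it as an assumption rather than documented behaviour.

```powershell
# Sketch only: assumes example.htm exists in the current directory.
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content -Path "example.htm" -Raw

try {
    # Same call as in the answer above
    $html.IHTMLDocument2_write($source)
}
catch {
    # Assumed fallback: some systems only expose write(), which takes a byte array
    $html.write([System.Text.Encoding]::Unicode.GetBytes($source))
}

# Print the target of every anchor MSHTML found
$html.links | ForEach-Object { $_.href }
```

Because the markup is loaded without a base URL, relative links may come back with an about: prefix instead of the original site address.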



Answer 2:

You can serve the file from a local web server to get around this limitation of Invoke-WebRequest:

PS > $foo = Invoke-WebRequest http://localhost:8080/example.htm

PS > $foo.Links.Count
1

Note that this works even with no network connection. For example, with the network down, a request to the remote site fails:

PS > Invoke-WebRequest http://example.com
Invoke-WebRequest : The remote name could not be resolved: 'example.com'
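
The answer does not say which web server to use. One possible way to get one without installing anything is the .NET System.Net.HttpListener class, sketched below. The port 8080 matches the URL used above; the file name example.htm and the choice to answer only a single request are assumptions made to keep the sketch short. Run it in a second PowerShell window, because GetContext blocks until a request arrives.

```powershell
# Sketch only: serves example.htm from the current directory for ONE request, then exits.
$listener = New-Object System.Net.HttpListener
$listener.Prefixes.Add('http://localhost:8080/')
$listener.Start()

try {
    # Blocks until a client (e.g. Invoke-WebRequest in another window) connects
    $context = $listener.GetContext()

    # The requested path is ignored; the same file is returned for any URL
    $bytes = [System.IO.File]::ReadAllBytes((Join-Path $pwd 'example.htm'))
    $context.Response.ContentType = 'text/html'
    $context.Response.ContentLength64 = $bytes.Length
    $context.Response.OutputStream.Write($bytes, 0, $bytes.Length)
    $context.Response.Close()
}
finally {
    $listener.Stop()
}
```

With the listener running, the Invoke-WebRequest http://localhost:8080/example.htm call from the first window should receive the file over HTTP. On some systems, registering the prefix may require an elevated prompt.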


Answer 3:

Use the file URI format (note the three slashes after file:):

$foo = Invoke-WebRequest "file:///<path-to-file>"
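
Rather than typing the URI by hand, it can be built from the local path with the .NET System.Uri class. This is a sketch that assumes example.htm is in the current directory. Whether Links is then populated may still depend on the PowerShell version, as Answer 1 reports.

```powershell
# Sketch only: assumes example.htm exists in the current directory.
# Resolve the relative name to a full file system path
$path = (Resolve-Path 'example.htm').ProviderPath

# System.Uri turns a local Windows path into a file:/// URI with forward slashes
$uri = ([System.Uri]$path).AbsoluteUri

$foo = Invoke-WebRequest $uri
$foo.Links.Count
```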