I have many HTML files from which I need to extract text. If it's all on one line, I can do that quite easily, but if the tag wraps around or is on multiple lines I can't figure out how to do this. Here's what I mean:
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
I'm not concerned about the <br> text, unless it will help wrap the text around. The area that I want always begins with "MySection" and ends with </section>. What I'd like to end up with is something like this:
Some text here another line here last line of text.
I'd prefer something like a vbscript or command line option (sed?) but I'm not sure where to begin. Any help?
Normally you'd use the Internet Explorer COM object for this:
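root = "C:\base\dir"
' fso is needed below to enumerate the files in root
Set fso = CreateObject("Scripting.FileSystemObject")
Set ie = CreateObject("InternetExplorer.Application")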
For Each f In fso.GetFolder(root).Files
ie.Navigate "file:///" & f.Path
While ie.Busy : WScript.Sleep 100 : Wend
text = ie.document.getElementById("MySection").innerText
WScript.Echo Replace(text, vbNewLine, "")
Next
However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, as getElementById("MySection") only returns the opening tag:
>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml
<SECTION id=MySection>
You could use a regular expression instead, though:
root = "C:\base\dir"
Set fso = CreateObject("Scripting.FileSystemObject")
' re1 captures everything between the opening and closing tag, across line breaks
Set re1 = New RegExp
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>"
re1.Global = False
re1.IgnoreCase = True
' re2 collapses runs of <br> tags and whitespace into a single space
Set re2 = New RegExp
re2.Pattern = "(<br>|\s)+"
re2.Global = True
re2.IgnoreCase = True
For Each f In fso.GetFolder(root).Files
  html = fso.OpenTextFile(f.Path).ReadAll
  Set m = re1.Execute(html)
  If m.Count > 0 Then
    ' SubMatches(0) is the captured group of the first match (already a string)
    text = Trim(re2.Replace(m(0).SubMatches(0), " "))
    WScript.Echo text
  End If
Next
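Since the question also asks for a command-line option, the same idea (take everything between the two tags, drop any leftover tags such as <br>, squeeze the whitespace) can be sketched with POSIX awk. Treat this as a rough sketch only: the function name extract_section is made up for illustration, the opening tag is matched literally and case-sensitively as <section id="MySection">, and like the regular expression above this is plain string matching, not real HTML parsing.

```shell
#!/bin/sh
# extract_section FILE
# Hypothetical helper: prints the text found between <section id="MySection">
# and the next </section> in FILE as a single line.
# Returns a non-zero status if the opening tag is not found.
extract_section() {
    awk -v otag='<section id="MySection">' -v ctag='</section>' '
        # Slurp the whole file into one string, turning line breaks into spaces.
        { buf = buf $0 " " }
        END {
            start = index(buf, otag)
            if (start == 0) exit 1
            # Keep only what follows the opening tag ...
            buf = substr(buf, start + length(otag))
            # ... and precedes the first closing tag.
            stop = index(buf, ctag)
            if (stop > 0) buf = substr(buf, 1, stop - 1)
            gsub(/<[^>]*>/, " ", buf)    # replace remaining tags (e.g. <br>) with a space
            gsub(/[ \t\r]+/, " ", buf)   # squeeze runs of whitespace
            sub(/^ /, "", buf)           # trim leading space
            sub(/ $/, "", buf)           # trim trailing space
            print buf
        }
    ' "$1"
}
```

To run it over many files, a loop such as for f in /path/to/files/*.html; do extract_section "$f"; done should work, where the path is a placeholder for your own directory.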
Here is a one-liner solution using perl and an HTML parser from the Mojolicious framework:
perl -MMojo::DOM -E '
say Mojo::DOM->new( do { undef $/; <> } )->at( q|#MySection| )->text
' index.html
Assuming index.html with the following content:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
</head>
<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>
</html>
It yields:
Some text here another line here last line of text.