I have many HTML files from which I need to extract text. If it's all on one line, I can do that quite easily, but if the tag wraps around or spans multiple lines I can't figure out how to do this. Here's what I mean:
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
I'm not concerned about the <br> text, unless it will help wrap the text around. The area that I want always begins with "MySection" and ends with </section>. What I'd like to end up with is something like this:
Some text here another line here last line of text.
I'd prefer something like a vbscript or command line option (sed?) but I'm not sure where to begin. Any help?
Here is a one-liner solution using perl and an HTML parser from the Mojolicious framework:
Assuming index.html with the following content:
It yields:
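For readers without Perl at hand, the same parser-based idea can be sketched with Python's standard-library html.parser module. This is not the Mojolicious one-liner; the class name SectionText and the function section_text are made up for illustration, and the sample string simply repeats the snippet from the question.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text inside <section id="..."> ... </section>."""

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0      # > 0 while we are inside the target section
        self.chunks = []    # raw text pieces found inside it

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return          # other tags such as <br> are simply skipped
        if self.depth:
            self.depth += 1  # a nested <section> inside the target
        elif dict(attrs).get("id") == self.section_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "section" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, section_id="MySection"):
    parser = SectionText(section_id)
    parser.feed(html)
    parser.close()
    # split() with no arguments collapses newlines and runs of spaces
    return " ".join("".join(parser.chunks).split())


sample = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

print(section_text(sample))
# prints: Some text here another line here last line of text.
```

Because the parser tracks the tags itself, it does not matter whether the section sits on one line or is spread over several.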
Normally you'd use the Internet Explorer COM object for this:
However, the <section> tag is not supported prior to IE 9, and even in IE 9 the COM object doesn't seem to handle it correctly, as getElementById("MySection") only returns the opening tag:
You could use a regular expression instead, though:
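As a rough sketch of the regular-expression route, here is how it could look in Python (not the VBScript this answer refers to). The names SECTION_RE, TAG_RE and extract_section are illustrative assumptions, and the pattern assumes the section contains no nested <section> elements.

```python
import re

# DOTALL lets "." match newlines, so a section spanning several lines
# is captured as a whole; the lazy (.*?) stops at the first </section>.
SECTION_RE = re.compile(
    r"""<section\b[^>]*\bid\s*=\s*["']MySection["'][^>]*>(.*?)</section\s*>""",
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining tag, e.g. <br>


def extract_section(html):
    """Return the section's text on one line, or None if it is absent."""
    match = SECTION_RE.search(html)
    if match is None:
        return None
    text = TAG_RE.sub(" ", match.group(1))  # drop inner tags
    return " ".join(text.split())           # collapse all whitespace


sample = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

print(extract_section(sample))
# prints: Some text here another line here last line of text.
```

To run this over a file, the contents can be read with pathlib.Path("index.html").read_text() and passed to extract_section. A regular expression is fine for markup this predictable, but a real HTML parser is the safer choice when the files vary.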