Possible Duplicate:
Which CPAN module would you recommend for turning HTML into plain text?
Question:
- Is there a module to render HTML, specifically to gather the text while honoring font-style tags such as <tt>, <b>, <i>, etc. and the line-break tag <br>, similar to Lynx?
For example:
# cat test.html
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
# lynx.exe --dump test.html
test
test
whatever
test
Note: the second line should be bold.
I am on Windows, so I cannot fully test this, but you can adapt the htext example script that comes with HTML::Parser.
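A minimal sketch of that idea follows. It is not the htext script itself, just an illustration of HTML::Parser's event handlers. The helper name html_to_marked_text and the choice to mark bold text with asterisks are assumptions made for this example; swap in whatever markup or terminal escape codes you prefer.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser or htext):
# gathers text, turns <br> into a newline, and wraps <b> content in asterisks.
sub html_to_marked_text {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every opening tag; 'tagname' is passed lowercased.
        start_h => [ sub {
            my ($tag) = @_;
            $out .= "\n" if $tag eq 'br';
            $out .= '*'  if $tag eq 'b';
        }, 'tagname' ],
        # Called for every closing tag.
        end_h => [ sub {
            my ($tag) = @_;
            $out .= '*' if $tag eq 'b';
        }, 'tagname' ],
        # Called for text; 'dtext' means entities are already decoded.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\s+/ /g;    # source whitespace collapses to one space
            $out .= $text;
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);     # do not split a text run across events
    $parser->parse($html);
    $parser->eof;

    $out =~ s/[ ]{2,}/ /g;         # squeeze runs of spaces
    $out =~ s/^[ ]+|[ ]+$//mg;     # trim each line
    return $out;
}

print html_to_marked_text("<tt>test<br>\n<b>test</b><br>\nwhatever<br>\ntest</tt>"), "\n";
```

Only <br> and <b> are handled here; <i>, <tt>, block-level tags and the like would need their own branches in the start and end handlers.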
Go to search.cpan.org and search for "HTML text", which will give you lots of options to suit your particular needs. HTML::FormatText is a good baseline; from there you can branch out into specific variations of it, for example HTML::FormatText::WithLinks if you want to preserve links as footnotes.
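For reference, a baseline HTML::FormatText run on the question's sample might look roughly like this. The margin values are arbitrary choices for the example, and as far as I know the plain-text formatter does not mark bold text in any way, so the "second line should be bold" requirement would still need extra work.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Sample markup taken from the question.
my $html = <<'HTML';
<body>
<div id="foo" class="blah">
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>
HTML

# Build a parse tree, then hand it to the formatter.
my $tree      = HTML::TreeBuilder->new_from_content($html);
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 60);
my $text      = $formatter->format($tree);
$tree->delete;    # free the tree explicitly

print $text;
```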
Lynx is a big program, and its HTML rendering is nontrivial to reproduce.
How about this: