I have an HTML file:
<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;">
<div>Test message.</div>
<div> </div>
<div>More content here...</div>
<div> </div>
<div>Best regards,</div>
<div>Mr. Crowley</div></div></body></html>
I am trying to extract the content of this file using Apache Tika...
final InputStream input = new FileInputStream("file.html");
final ContentHandler handler = new BodyContentHandler();
final Metadata metadata = new Metadata();
final HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
...and everything works, except that the output contains extra line breaks:
Test message.
More content here...
Best regards,
Mr. Crowley
<and 3 empty lines here>
Is it possible to avoid this behavior and get the result I would expect:
Test message.
More content here...
Best regards,
Mr. Crowley
?
Post-processing the extracted text with something like
plainText = plainText.replaceAll("(\n)+", "\n");
is unfortunately not an option for me. I also cannot change the structure of my HTML file.
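For reference, here is a minimal standalone sketch of the kind of post-processing I mean and want to avoid. The class name, the helper name and the sample string are made up for illustration. The sample is only my assumption of what the extracted text roughly looks like, not actual Tika output.

```java
public class CollapseNewlines {

    // Hypothetical helper: collapses every run of consecutive line breaks
    // into a single "\n" and strips leading/trailing whitespace.
    public static String collapse(String text) {
        return text.replaceAll("(\\r?\\n)+", "\n").trim();
    }

    public static void main(String[] args) {
        // Assumed sample input with extra blank lines between and after the paragraphs.
        String sample = "Test message.\n\nMore content here...\n\n"
                + "Best regards,\nMr. Crowley\n\n\n";
        System.out.println(collapse(sample));
    }
}
```

This only collapses line breaks that directly follow each other. A line that contains nothing but spaces would still survive as a separate line.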