Extract text between two links in HTML through Java

Posted 2019-06-07 13:36

I am trying to retrieve the text data from an ePub file using Java. The text of the ePub file lies within an HTML file that is formatted something like this:

<h2 id="pgepubid00001">Chapter I</h2>

<p>Some text</p>
<p>Another line of Text</p>

<br/>

<h2 id="pgepubid00002">Chapter II</h2>

etc..

Before opening this file I already know the id of the chapter I need to extract, and I can find the id of the next chapter too. Because of this, I thought a logical approach would be to run the file through a SAX parser and extract the text of each paragraph until I reach the link of the next chapter. But this is proving quite a task.

Of course, everything is dynamic so there is no set link to go to etc. The HTML is semi-strictly formatted so I didn't expect parsing to be so much of a problem. Can anyone recommend a good way to extract the text needed?

The solution needs to be Java only; no other languages can be used. I am looking to implement this on an Android device.

Tags: java android xml parsing epub

1 answer

爱情/是我丢掉的垃圾

#2 · 2019-06-07 14:12

Well, you know the ids of the chapters, so why not use String.indexOf?

int start = text.indexOf("<h2 id=\"pgepubid00001\">");
int end = text.indexOf("<h2 id=\"pgepubid00002\">", start);

// Java's substring takes (beginIndex, endIndex), not (start, length)
String whatYoureLookingFor = text.substring(start, end);

Keep it simple.
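
For completeness, here is a minimal sketch of the same indexOf idea wrapped in a helper. It assumes the whole file has already been read into a String and that each chapter heading appears exactly in the form <h2 id="...">, as in the question. The class name ChapterExtractor and the method extractChapter are hypothetical, and the tag stripping is a crude regex pass that only suits simple markup like the sample, not a real HTML parser.

```java
class ChapterExtractor {

    /**
     * Returns the plain text from the heading with id startId up to (but not
     * including) the heading with id endId, or null if startId is not found.
     * If endId is not found, everything to the end of the input is returned.
     */
    static String extractChapter(String html, String startId, String endId) {
        String startTag = "<h2 id=\"" + startId + "\">";
        String endTag = "<h2 id=\"" + endId + "\">";

        int start = html.indexOf(startTag);
        if (start < 0) {
            return null; // chapter heading not present
        }

        // Search for the next heading only after the current one.
        int end = html.indexOf(endTag, start + startTag.length());
        if (end < 0) {
            end = html.length(); // assume this is the last chapter
        }

        String fragment = html.substring(start, end);

        return fragment
                // turn block-level closing tags and <br/> into line breaks
                .replaceAll("(?i)</(p|h2)>|<br\\s*/?>", "\n")
                // drop every remaining tag
                .replaceAll("<[^>]+>", "")
                // collapse runs of blank lines into a single line break
                .replaceAll("\\n\\s*\\n+", "\n")
                .trim();
    }
}
```

Called as ChapterExtractor.extractChapter(text, "pgepubid00001", "pgepubid00002"), this returns the chapter title followed by one paragraph per line. Character entities such as &amp; are left undecoded, so anything beyond markup this simple would need a proper parser.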
