How to index HTML content, keeping positions (as xpath)

Published 2019-08-10 09:51

Question:

I want to create a full-text search index for HTML content (to be more specific: EPUB chapters in XHTML format). Like this:

...
<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>
...

The problem is that I need the matched text's position (e.g. as an xpath) along with the search results, because I need to jump the reader software to the right place. I need something like the highlight feature, but instead of returning highlighted text, it should return the where-to-highlight position of each match. So if I search for "dolor", it should give back something like this:

matches:[
...
  {"match":"dolor", "xpath":"//*[@id='lipsum']/p[1]/b"}
...
]

The standard approach (what I found everywhere) of stripping HTML tags with a filter, then tokenizing, etc., does not apply here, because the position information is lost in the very first step.
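To illustrate what "keeping positions" would require, here is a minimal sketch (not a Solr/Elasticsearch integration) that extracts (xpath, text) pairs from the XHTML before any tag stripping, using Python's standard `xml.etree.ElementTree`. Each pair could then be indexed as its own document with the xpath stored alongside the text; the function name and the sample markup are my own for illustration:

```python
import xml.etree.ElementTree as ET

def text_nodes_with_xpath(root):
    """Yield (xpath, text) pairs for every piece of text in the tree."""
    def walk(elem, path):
        counts = {}  # track sibling positions so each xpath is unambiguous
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            if child.text and child.text.strip():
                yield (child_path, child.text.strip())
            yield from walk(child, child_path)
            # Text after a child's closing tag ("tail") belongs to the parent.
            if child.tail and child.tail.strip():
                yield (path, child.tail.strip())
    root_path = f"/{root.tag}[1]"
    if root.text and root.text.strip():
        yield (root_path, root.text.strip())
    yield from walk(root, root_path)

xhtml = '<div id="lipsum"><p>Lorem ipsum <b>dolor</b> sit amet.</p></div>'
root = ET.fromstring(xhtml)
for xpath, text in text_nodes_with_xpath(root):
    print(xpath, "->", text)
```

Running this on the sample markup yields `/div[1]/p[1]/b[1] -> dolor` among the pairs, which is exactly the position information a flat strip-then-tokenize pipeline throws away.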

Any suggestions? Is this even possible with Solr or Elasticsearch? Thanks!

Answer 1:

Your question is about getting an xpath as the highlighting result for an XHTML document.

I do not know of a ready-made solution in Solr or Elasticsearch. There is something very similar in the eXtensible Text Framework (XTF), which is built on (an old version of) Lucene. In XTF you can get the highlighting as tags in the original XML file, so it should be easy to write an XSL transformation that generates the corresponding xpaths.

The main idea, in short, is to split the EPUB book into overlapping chunks and store the XML structure as special characters in the indexed and stored field. Using the highlighting information you can then reconstruct the original XML structure to find your xpaths.
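The reconversion step above can be sketched as follows: flatten the XHTML into plain text while recording which character ranges came from which element, then map a match offset back to an xpath. This is a minimal illustration in Python using only the standard library; the offset-based match input is an assumption, since real Solr/Elasticsearch highlighters return fragments that you would first have to locate in the flattened text:

```python
import bisect
import xml.etree.ElementTree as ET

def flatten_with_offsets(root):
    """Return (plain_text, starts, xpaths): starts[i] is the character
    offset in plain_text where the text governed by xpaths[i] begins."""
    parts, starts, xpaths = [], [], []
    pos = 0

    def record(text, path):
        nonlocal pos
        if text:
            parts.append(text)
            starts.append(pos)
            xpaths.append(path)
            pos += len(text)

    def walk(elem, path):
        record(elem.text, path)
        counts = {}  # sibling positions keep the xpaths unambiguous
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            record(child.tail, path)  # tail text belongs to the parent

    walk(root, f"/{root.tag}[1]")
    return "".join(parts), starts, xpaths

def xpath_at(offset, starts, xpaths):
    """Find the xpath of the element whose text covers this offset."""
    return xpaths[bisect.bisect_right(starts, offset) - 1]

xhtml = '<div id="lipsum"><p>Lorem ipsum <b>dolor</b> sit amet.</p></div>'
text, starts, xpaths = flatten_with_offsets(ET.fromstring(xhtml))
match = text.find("dolor")  # pretend the highlighter reported this offset
print(xpath_at(match, starts, xpaths))  # -> /div[1]/p[1]/b[1]
```

Overlapping chunks would add a per-chunk base offset on top of this, but the lookup idea stays the same: structure is kept in a side table rather than stripped away, so any match position can be converted back to an xpath.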