I want to create a full-text search index for HTML content (to be more specific: EPUB chapters in XHTML format). Like this:
...
<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>
...
The problem is that I somehow need the position of the matched text (for example, as an XPath) along with the search results, because I need to move the reader software to the right place. I need something like the highlight feature, but instead of returning highlighted text, it should return the position where the highlight would go. So if I search for "dolor", it returns something like this:
matches:[
...
{"match":"dolor", "xpath":"//*[@id='lipsum']/p[1]/b"}
...
]
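To make the requirement concrete, here is a rough brute-force sketch in Python of the behaviour I am after, with no index involved. It uses only the standard library's xml.etree.ElementTree. The function name find_matches, the wrapping div with id "lipsum", and the plain substring match are my own assumptions for illustration, and it assumes the XHTML has no namespace declaration. It is not how Solr or ElasticSearch would do it.

```python
import xml.etree.ElementTree as ET

# Sample chapter fragment; the wrapping <div id="lipsum"> is assumed for illustration.
XHTML = (
    '<div id="lipsum">'
    "<p>Lorem ipsum <b>dolor</b> sit amet, consectetur adipiscing elit.</p>"
    "</div>"
)


def find_matches(xhtml, term):
    """Return a list of {"match", "xpath"} dicts for every text node containing term.

    Hypothetical helper: plain substring matching, no tokenizing or stemming.
    """
    root = ET.fromstring(xhtml)
    matches = []

    def record(path):
        matches.append({"match": term, "xpath": path})

    def walk(elem, path):
        # Text directly inside the element, before its first child.
        if elem.text and term in elem.text:
            record(path)
        seen = {}  # per-tag sibling counter, used for the [n] position predicate
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            # Text after the child's closing tag belongs to the parent element.
            if child.tail and term in child.tail:
                record(path)

    # Anchor the path on the root's id when it has one, as in the example above.
    root_id = root.get("id")
    start = f"//*[@id='{root_id}']" if root_id else f"/{root.tag}"
    walk(root, start)
    return matches


if __name__ == "__main__":
    print(find_matches(XHTML, "dolor"))
```

Searching for "dolor" gives a single match whose path points at the b element inside the first p, while a word such as "amet" that follows the b element is reported on the enclosing p. What I am looking for is the same kind of result, but coming out of a real search index.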
The standard approach that I found everywhere (strip the HTML markup with a filter, then tokenize, and so on) does not apply here, because it loses the position information in the first step.
Any suggestions? Is that even possible with Solr or ElasticSearch? Thanks!