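Background
I am using Solr 4.0.0. I have indexed the text of a set of sample documents and enabled term vectors so that I can use the Fast Vector Highlighter.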
<field name="raw_text" type="text_en" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
For highlighting I am using the break iterator boundary scanner, configured to break on sentence boundaries.
<boundaryScanner name="breakIterator" class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
<str name="hl.bs.type">SENTENCE</str>
</lst>
</boundaryScanner>
I run a simple query:
http://localhost:8983/solr/documents/select?q=raw_text%3AArtibonite&wt=xml&hl=true&hl.fl=raw_text&hl.useFastVectorHighlighter=true&hl.snippets=100&hl.boundaryScanner=breakIterator
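For reference, the same request can be assembled programmatically. This is only a sketch: the parameter names and values are copied from the URL above, while the helper name `build_highlight_url` is hypothetical and introduced purely for illustration.

```python
from urllib.parse import urlencode

# Base select handler, taken from the query URL above.
BASE_URL = "http://localhost:8983/solr/documents/select"


def build_highlight_url(term):
    """Build the highlighting query for a single term in raw_text.

    build_highlight_url is a hypothetical helper; the parameters mirror
    the hand-written URL in this post.
    """
    params = {
        "q": "raw_text:" + term,
        "wt": "xml",
        "hl": "true",
        "hl.fl": "raw_text",
        "hl.useFastVectorHighlighter": "true",
        "hl.snippets": "100",
        "hl.boundaryScanner": "breakIterator",
    }
    # urlencode percent-escapes the colon in the q parameter.
    return BASE_URL + "?" + urlencode(params)


url = build_highlight_url("Artibonite")
```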
Highlighting works fairly well:
<response>
...
<result name="response" numFound="5" start="0">
<doc>
<str name="id">-1071691270</str>
<str name="raw_text">
Final Report of the Independent Panel of Experts on the Cholera
Outbreak in Haiti Dr. Alejando Cravioto (Chair) International
Center for Diarrhoeal Disease Research, Dhaka, Bangladesh Dr.
Claudio F. Lanata Instituto de Investigación Nutricional, and
The US Navy Medical Research Unit 6, Lima, Peru Engr. Daniele
S. Lantagne Harvard University... ~SNIP~
</str>
</doc>
<lst name="highlighting">
<lst name="-1071691270">
<arr name="raw_text">
...
<str>
The timeline suggests that the outbreak spread along
the <em>Artibonite</em> River. After establishing that
the cases began in the upper reaches of the Artibonite
River, potential sources of contamination that could have
initiated the outbreak were investigated.
</str>
...
</arr>
</lst>
</lst>
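Question
I would like to send the resulting sentences on for further processing (entity extraction, etc.), but I also want to track the start/end offsets of each highlighted sentence within the original (long) text field. Is there an easy way to do this?
Would it be better to set hl.fragsize so that the whole field is returned, and then post-process it to extract the sentences of interest?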
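To make the client-side alternative concrete, here is a rough sketch of recovering offsets without any Solr support: strip the `<em>` markers from each snippet and search for the remaining text in the stored `raw_text` value. It assumes the snippet is otherwise verbatim text from the field (no HTML escaping by the highlighter) and that only whitespace may differ. The function name `locate_snippets` is hypothetical, and this is not an API that Solr provides.

```python
import re

# Default highlight markers seen in the response above.
EM_TAG = re.compile(r"</?em>")


def locate_snippets(raw_text, snippets):
    """Return a list of (start, end, plain_snippet) tuples.

    Assumption: each snippet is a verbatim span of raw_text apart from
    the <em> tags and possible whitespace differences. Only the first
    occurrence is reported, so a sentence that repeats in the document
    is ambiguous. Snippets that cannot be found are skipped.
    """
    results = []
    for snippet in snippets:
        plain = EM_TAG.sub("", snippet).strip()
        tokens = plain.split()
        if not tokens:
            continue
        # Match any run of whitespace between tokens, since line breaks
        # in the stored text may not survive into the snippet.
        pattern = r"\s+".join(re.escape(tok) for tok in tokens)
        match = re.search(pattern, raw_text)
        if match:
            results.append((match.start(), match.end(), plain))
    return results


raw = "Intro sentence.  The outbreak spread along\nthe Artibonite River. Another sentence."
hits = locate_snippets(raw, ["The outbreak spread along the <em>Artibonite</em> River."])
```

Each tuple gives character offsets into the stored field, so `raw[start:end]` is the original sentence with its original whitespace. The obvious weakness is the first-occurrence assumption, which is why offsets reported by the highlighter itself would be preferable if they are available.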