Solr: Using Regex fragmenter to extract paragraphs

Posted 2020-04-16 04:34

I posted this message to the Solr mailing list, but I'm trying here too in case there's a Solr expert lurking around.

I am trying to use the regex fragmenter and am having a hard time getting the results I want. I want fragments that start on a word character and end on punctuation, but the fragments being returned seem very inflexible, even though I've provided a large slop. Here are the relevant parameters I'm using; maybe someone can point out where I've gone wrong:

<str name="hl.fragsize">500</str>
<str name="hl.fragmenter">regex</str>
<str name="hl.regex.slop">0.8</str>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="q">chinese</str>

This should be matching between 400-600 characters, beginning with a word character and ending with one of .!?. Here is an example of a typical result:
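
For reference, here is a rough sketch of how the slop and the fragment size interact. It assumes the behaviour the Solr highlighting documentation describes for hl.regex.slop: a factor by which a fragment may stray from hl.fragsize. The helper name slop_window is hypothetical, and this is plain Python arithmetic, not Solr code:

```python
def slop_window(fragsize, slop):
    """Return the (shortest, longest) fragment length the slop allows.

    Assumption: the window is fragsize * (1 - slop) .. fragsize * (1 + slop),
    as described in the Solr highlighting documentation.
    """
    shortest = round(fragsize * (1 - slop))
    longest = round(fragsize * (1 + slop))
    return shortest, longest


# Settings from the question: hl.fragsize=500, hl.regex.slop=0.8
print(slop_window(500, 0.8))
```

Under that assumption, the settings above allow fragments of roughly 100 to 900 characters, so the 400-600 range in the pattern fits comfortably inside the window.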

. Check these pictures out. Nine panda cubs on display for the first time Thursday in southwest China. They're less than a year old. They just recently stopped nursing. There are only 1,600 of these guys left in the mountain forests of central China, another 120 in Chinese breeding facilities and zoos. And they're about 20 that live outside China in zoos. They exist almost entirely on bamboo. They can live to be 30 years old. And these little guys will eventually get much bigger. They'll grow

As you can see, it starts with a period and ends on a word character! It's almost as if the fragments are coming out however they will and the regex isn't doing anything at all, yet the results are different when I use the gap fragmenter. In the above result I don't see any reason why it shouldn't have stripped out the preceding period and the last two words; there is plenty of room in the slop and in the regex pattern. Please help me figure out what I'm doing wrong...

Thanks a lot,

Mark

3 Answers
仙女界的扛把子
Post #2 · 2020-04-16 05:01

There seems to be a problem if you are using a WordDelimiterFilterFactory. The problem is described here http://www.mail-archive.com/solr-user@lucene.apache.org/msg30631.html

As described in the link above, one solution might be to add preserveOriginal="1" to your WordDelimiterFilterFactory. I tried this and it worked for me. However, being new to Solr, I don't know whether there are any drawbacks to this approach (apart from increasing the index size).

Emotional °昔
Post #3 · 2020-04-16 05:06

I've never heard of the tool you're working with (Solr), but the quantifiers in your regular expression are definitely wrong. This regex will match between 402 and 602 characters, where the first is a word character, and the last is one of three punctuation characters:

\w.{400,600}[.!?]

The dot and question mark are not metacharacters inside a character class, so there's no point escaping them. \w can stand on its own.
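
As a quick check of the two claims above, here is a minimal sketch using Python's re module as a stand-in engine. That is an assumption: Solr runs on Java, and its regex engine may treat the stacked quantifier differently. It shows that Python rejects `.*{400,600}` outright and that the corrected pattern accepts exactly 402 to 602 characters:

```python
import re

# The pattern from the question stacks {400,600} directly on top of '*'.
# Python's re module refuses to compile this; other engines may differ.
try:
    re.compile(r"[\w].*{400,600}[.!?]")
    stacked_ok = True
except re.error:
    stacked_ok = False

fixed = re.compile(r"\w.{400,600}[.!?]")

# One word character + N filler characters + one punctuation mark.
too_short = "a" + "x" * 399 + "."  # 401 characters
shortest = "a" + "x" * 400 + "."   # 402 characters
longest = "a" + "x" * 600 + "."    # 602 characters
too_long = "a" + "x" * 601 + "."   # 603 characters

print("stacked quantifier compiles:", stacked_ok)
for name, text in [("too_short", too_short), ("shortest", shortest),
                   ("longest", longest), ("too_long", too_long)]:
    print(name, len(text), fixed.fullmatch(text) is not None)
```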

Since the dot also matches the 3 punctuation characters, your regex will match as many characters as possible (up to 602), and then give back to make sure the last one is one of your 3 punctuation characters.

If you want to prioritize shorter runs, use a lazy quantifier:

\w.{400,600}?[.!?]
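
To illustrate the difference, here is a small sketch, again in Python's re module rather than Solr itself. The sample text is made up and consists of many short sentences, similar in shape to the fragment in the question:

```python
import re

# Made-up sample: 40 short sentences, each ending in a period.
text = "This is a short sentence. " * 40

greedy = re.search(r"\w.{400,600}[.!?]", text).group()
lazy = re.search(r"\w.{400,600}?[.!?]", text).group()

# Greedy stops at the LAST punctuation mark inside the 402-602 window;
# lazy stops at the FIRST punctuation mark once 402 characters are reached.
print("greedy length:", len(greedy))
print("lazy length:  ", len(lazy))
```

Both matches start on a word character and end on a period; the lazy version simply settles for the earliest acceptable ending.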

If you want your regex to match only one sentence, use a negated character class:

\w[^.!?]{400,600}[.!?]
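
One consequence worth spelling out: with the negated class, the single sentence itself has to be 402 to 602 characters long, so text made of short sentences produces no match at all. A minimal sketch in Python's re module, with made-up sample strings:

```python
import re

one_sentence = re.compile(r"\w[^.!?]{400,600}[.!?]")

# Many short sentences: there is never a run of 400 characters
# without punctuation, so the pattern cannot match anywhere.
short_sentences = "They can live to be 30 years old. " * 30

# A single 454-character sentence with no internal punctuation.
long_sentence = "word " * 90 + "end."

print(one_sentence.search(short_sentences))              # no match object
print(one_sentence.search(long_sentence) is not None)    # a match exists
```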

All of the above assumes that Solr uses Perl-style regular expressions. Things like \w and {400,600} don't work in all regex flavors.

We Are One
Post #4 · 2020-04-16 05:21

Try:

\w[^\.!\?]{400,600}[\.!\?]

You should not need the square brackets around the first \w.

Escaping the final dot is harmless, although a dot is already literal inside a character class.

I do not think .* immediately before another quantifier ({400,600}) is a good idea, hence the single {400,600}.

? is a special character outside a character class, so I escaped it as well; inside the brackets the backslash is optional.

And since . matches anything, including your ending characters, you should rather use [^\.!\?] to match anything but those characters.
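
On the escaping question, a short sketch in Python's re module (an assumption; Solr's own engine is not exercised here) suggests that the escaped and unescaped character classes behave identically, so the backslashes are a matter of taste:

```python
import re

escaped = re.compile(r"[\.!\?]")
plain = re.compile(r"[.!?]")

sample = "Wait. What? Yes! a-b_c"

# Both classes pick out exactly the three punctuation marks.
print(escaped.findall(sample))
print(plain.findall(sample))

# Inside a class the dot is literal: it does not match ordinary letters.
print(plain.search("abc"))
```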
