Adding a multi-valued string field to a Lucene Doc

Posted 2019-03-27 11:51

I'm building a Lucene index and adding Documents to it.

I have a multi-valued field; for this example I'll use categories.

An Item can have many categories, for example, Jeans can fall under Clothing, Pants, Men's, Women's, etc.

When adding the field to a document, do commas make a difference? Will Lucene simply ignore them? If I change the commas to spaces, will there be a difference? Does this automatically make the field multi-valued?

String categoriesForItem = getCategories(); // returns "category1, category2, cat3" from a DB call

categoriesForItem = categoriesForItem.replaceAll(",", " ").trim(); // not sure whether to remove the commas

doc.add(new StringField("categories", categoriesForItem, Field.Store.YES)); // doc is a Document

Am I doing this correctly, or is there another way to create multi-valued fields?

Any help/advice is appreciated.

Tags: java lucene
2 answers
我只想做你的唯一
Reply #2 · 2019-03-27 12:01

A better way to index a multi-valued field is to add one field instance per value:

String categoriesForItem = getCategories(); // get "category1, category2, cat3" from a DB call

String[] categoriesForItems = categoriesForItem.split(",");
for (String cat : categoriesForItems) {
    // trim() drops the space left after each comma, so " category2" is indexed as "category2"
    doc.add(new StringField("categories", cat.trim(), Field.Store.YES)); // doc is a Document
}
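Because a StringField is indexed verbatim as a single term, any whitespace left around a comma becomes part of the category name. The following is a minimal plain-Java sketch of just the splitting step, with no Lucene dependency; the class name CategorySplitter and the method splitCategories are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns "category1, category2, cat3" into clean values,
// each of which would then be added as its own "categories" field.
class CategorySplitter {

    static List<String> splitCategories(String raw) {
        List<String> result = new ArrayList<>();
        if (raw == null) {
            return result;
        }
        for (String part : raw.split(",")) {
            String trimmed = part.trim();   // remove whitespace around each value
            if (!trimmed.isEmpty()) {       // skip empty entries such as "a,,b"
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String category : splitCategories("category1, category2, cat3")) {
            // In the indexing code, this is where one field per value would be added.
            System.out.println("[" + category + "]");
        }
    }
}
```

Each returned value can then be passed to its own doc.add(new StringField("categories", value, Field.Store.YES)) call, as in the loop above.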

Whenever multiple fields with the same name appear in one document, both the inverted index and term vectors will logically append the tokens of the field to one another, in the order the fields were added.

Also, during analysis, consecutive values of the same field can be separated by a position increment gap, which the analyzer supplies through getPositionIncrementGap(fieldName). As far as I know, Lucene's base Analyzer returns 0 here, so the analyzer has to override it for the gap to take effect. Let me explain why this is needed.

Suppose the field "categories" in document D1 has two values, "foo bar" and "foo baz". If you run the phrase query "bar foo", D1 should not match. This is ensured by adding an extra position increment between two values of the same field.

If you concatenate the field values yourself and rely on the analyzer to split them into tokens, "bar foo" would match D1, which would be incorrect.
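To make the effect of the gap concrete, here is a small plain-Java model of token positions in a multi-valued field. It does not use Lucene and is not how Lucene is implemented; the class PositionGapSketch, its methods, and the gap value of 100 are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model: assigns positions to whitespace-separated tokens of each value
// and checks whether a phrase occurs at consecutive positions.
class PositionGapSketch {

    // `gap` is the extra position increment inserted between two values.
    static Map<Integer, String> positions(List<String> values, int gap) {
        Map<Integer, String> byPosition = new TreeMap<>();
        int position = -1;
        boolean firstValue = true;
        for (String value : values) {
            if (!firstValue) {
                position += gap;            // jump ahead before the next value
            }
            firstValue = false;
            for (String token : value.trim().split("\\s+")) {
                position++;
                byPosition.put(position, token);
            }
        }
        return byPosition;
    }

    // True if the phrase terms appear at consecutive positions.
    static boolean matchesPhrase(Map<Integer, String> byPosition, String... phrase) {
        for (int start : byPosition.keySet()) {
            boolean allMatch = true;
            for (int i = 0; i < phrase.length; i++) {
                if (!phrase[i].equals(byPosition.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> values = List.of("foo bar", "foo baz");
        System.out.println(matchesPhrase(positions(values, 0), "bar", "foo"));
        System.out.println(matchesPhrase(positions(values, 100), "bar", "foo"));
    }
}
```

With a gap of 0 the tokens of the two values sit at adjacent positions, so "bar foo" matches across the value boundary; with a non-zero gap it does not, while a phrase inside a single value such as "foo baz" still matches.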

手持菜刀,她持情操
Reply #3 · 2019-03-27 12:01

If you use the StandardAnalyzer, either commas or spaces are fine. With another Analyzer, it depends. Note that this only applies to tokenized fields such as TextField; a StringField is indexed as a single token without analysis.

Another way: add the same field multiple times, each time with a different category in it. In that case I would recommend using KeywordAnalyzer, or leaving the field untokenized, so that category names match exactly.
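The difference between one concatenated value and several separate values in an untokenized field can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Lucene code; ExactTermSketch and its methods are hypothetical names. It assumes that each untokenized value becomes exactly one term and that a lookup is an exact, case-sensitive comparison.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of an untokenized field: every value is stored as one term, unchanged.
class ExactTermSketch {

    static Set<String> untokenizedTerms(List<String> fieldValues) {
        return new HashSet<>(fieldValues);
    }

    // An exact term lookup only succeeds on an identical string.
    static boolean matchesTerm(Set<String> terms, String query) {
        return terms.contains(query);
    }

    public static void main(String[] args) {
        // One field holding the whole concatenated string.
        Set<String> single = untokenizedTerms(List.of("category1 category2 cat3"));
        // One field instance per category.
        Set<String> multi = untokenizedTerms(List.of("category1", "category2", "cat3"));

        System.out.println(matchesTerm(single, "category2"));
        System.out.println(matchesTerm(multi, "category2"));
    }
}
```

In this model the concatenated string is a single term, so a lookup for one category finds nothing, whereas the per-value variant matches each category name exactly.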
