Finding duplicate values in Solr

Is there a way to construct a query that identifies and returns documents where a particular field is duplicated? What I am looking for is the equivalent of this in SQL:

SELECT content, count(*) FROM documents GROUP BY content ORDER BY count(*) DESC

Basically, give me the documents where the content is the same. Everything I have found discusses how to toss out duplicates upon insertion, or how to get rid of them in the search results. I tried using FieldCollapsing, but I get the following error:

"error": {
  "msg": "can not use FieldCache on a field which is neither indexed nor has doc values: content",
  "code": 400
}

Based on the error, I assumed it failed because content isn't indexed. I tried playing with the grouping on another field that is indexed and not multi-valued, which stores the document's URL, but I can't make heads or tails of the resulting groups, especially the groupValue. I could create a copy field that is indexed, but I am not sure that would give me what I am looking for, and my crawler takes more than 24 hours to crawl.

Tags: solr solr4

3 Answers

你好瞎i

Reply #2 · 2019-09-17 09:26

This can very easily be done in Solr.

First off, make sure your schema.xml is squared away: the field you will perform this operation on needs to be stored and indexed. The type of the field should be string (this keeps the data as is, without tokenization).

Next, index your content and run a query against it. Assuming the field name is field1:

q=*:*&facet=true&facet.field=field1&facet.mincount=1

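The response lists every distinct value of field1 along with the number of documents that share it. Any value with a count greater than 1 is a duplicate; setting facet.mincount=2 instead returns only the duplicated values.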
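As a rough sketch of how the facet response could be consumed from a client, here is some Python that builds the query string and pulls out the duplicated values. It assumes Solr's default JSON layout, where each facet field is a flat list alternating value and count; the helper names and the sample response are made up for illustration, and nothing here contacts a real Solr instance.

```python
from urllib.parse import urlencode


def build_facet_query(field, mincount=2):
    """Return the query string for a facet request on `field` (hypothetical helper)."""
    params = {
        "q": "*:*",                 # match every document
        "rows": 0,                  # only the facet counts are needed
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": mincount,
        "facet.limit": -1,          # no cap on the number of facet values
    }
    return urlencode(params)


def duplicates_from_response(response, field):
    """Turn the flat [value, count, value, count, ...] list into {value: count}."""
    flat = response["facet_counts"]["facet_fields"][field]
    pairs = zip(flat[0::2], flat[1::2])
    return {value: count for value, count in pairs if count > 1}


# Made-up response body, shaped like Solr's default JSON facet output.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "field1": ["hello world", 3, "foo", 2, "bar", 1]
        }
    }
}

print(build_facet_query("field1"))
print(duplicates_from_response(sample_response, "field1"))
```

The query string would be appended to your core's select handler URL, which depends on your setup.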


Rolldiameter

Reply #3 · 2019-09-17 09:43

Using facets will yield the required results. First, index the content field in Solr with an appropriate definition, e.g.:

<field indexed="true" multiValued="false" name="content" stored="true" type="string_ci"/>

where the type string_ci is defined as follows:

<fieldType class="solr.TextField" name="string_ci" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

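For the facet query, try the following:

q=*:*&facet=true&facet.field=content&facet.mincount=2&facet.sort=count

Here facet.mincount=2 limits the output to values that occur in more than one document, and facet.sort=count lists the most frequent values first.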

Facet documentation: https://wiki.apache.org/solr/SimpleFacetParameters
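To see why the string_ci type matters for duplicate detection: the keyword tokenizer keeps the whole value as a single token, and the lowercase and trim filters then make values that differ only in case or surrounding whitespace fall into the same facet bucket. The Python below is only a rough imitation of that analysis chain (it is not Solr code, and Python's strip() may treat some whitespace characters differently from Solr's trim filter); the sample values are invented.

```python
from collections import Counter


def normalize(value):
    """Imitate the analyzer: whole value as one token, lowercased, then trimmed."""
    return value.lower().strip()


def facet_counts(values, mincount=2):
    """Count normalized values and keep those seen at least `mincount` times,
    most frequent first (similar in spirit to facet.sort=count)."""
    counts = Counter(normalize(v) for v in values)
    return [(term, n) for term, n in counts.most_common() if n >= mincount]


# Invented sample content values.
contents = ["Hello World", " hello world ", "HELLO WORLD", "something else"]

print(facet_counts(contents))
```

With a plain string type and no analysis, those first three values would be counted as three distinct terms.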


倾城　Initia

Reply #4 · 2019-09-17 09:48

If you are trying to find duplicates in near-unique, high-cardinality fields, facets can be combined with a terms query:

q={!terms f=partid}partid1,partid2..N&facet=true&facet.field=partid&facet.limit=N&facet.mincount=2
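A small sketch of how those parameters might be assembled for a known batch of values, so that only the values you are checking get faceted. The helper name and the part IDs are hypothetical, and it assumes the values contain no commas, since the terms parser splits on commas by default.

```python
from urllib.parse import urlencode


def build_terms_facet_params(field, values):
    """Build request parameters for a terms query plus facet on the same field."""
    return {
        "q": "{!terms f=%s}%s" % (field, ",".join(values)),
        "rows": 0,                    # only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.limit": len(values),   # at most one bucket per requested value
        "facet.mincount": 2,          # keep only values shared by 2+ documents
    }


# Hypothetical part IDs to check for duplicates.
part_ids = ["P-100", "P-200", "P-300"]
params = build_terms_facet_params("partid", part_ids)
query_string = urlencode(params)  # URL-encodes the braces, "!" and commas

print(params["q"])
print(query_string)
```

Checking the values in batches like this keeps the facet small on a high-cardinality field, at the cost of one request per batch.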

