Is there a way to construct a query that will identify/return documents where a particular field is duplicated? What I am looking for is the equivalent of this in SQL:
SELECT content, count(*) FROM documents GROUP BY content ORDER BY count(*) DESC
Basically, give me the documents where the content is the same. Everything I have found discusses how to toss out duplicates upon insertion, or how to get rid of them in the search results. I tried using FieldCollapsing, but I get the following error:
"error": {
"msg": "can not use FieldCache on a field which is neither indexed nor has doc values: content",
"code": 400
}
Based on the error, I assumed it failed because content isn't indexed. I tried playing with the grouping using another field, one that is indexed, not multi-valued, and stores the document's URL, but I can't make heads or tails of the resulting groups, especially the groupValue. I can try to create a copy-field that is indexed, but I am not sure if this will give me what I am looking for, and my crawler takes more than 24 hours to crawl.
This can very easily be done in Solr.
First off, make sure your schema.xml is squared away: the field you will perform this operation on needs to be stored and indexed. The type of the field should be string (this keeps the data as is, without tokenization).
Next, index your content and run a facet query against it. Assuming the field name is field1:
q=*:*&facet=true&facet.field=field1&facet.mincount=1
The response lists every distinct value of field1 along with the number of documents that contain it. Any value with a count greater than 1 is a duplicate.
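As a rough sketch of how a client might use this, the Python below builds the facet parameters and pulls the duplicated values out of a JSON response. It assumes Solr's default flat facet layout (alternating value, count entries). The helper names and the sample response are hypothetical, and rows=0, facet.limit=-1 and facet.mincount=2 are additions not present in the query above.

```python
from urllib.parse import urlencode


def build_facet_query(field, min_count=2):
    """Return a URL-encoded query string for a facet request on `field`."""
    params = {
        "q": "*:*",
        "rows": 0,                    # only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.mincount": min_count,  # 2 keeps only values seen more than once
        "facet.limit": -1,            # return all values instead of the default 100
        "wt": "json",
    }
    return urlencode(params)


def duplicates_from_response(response, field):
    """Extract {value: count} for values that occur more than once.

    Assumes the flat layout [value1, count1, value2, count2, ...].
    """
    flat = response["facet_counts"]["facet_fields"][field]
    pairs = zip(flat[0::2], flat[1::2])
    return {value: count for value, count in pairs if count > 1}


# Hypothetical response fragment, for illustration only.
sample = {
    "facet_counts": {
        "facet_fields": {"field1": ["hello world", 3, "foo", 2, "bar", 1]}
    }
}

print(build_facet_query("field1"))
print(duplicates_from_response(sample, "field1"))
```

The query string would be appended to the select handler URL of your own core; that URL is not shown here because it depends on your installation.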
Using facets will yield the required results. First you need to index your content field in Solr with an appropriate definition, e.g.:
<field indexed="true" multiValued="false" name="content" stored="true" type="string_ci"/>
where the type is mapped as follows:
<fieldType class="solr.TextField" name="string_ci" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
and for the facet query try the following (facet.mincount=2 restricts the output to values that occur more than once):
q=*:*&facet=true&facet.field=content&facet.mincount=2&facet.sort=count
Facet documentation: https://wiki.apache.org/solr/SimpleFacetParameters
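To see why a case-insensitive, trimmed string type matters for duplicate counting, here is a small Python approximation of what the string_ci analyzer does to each value. This is only an illustration: normalize is a hypothetical stand-in for the keyword tokenizer plus the lowercase and trim filters, and Python's lower() may differ from Lucene's lowercasing for some Unicode characters.

```python
from collections import Counter


def normalize(value):
    # Treat the whole value as one token, then lowercase and trim it,
    # roughly what KeywordTokenizer + LowerCaseFilter + TrimFilter produce.
    return value.lower().strip()


def count_normalized(values):
    """Count how often each normalized value occurs."""
    return Counter(normalize(v) for v in values)


# Made-up sample values, for illustration only.
contents = ["Hello World", "hello world ", "HELLO WORLD", "Something else"]
counts = count_normalized(contents)
duplicates = {value: n for value, n in counts.items() if n >= 2}
print(duplicates)  # {'hello world': 3}
```

With a plain string type, the three variants would be faceted as three separate values with a count of 1 each.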
If you are trying to find duplicates in a near-unique, high-cardinality field, facets can be combined with a terms query:
q={!terms f=partid}partid1,partid2..N&facet=true&facet.field=partid&facet.limit=N&facet.mincount=2
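A minimal sketch of assembling that request from a list of candidate ids, assuming the values contain no commas (the terms parser's default separator). The function name and the sample ids are hypothetical, and rows=0 is an addition to skip returning the documents themselves.

```python
from urllib.parse import urlencode, parse_qs


def build_terms_facet_query(field, values):
    """Return a URL-encoded query string that facets only on the given values."""
    params = {
        "q": "{!terms f=%s}%s" % (field, ",".join(values)),
        "rows": 0,                    # facet counts only
        "facet": "true",
        "facet.field": field,
        "facet.limit": len(values),   # at most one bucket per candidate value
        "facet.mincount": 2,          # keep only values seen more than once
    }
    return urlencode(params)


# Made-up candidate ids, for illustration only.
query_string = build_terms_facet_query("partid", ["A-100", "A-101", "B-200"])
print(query_string)

# Decoding the string again shows the raw q parameter Solr would receive.
print(parse_qs(query_string)["q"][0])
```

Restricting the query to known candidates keeps the facet computation small, which is the point of this approach on fields where nearly every value is unique.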