Adding a document to the index in SOLR: Document contains at least one immense term

Posted 2019-02-16 12:43

Question:

I am adding a document to a Solr index from a Java program, but the add(inputDoc) call throws an exception. The log in the Solr web interface contains the following:

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="text" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[99, 111, 112, 101, 114, 116, 105, 110, 97, 32, 105, 110, 102, 111, 114, 109, 97, 122, 105, 111, 110, 105, 32, 113, 117, 101, 115, 116, 111, 32]...', original message: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:687)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:359)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:318)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:239)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:457)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1511)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:240)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
    ... 40 more
Caused by: org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException: bytes can be at most 32766 in length; got 226781
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:284)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:151)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:663)
    ... 47 more

What should I do to solve this problem?
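
A minimal SolrJ sketch that reproduces this kind of failure (CORE_NAME, the document id, and the repeated sample text are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexDocument {
        public static void main(String[] args) throws Exception {
            SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/CORE_NAME").build();

            // Build a field value longer than 32766 bytes, like the one
            // reported in the exception (226781 bytes).
            StringBuilder text = new StringBuilder();
            while (text.length() < 40000) {
                text.append("copertina informazioni questo ");
            }

            SolrInputDocument inputDoc = new SolrInputDocument();
            inputDoc.addField("id", "doc-1");
            inputDoc.addField("text", text.toString());

            // Throws if the field's analysis produces a single term
            // longer than 32766 bytes (e.g. when "text" is a StrField).
            client.add(inputDoc);
            client.commit();
            client.close();
        }
    }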

Answer 1:

I had the same problem and finally solved it. Check the type of your "text" field; I suspect it is "strings".

You can find it in the managed-schema of the core:

<field name="text" type="strings"/>

Alternatively, open the Solr Admin UI at http://localhost:8983/solr/CORE_NAME/schema/fieldtypes?wt=json and search for "text". If the entry looks something like the following, your "text" field is defined with the strings type:

  {
    "name":"strings",
    "class":"solr.StrField",
    "multiValued":true,
    "sortMissingLast":true,
    "fields":["text"],
    "dynamicFields":["*_ss"]
  }

If so, this solution should work for you: change the type from "strings" to "text_general" in managed-schema. (Make sure the type of "text" in schema.xml is also "text_general".)

   <field name="text" type="text_general"/>

This will solve the problem: strings is a string type (solr.StrField), which indexes the entire field value as a single term, so a long value exceeds the 32766-byte term limit; text_general is a tokenized text type, which breaks the value into individual words.
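
For reference, the stock text_general definition in a default managed-schema looks roughly like the following (details vary between Solr versions). Because StandardTokenizerFactory splits the value into individual words, no single indexed term comes anywhere near the 32766-byte limit:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>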



Answer 2:

You have probably run into what is described in LUCENE-5472 [1]: Lucene throws an error when a single term is too long. You could:

  • use, in the index analyzer, a LengthFilterFactory [2] to filter out tokens that don't fall within a requested length range (see the analyzer sketch after this list)

  • use, in the index analyzer, a TruncateTokenFilterFactory [3] to cap the maximum length of indexed tokens (also shown in the sketch below)

  • use a custom UpdateRequestProcessor [4], though this really depends on your context (a built-in alternative is sketched after the references)
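
A minimal sketch of the first two options as an index analyzer chain (the field type name text_safe and the 255-character limit are illustrative, not prescribed):

    <fieldType name="text_safe" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- option 1: silently drop tokens outside the accepted length range -->
        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
        <!-- option 2 (alternative): keep only the first 255 characters of each token -->
        <!-- <filter class="solr.TruncateTokenFilterFactory" prefixLength="255"/> -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>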

[1] https://issues.apache.org/jira/browse/LUCENE-5472
[2] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LengthFilterFactory
[3] https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TruncateTokenFilterFactory
[4] https://wiki.apache.org/solr/UpdateRequestProcessor
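
For the third option, note that Solr ships a built-in TruncateFieldUpdateProcessorFactory, so a custom processor is often unnecessary. A minimal solrconfig.xml sketch (the chain name and maxLength value are illustrative):

    <updateRequestProcessorChain name="truncate-long-fields">
      <processor class="solr.TruncateFieldUpdateProcessorFactory">
        <str name="fieldName">text</str>
        <int name="maxLength">10000</int>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Update requests must then select this chain, for example with the request parameter update.chain=truncate-long-fields.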



Tags: solr