Solr: indexing multiple PDF files all at once

Posted 2019-07-28 08:24

Question:

Using this command

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@maven_tutorial.pdf"

we can index a single PDF file in Solr by specifying our own id (doc1). But I want to index many PDF files into Solr all at once and let Solr keep track of the ids automatically.

Please help me.

Answer 1:

You can use a UUID-type field as the unique key. First, define the UUID field type in schema.xml:

<fieldType name="uuid" class="solr.UUIDField" indexed="true" />

Add your id field to schema.xml:

<field name="id" type="uuid" indexed="true" stored="true"  multiValued="false"/>

Make this field the unique key:

<uniqueKey>id</uniqueKey>

In solrconfig.xml, define an update chain that auto-generates the id:

<updateRequestProcessorChain name="uuid">
    <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Now attach this update chain to the request handler that extracts the content from the PDF files you submit to Solr:

<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
    <str name="update.chain">uuid</str>
  </lst>
</requestHandler>
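
With the uuid chain attached to the handler, the files can be posted one after another without `literal.id`, and each document should receive a generated id. The following is a minimal sketch of that, not part of the original answer: the function names `index_pdfs` and `commit_index`, the `SOLR_URL` variable, and the `DRY_RUN` switch are all hypothetical, and the URL is taken from the question (newer Solr versions usually need a core or collection name in the path).

```shell
#!/usr/bin/env bash
# Sketch: post every PDF in a directory to the extract handler.
# SOLR_URL, DRY_RUN, and both function names are illustrative assumptions.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/update/extract}"

index_pdfs() {
    local dir="$1" file
    for file in "$dir"/*.pdf; do
        [ -e "$file" ] || continue   # skip when the glob matches nothing
        if [ -n "${DRY_RUN:-}" ]; then
            # Print the command instead of running it
            echo "curl $SOLR_URL -F myfile=@$file"
        else
            # No literal.id here: the uuid update chain fills in the id field
            curl "$SOLR_URL" -F "myfile=@$file"
        fi
    done
}

# Issue one commit after the whole batch instead of commit=true per file.
# Assumes the plain update handler lives next to /update/extract.
commit_index() {
    curl "${SOLR_URL%/extract}?commit=true"
}
```

A possible invocation would be `index_pdfs ./pdfs` followed by `commit_index`. Running `DRY_RUN=1 index_pdfs ./pdfs` first only prints the curl commands, which is a cheap way to check which files would be sent.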