Indexing multiple documents and mapping to a unique Solr ID

Published 2019-02-20 04:36

Question:

My use case is to index two files, a metadata file and a binary PDF file, under a single unique Solr ID. The metadata file contains XML, and some schema fields are mapped to elements in that XML.

What I do: extract text from the PDF files (using pdftotext), process that text, and retrieve specific information (for example, the PDF's first page/line carries information about the medicine and the research stage). The retrieved information (medicine/research stage) needs to be indexed, and one should be able to search, sort, and facet on it.
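The extraction step above can be sketched as follows. This is a minimal sketch: the first-page layout (lines like "Medicine: ..." and "Research Stage: ...") is an assumption about the documents, and the patterns would need adjusting to the real PDFs.

```python
import re

def extract_metadata(first_page_text: str) -> dict:
    """Pull the medicine and research-stage values out of pdftotext output.

    Assumes (hypothetically) that the first page contains lines such as
    'Medicine: Aspirin' and 'Research Stage: Phase II'.
    """
    fields = {}
    medicine = re.search(r"Medicine:\s*(.+)", first_page_text)
    stage = re.search(r"Research Stage:\s*(.+)", first_page_text)
    if medicine:
        fields["medicine"] = medicine.group(1).strip()
    if stage:
        fields["researchStage"] = stage.group(1).strip()
    return fields

sample = "Medicine: Aspirin\nResearch Stage: Phase II\n"
print(extract_metadata(sample))
```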

I can create an XML file with the retrieved information (let's call this the metadata file). Now, assuming my schema is

<field name="medicine" type="text" stored="true" indexed="true"/>
<field name="researchStage" .../>

Is there a way to put both this metadata file and the PDF file into Solr?

What I have tried:

  1. Based on a suggestion in the mailing-list archives, I zipped the two files and passed the archive to ExtractingRequestHandler. I was able to put all the content into Solr and make it searchable, but it appears as the content of the zip file (I had to apply some patches to the Solr code base to make this work). This is not sufficient, because the content of the metadata file is not mapped to field names. curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@file.zip"

  2. I tried to work with the DataImportHandler (BinURLDataSource), but I don't think I understand how it works, so I could not get far.

  3. I thought of adding metadata tags to the PDF itself. For this to work, ExtractingRequestHandler would have to process that metadata, and I am not sure it does. I tried "pdftk" to add the metadata, but was not able to add custom tags; it only updates/adds title, author, keywords, etc. Does anyone know of a similar Unix tool?

If someone has tips, please share. I want to avoid creating a single file (by merging the PDF text and the metadata file).

Answer 1:

Given a file record1234.pdf and metadata like:

<metadata>
<field1>value1</field1>
<field2>value2</field2>
<field3>value3</field3>
</metadata>

Do the programmatic equivalent of

curl "http://localhost:8983/solr/update/extract?\
literal.id=record1234.pdf\
&literal.field1=value1\
&literal.field2=value2\
&literal.field3=value3\
&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_txt&boost.foo_txt=3" \
  -F "file=@record1234.pdf"

Adapted from http://wiki.apache.org/solr/ExtractingRequestHandler#Literals .

This will create a new entry in the index containing the text extracted by Tika/Solr Cell as well as the fields you specify.

You should be able to perform these operations in your favorite language.
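For example, the curl invocation above can be built programmatically by walking the metadata XML and turning each element into a literal.* parameter. A minimal Python sketch (the element names field1, field2, ... are assumed to match your Solr schema fields; the PDF itself would still be POSTed as multipart form data, which is omitted here since it needs a running Solr):

```python
import urllib.parse
import xml.etree.ElementTree as ET

def build_extract_params(doc_id: str, metadata_xml: str) -> str:
    """Turn a metadata XML document into the query string for /update/extract.

    Each child element of <metadata> becomes a literal.<tag>=<text> parameter.
    """
    root = ET.fromstring(metadata_xml)
    params = [("literal.id", doc_id)]
    for child in root:
        params.append(("literal." + child.tag, child.text or ""))
    params.append(("commit", "true"))
    return urllib.parse.urlencode(params)

metadata = "<metadata><field1>value1</field1><field2>value2</field2></metadata>"
query = build_extract_params("record1234.pdf", metadata)
print(query)
# The full request URL would then be:
#   http://localhost:8983/solr/update/extract?<query>
# with the PDF attached as a multipart file field, as in the curl example.
```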


the content in metadata file is not mapped to field names

If they don't map to a predefined field, use dynamic fields. For example, you can define *_i as an integer field.
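In schema.xml a dynamic field rule looks like the following (a sketch; the exact type names depend on the types declared in your schema):

```xml
<!-- any field whose name ends in _i is indexed as an integer -->
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<!-- any field whose name ends in _txt is indexed as text -->
<dynamicField name="*_txt" type="text" indexed="true" stored="true"/>
```

With this in place, a parameter such as literal.dosage_i=50 indexes without declaring a dosage_i field in advance.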

I want to avoid creating 1 file(by merging PDF text + metadata file).

That looks like programmer fatigue :-) But do you have a good reason?



Tags: pdf solr