Question:
I have a PDF file that includes form fields, and I need to export the data to an XML file AUTOMATICALLY. Here is a screenshot of a sample form I created for testing:
Note: Exporting it MANUALLY works great using Acrobat Professional: click Tools > Form > Export Form Data and then choose the xml extension for the output file. This is the result I get when I export it manually:
<?xml version="1.0" encoding="UTF-8"?>
<fields>
<first_name>John</first_name>
<last_name>Doe</last_name>
</fields>
However, I need to automate it, e.g. with a Python script, a Java implementation, or some command line tool. Any ideas which libraries or tools I could use to export form field data to XML? The tool or library should be open source, so that I can integrate it into my workflow.
I already tried the Python pdfminer library, which helped me export the static parts of the PDF file (like "Static form header", "First name:" and "Last name:"). But how do I export the form field data (in my case the content of the form fields first_name and last_name)?
EDIT: Feel free to download the sample.pdf file here.
Answer 1:
How about Apache PDFBox? It is open source and could fit your needs, since the website says "Extract forms data from PDF forms or prefill a PDF form."
EDIT: Check out the PrintFields example.
Answer 2:
In bash, you can do this (at least with my version of these tools, less 444 and cat 8.13):
less ~/Downloads/sample.pdf | cat
I get output that looks like this:
Static form header
First name: John
Last name: Doe
You can then parse this easily using Java, Python, awk, or whatever you prefer.
Of course, alternatively, if you don't want to rely on the behavior of particular versions of these (not sure if they always do this or not), you can look up less's source code to see how it does it.
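The parsing step mentioned above might look like this in Python. It is only a minimal sketch, assuming the text output has the "Label: value" shape shown in this answer; parse_label_lines is a hypothetical helper name, not part of any library.

```python
from typing import Dict


def parse_label_lines(text: str) -> Dict[str, str]:
    """Collect 'Label: value' lines into a dict, skipping lines without a value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        label, sep, value = line.partition(":")
        if sep and value.strip():
            fields[label.strip()] = value.strip()
    return fields


sample = "Static form header\nFirst name: John\nLast name: Doe\n"
print(parse_label_lines(sample))
```

Note that the keys here are the visible labels ("First name"), not the internal field names (first_name), so a mapping between the two would still be needed to reproduce the XML from the question.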
Answer 3:
In Java there are a few libraries for working with PDF, but in general it is hard to get formatted information out of a PDF. I have never implemented this myself, but Qoppa looks good and seems advanced, although it is not free. It includes jPDFFields, which should be useful for extracting values from form fields.
There is also a similar thread that has some information about a command line tool.
I hope this helps.
Answer 4:
I had much success using pdfminer:
pdf2txt.py -o out.xml -t xml sample.pdf
Then parse the result using XPath and join the strings. To use it from your own code, trace through the code here.
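The "parse and join strings" step could be sketched as follows with Python's standard library. This assumes the layout that pdf2txt.py usually produces with -t xml, where each character sits in its own text element inside a textline element; the function name textlines and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def textlines(xml_text: str) -> List[str]:
    """Join the per-character <text> elements of every <textline> into one string."""
    root = ET.fromstring(xml_text)
    lines: List[str] = []
    for line in root.iter("textline"):
        joined = "".join(t.text or "" for t in line.findall("text")).strip()
        if joined:
            lines.append(joined)
    return lines


# Hand-written sample imitating the assumed layout, not real pdf2txt.py output.
SAMPLE = (
    "<pages><page id='1'><textbox id='0'><textline>"
    "<text>J</text><text>o</text><text>h</text><text>n</text><text>\n</text>"
    "</textline></textbox></page></pages>"
)
print(textlines(SAMPLE))
```

For a real run, the contents of out.xml would be passed in instead of SAMPLE.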
Other than that, there is a new kid on the block called tabula, written in Ruby, which I haven't had the chance to use yet but which is supposed to be great.
I understand your unwillingness to use a paid service, but it is still worth mentioning that Adobe has a conversion service that, at the time of writing, costs $2 a month. Check it out, just saying...
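Coming back to pdfminer: besides extracting text, it can also be used to walk the PDF's AcroForm dictionary, which is where form field names (T) and values (V) are stored. The following is a rough sketch, not a tested recipe. It assumes pdfminer (or pdfminer.six) is installed, only looks at top-level fields (nested fields under Kids are ignored), and approximates PDFDocEncoding with Latin-1. The names decode_pdf_text and read_form_fields are hypothetical.

```python
from typing import Dict


def decode_pdf_text(raw) -> str:
    """Decode a PDF text string: UTF-16 if it carries a BOM, otherwise Latin-1."""
    if raw is None:
        return ""
    if isinstance(raw, bytes):
        if raw.startswith(b"\xfe\xff"):
            return raw[2:].decode("utf-16-be")
        return raw.decode("latin-1")
    return str(raw)  # e.g. name objects used by checkboxes


def read_form_fields(path: str) -> Dict[str, str]:
    """Return {field name: value} for the top-level AcroForm fields of a PDF."""
    # Imported lazily so decode_pdf_text can be used without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    values: Dict[str, str] = {}
    with open(path, "rb") as fh:
        doc = PDFDocument(PDFParser(fh))
        acroform = resolve1(doc.catalog.get("AcroForm"))
        if not acroform:
            return values  # the PDF has no form
        for ref in resolve1(acroform.get("Fields", [])):
            field = resolve1(ref)
            name = field.get("T")
            if name is not None:
                values[decode_pdf_text(name)] = decode_pdf_text(resolve1(field.get("V")))
    return values
```

With the sample form from the question, read_form_fields("sample.pdf") would be expected to yield the first_name and last_name entries, though this has not been verified against that file.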
Answer 5:
For a Java solution, you could use iText to read the fields and then something like jackson-dataformat-xml to write the results as XML. A somewhat basic example of this would be:
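// Assumes iText 5.x and jackson-dataformat-xml are on the classpath.
import com.fasterxml.jackson.dataformat.xml.XmlMapper;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import java.util.HashMap;
import java.util.Map;

// read fields: map each form field name to its current value
final PdfReader reader = new PdfReader("/path/to/my.pdf");
final AcroFields fields = reader.getAcroFields();
final Map<String, String> values = new HashMap<>();
for (String fieldName : fields.getFields().keySet()) {
    values.put(fieldName, fields.getField(fieldName));
}
reader.close();

// write: serialize the map as XML
final XmlMapper mapper = new XmlMapper();
final String result = mapper.writeValueAsString(values);
System.out.println(result);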
There is definitely some room for improvement here, but it may be a good enough starting point.
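For comparison, the writing step can also be sketched in Python using only the standard library, producing the fields layout shown in the question. This assumes the field names are valid XML element names (as first_name and last_name are); fields_to_xml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Mapping


def fields_to_xml(values: Mapping[str, str]) -> str:
    """Serialize {field name: value} as <fields><name>value</name>...</fields>."""
    root = ET.Element("fields")
    for name, value in values.items():
        # Each field becomes a child element named after the field.
        ET.SubElement(root, name).text = value
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(fields_to_xml({"first_name": "John", "last_name": "Doe"}))
```

The output is not pretty-printed, but it contains the same elements as the manual Acrobat export.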