I am trying to convert PDF to PDF/A.
Currently I can do this using the OpenOffice PDF viewer plugin together with JODConverter 2, but this is pretty cumbersome.
Does anybody know of any open source / free Java libraries I can use to do this?
I have found these open source libraries so far, but none of them supports converting PDF to PDF/A:
iText
gnujpdf
PDFBox
FOP
JFreeReport
PJX
JPedal
PDFjet
jPod
PDF Renderer
UPDATE
It seems that Apache FOP can convert a document (though not a PDF document) to PDF/A.
Converting from PDF to PDF/A
This is the answer to your question as originally phrased.
For a solution that does not involve potentially lossy re-rendering, take a look at http://www.opensubscriber.com/message/itext-questions@lists.sourceforge.net/8027900.html . It appears that Foris Zoltan was able to get something going using iText (not exhaustive, but possibly sufficient for most PDFs) without the overkill of re-rendering.
If Zoltan's solution is not acceptable or sufficient for your requirements, then you are stuck with re-rendering. You could stick with OpenOffice/JODConverter, or go for less overhead by using Ghostscript (the mother of them all), piping pdf2ps back into a PDF/A-enabled ps2pdf.
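As a rough sketch of the Ghostscript route from Java: ps2pdf is a thin wrapper around gs with the pdfwrite device, so the example below calls gs directly on the PDF instead of chaining pdf2ps and ps2pdf. The class name, the "gs" executable name (it is typically gswin64c on Windows) and the argument order of main are my own assumptions. The flags are the ones the Ghostscript documentation lists for PDF/A output, but they vary between versions, and PDFA_def.ps (shipped in Ghostscript's lib directory) has to be edited to point at an ICC profile first, so treat this as a starting point rather than a finished converter.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: re-renders a PDF to PDF/A by shelling out to Ghostscript.
final class GhostscriptPdfA {

    // Builds the gs command line; kept separate so it can be inspected or logged.
    static List<String> buildCommand(String gsExecutable, Path pdfaDef, Path input, Path output) {
        return List.of(
                gsExecutable,
                "-dPDFA=1",                      // request PDF/A-1 output
                "-dBATCH",
                "-dNOPAUSE",
                "-sDEVICE=pdfwrite",
                "-sColorConversionStrategy=RGB", // assumes an RGB ICC profile in PDFA_def.ps
                "-dPDFACompatibilityPolicy=1",   // drop non-conforming features instead of keeping them
                "-sOutputFile=" + output,
                pdfaDef.toString(),              // PDF/A definitions file, must come before the input
                input.toString());
    }

    // Usage (assumed): java GhostscriptPdfA PDFA_def.ps in.pdf out.pdf
    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 3) {
            System.err.println("usage: GhostscriptPdfA <PDFA_def.ps> <in.pdf> <out.pdf>");
            System.exit(2);
        }
        List<String> cmd = buildCommand("gs",
                Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        Process gs = new ProcessBuilder(cmd).inheritIO().start();
        System.exit(gs.waitFor()); // propagate Ghostscript's exit code
    }
}
```

Whatever produces the file, it is worth running the result through a PDF/A validator, since Ghostscript can exit successfully and still emit a file that is not fully compliant.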
Apache FOP
Other respondents have suggested Apache FOP, which in the context of PDF to PDF/A conversion has the following advantages and disadvantages:
- advantage: fewer "moving parts" than an OpenOffice/JODConverter combination (e.g. comparing in-process FOP with daemonized OO)
- disadvantage: you are responsible for converting from PDF to XSL-FO or otherwise rendering to FOP (more coding and/or integration work required of you), whereas OpenOffice/JODConverter and Ghostscript can require less additional coding.
However, if I am not mistaken, it appears that you are using PDF as an intermediate format, i.e. that what you are trying to achieve is XHTML to PDF to PDF/A conversion. By converting directly from XHTML to PDF/A, the process will be faster, will use fewer resources (e.g. memory), and will not needlessly degrade output quality (as re-rendering solutions can) or require intimate knowledge of the PDF format (as Zoltan's solution does).
In this case, converting directly from XHTML to PDF/A would be the ideal solution. You can either use iText directly (the example uses iTextSharp, a .NET port of iText, but it's the same for Java) or use Apache FOP as others have suggested. FOP is more bloated, less efficient and more complicated to set up than iText, but it might produce better results than the iText example. The only way to settle that is to try both on a few of your XHTML files as samples. :)
Seam PDF is just a convenience for projects that are using Seam. There is nothing that stops you from using Apache FOP with Seam in order to generate PDF files.
I have personally used Apache FOP to generate PDF/A files in a Web application and it works fine. As the link already given by Liggy says, it is as simple as:
userAgent.getRendererOptions().put("pdf-a-mode", "PDF/A-1b");
So my suggestion is to use Apache FOP directly instead of dealing with conversion (which also has performance issues).
Update:
The Apache FOP website contains a list of examples of how to use it from Java code.
http://xmlgraphics.apache.org/fop/0.95/embedding.html
Here is a minimal command line application that converts XML to PDF
Another approach which deals specifically with XHTML (and not just XML) is to use the xhtml2fo stylesheet from Antenna.
This is an example:
http://blog.platinumsolutions.com/node/216
Just add the following two lines before the creation of the "FOP" object and you are good to go.
FOUserAgent foUserAgent = fopFactory.newFOUserAgent();
foUserAgent.getRendererOptions().put("pdf-a-mode", "PDF/A-1b");
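To make the XHTML to XSL-FO step mentioned above a bit more concrete, here is a minimal sketch that uses only the JDK's built-in XSLT API. The stylesheet in it is a tiny stand-in I made up for illustration (it only turns XHTML paragraphs into fo:block elements). It is not the Antenna xhtml2fo stylesheet, which you would load from its file instead. The class, constant and method names are likewise hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: XHTML -> XSL-FO with plain javax.xml.transform.
final class XhtmlToFoSketch {

    // Stand-in stylesheet (NOT xhtml2fo): one A4 page, each XHTML <p> becomes an fo:block.
    static final String STAND_IN_STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + "    xmlns:fo='http://www.w3.org/1999/XSL/Format'"
        + "    xmlns:h='http://www.w3.org/1999/xhtml'"
        + "    exclude-result-prefixes='h'>"
        + "  <xsl:template match='/'>"
        + "    <fo:root>"
        + "      <fo:layout-master-set>"
        + "        <fo:simple-page-master master-name='page'"
        + "            page-height='297mm' page-width='210mm'>"
        + "          <fo:region-body/>"
        + "        </fo:simple-page-master>"
        + "      </fo:layout-master-set>"
        + "      <fo:page-sequence master-reference='page'>"
        + "        <fo:flow flow-name='xsl-region-body'>"
        + "          <xsl:apply-templates select='//h:p'/>"
        + "        </fo:flow>"
        + "      </fo:page-sequence>"
        + "    </fo:root>"
        + "  </xsl:template>"
        + "  <xsl:template match='h:p'>"
        + "    <fo:block><xsl:value-of select='.'/></fo:block>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    static final String SAMPLE_XHTML =
          "<html xmlns='http://www.w3.org/1999/xhtml'>"
        + "<body><p>Hello PDF/A</p></body></html>";

    // Applies the given XSLT stylesheet to the XHTML and returns the XSL-FO as a string.
    static String toFo(String xhtml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter fo = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(fo));
            return fo.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XHTML to XSL-FO transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toFo(SAMPLE_XHTML, STAND_IN_STYLESHEET));
    }
}
```

In a real setup you would not collect the FO in a string. Instead you would pass new SAXResult(fop.getDefaultHandler()) as the transformation result, so that FOP renders the PDF/A directly while the stylesheet runs. Keep in mind that PDF/A requires all fonts to be embedded, so FOP needs a font configuration before the "pdf-a-mode" option will work.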
You mention Apache FOP in your list of APIs, but this page (http://xmlgraphics.apache.org/fop/trunk/pdfa.html) mentions that there is some support for PDF/A:
PDF/A-1b is implemented to the degree that FOP supports the creation of the elements described in ISO 19005-1.
PDF/A-1a is based on PDF-A-1b and adds accessibility features (such as Tagged PDF). This format is available within the limitation described on the Accessibility page.
It doesn't specifically mention anything about PDF to PDF/A, but it might be an open source alternative.
There's a project hosted on GitHub, pdf2htmlEX, that is worth a look. It's open source and written in C++.
We just released jPDFPreflight, a Java library that can convert PDF files to PDF/A. This first version has some restrictions on the types of documents that can be converted.