We are developing a PDF parser to be used alongside our system.
The requirement is that we store all the information from any PDF document and are able to reproduce the document as is (with minimal changes from the original).
We did some googling and found iTextSharp to be the best fit for our purpose.
We are developing our project using .NET.
As you might have guessed from the title, I am looking for a comparison between specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp released under the LGPL/MPL license; the 5.x versions are AGPL.
We would like a good comparison between the versions before deciding whether to use the LGPL version or buy a commercial license for the AGPL version (we do not want to publish our code).
I browsed through the revision changes in iTextSharp, but I would like to know whether any resource exists that makes a good comparison between the versions.
Thanks in advance!
I'm the CTO of iText Software, so just like Michaël, who already answered in the comment section, I'm both the most authoritative source and a biased one.
There's a very simple comparison chart on the iText web site: http://itextpdf.com/functionalitycomparison
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page: http://itextpdf.com/salesfaq
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
- 5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
- 5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
- 5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity).
- 5.0.1: New filtering functionality for text renderers.
- 5.0.1: Additional utility method for previewing pdf content.
- 5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
- 5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
- 5.0.1: Added rudimentary support for XObject Image callbacks
- 5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
- 5.0.1: Bug fix - matrices were being concatenated in the wrong order.
- 5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
- 5.0.1: Getters for GraphicsState
- 5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
- 5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
- 5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
- 5.0.2: PdfContentReaderTool: Show details on resource entries
- 5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
- 5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
- 5.0.2: Adjustments to linesegment implementation; optimization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
- 5.0.3: added method to get area of image in user units
- 5.0.3: better parsing of inline images
- 5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
- 5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
- 5.0.4: Expose CTM
- 5.0.4: Refactor to pull inline image processing into its own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
- 5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
- 5.0.4: Applying stream filters to inline images.
- 5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
- 5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
- 5.0.6: handle slightly malformed embedded images
- 5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
- 5.0.6: performance: Cache the fonts used in text extraction
- 5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
- 5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
- 5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
- 5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
- 5.1.3: images: allow correct decoding of 1bpc bitmask images
- 5.1.3: images: add jbig2 streams to pass through
- 5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
- 5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
- 5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
- 5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
- 5.2.0: Made a utility method in PdfContentStreamProcessor private and clarified the stateful nature of the class
- 5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
- 5.2.0: Better handling of color space dictionaries in images.
- 5.2.0: improve handling of quasi improper inline image content.
- 5.2.0: don't decode inline image streams until we absolutely need them.
- 5.2.0: avoid NullPointerException if resource dictionary isn't provided.
- 5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
- 5.3.3: incorporate the text-rise parameter
- 5.3.3: expose glyph-by-glyph information
- 5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
- 5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
- 5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
- 5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
- 5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
- 5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
- 5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
- 5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
- 5.4.2: Added an appendTextChunk() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
- 5.4.5: Added MultiFilteredRenderListener class for PDF parser.
- 5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
- 5.4.5: Added method getMcid() in TextRenderInfo.
- 5.4.5: fixed resource leak when many inline images were in content stream
- 5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
- 5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
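To make two of the entries above more concrete (the location-aware listener from 5.0.1 and the word-boundary check from 5.4.2), here is a minimal sketch of the general idea in plain Java. It is not iText code and does not use the iText API: the `Chunk` class, the `reconstruct` method, and the `LINE_BUCKET` and `WORD_GAP` thresholds are all hypothetical names and values invented for this illustration. It only shows why knowing the position of each text run lets a parser restore reading order and decide where spaces belong, even when the content stream emits the runs out of order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LocationAwareSketch {

    // Hypothetical stand-in for a positioned run of text; not an iText type.
    static final class Chunk {
        final String text;
        final double x;      // start of the baseline, in user space units
        final double y;      // height of the baseline, in user space units
        final double width;  // advance width of the run

        Chunk(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Illustrative thresholds only (assumptions); real documents need
    // values derived from the font size and the width of a space.
    static final double LINE_BUCKET = 4.0;
    static final double WORD_GAP = 1.0;

    // Chunks whose baselines fall into the same bucket count as one line.
    // Bucketing keeps the sort order consistent (transitive).
    static long lineOf(Chunk c) {
        return Math.round(c.y / LINE_BUCKET);
    }

    static String reconstruct(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // y grows upwards in PDF user space, so higher lines come first;
        // within a line, chunks are ordered left to right.
        sorted.sort(Comparator.comparingLong((Chunk c) -> -lineOf(c))
                              .thenComparingDouble(c -> c.x));

        StringBuilder out = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : sorted) {
            if (prev != null) {
                if (lineOf(prev) != lineOf(c)) {
                    out.append('\n');   // baseline changed: new line
                } else if (c.x - (prev.x + prev.width) > WORD_GAP) {
                    out.append(' ');    // visible gap: word boundary
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Chunks deliberately listed out of reading order.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("world", 130, 700, 28));
        chunks.add(new Chunk("Second line", 100, 680, 60));
        chunks.add(new Chunk("Hel", 100, 700, 15));
        chunks.add(new Chunk("lo", 115, 700, 10));
        System.out.println(reconstruct(chunks));
    }
}
```

With the sample data, "Hel" and "lo" touch each other and are joined without a space, "world" follows after a gap and gets one, and the chunk on the lower baseline starts a new line. A real implementation also has to cope with rotated text, text rise, and character spacing, which is what many of the fixes listed above address.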
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides: http://www.slideshare.net/iTextPDF/itext-summit-2014-talk-unstructured-pdf
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future: http://www.slideshare.net/iTextPDF/itext-summit-2014-keynote-talk
In all honesty: using the 5-year-old version wouldn't only be like reinventing the wheel, it would also be like falling into every pitfall we've fallen into over the last 5 years. I can assure you that buying a license will be less expensive.