I want to be able to read the content of pdf files. I need to do that with C on Linux.
The closest I could get to this was here, but I think Haru can only create PDFs and is not able to read them (not 100% sure).
PS: I only need the plain text from the PDF.
Check out libpoppler. I've never used it for extracting text, only for querying PDF attributes, but it's pretty easy to use.
How well do you need to parse them? Just extracting strings should be relatively easy; fully accurate rendering is harder. Take a look at the source for Evince or Ghostscript.
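To give an idea of what "just extracting strings" could look like, here is a minimal sketch in plain C. It is not how Evince or Ghostscript do it. It assumes the content stream is already uncompressed and that the font uses a simple ASCII-compatible encoding, which many real PDFs do not (streams are usually Flate-compressed). The function name `extract_literal_strings` is made up for this example. It collects every literal string written as `( ... )`, handles backslash escapes and nested parentheses, and separates the strings with a single space. Hex strings written as `<...>` and the kerning adjustments inside `TJ` arrays are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* strlen, for callers that pass C strings */

/* Hypothetical helper (not part of any library): copies the contents of
 * every PDF literal string "( ... )" found in buf into out, separated by
 * single spaces. Output is truncated to fit out_cap and is always
 * NUL-terminated. Returns the number of bytes written, excluding the NUL. */
size_t extract_literal_strings(const char *buf, size_t len,
                               char *out, size_t out_cap)
{
    assert(buf != NULL && out != NULL && out_cap > 0);

    size_t n = 0;   /* bytes written so far */
    int depth = 0;  /* parenthesis nesting; 0 means outside a string */

    for (size_t i = 0; i < len; i++) {
        char c = buf[i];

        if (depth == 0) {
            if (c == '(') {
                depth = 1;
                /* separate consecutive strings with one space */
                if (n > 0 && n + 1 < out_cap)
                    out[n++] = ' ';
            }
            continue;
        }

        if (c == '\\' && i + 1 < len) {
            c = buf[++i];
            if (c >= '0' && c <= '7') {
                /* octal escape: one to three octal digits */
                int v = 0, k = 0;
                while (k < 3 && i < len && buf[i] >= '0' && buf[i] <= '7') {
                    v = v * 8 + (buf[i] - '0');
                    i++;
                    k++;
                }
                i--;  /* the for loop advances past the last digit */
                c = (char)v;
            } else {
                switch (c) {
                case 'n': c = '\n'; break;
                case 'r': c = '\r'; break;
                case 't': c = '\t'; break;
                case 'b': c = '\b'; break;
                case 'f': c = '\f'; break;
                default: break;  /* \( \) and \\ stand for themselves */
                }
            }
        } else if (c == '(') {
            depth++;             /* balanced nested parenthesis, kept as text */
        } else if (c == ')') {
            if (--depth == 0)
                continue;        /* end of this literal string */
        }

        if (n + 1 < out_cap)
            out[n++] = c;
    }

    out[n] = '\0';
    return n;
}
```

You would read the stream bytes into a buffer yourself (for example with `fread`) and then call `extract_literal_strings(buf, len, out, sizeof out)`. For compressed streams you would have to inflate the data first, for example with zlib, so for anything beyond a quick experiment a real PDF library is probably the better route.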
This is for C++, but it might be a good starting point for understanding PDF structure: http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx (sorry, wrong link before)
Another possibility, though I've never used it, is VersyPDF. It claims to allow you to edit PDFs ... http://versypdf.sybrex-systems-ltd.qarchive.org/