Extracting article contents from PDF magazines

Published 2019-08-29 11:16

Question:

First of all, I am not looking for a specific development answer, but rather a development approach.

The problem I am having is that I have a client with an enormous number of articles in PDFs: about 150 articles across fifty PDFs per year, for the last 20 years. All of these PDFs are compiled from QuarkXPress, by people on Macs (if that info matters). Every time a new PDF magazine is created, the web-development team copies and pastes (!) each article into a form on the internet (!), including title, content, keywords, references, author name, etc. It usually takes about three full days for one guy to finish the job.

When I was working there (I am not anymore; this was nearly seven years ago), I sped the process up threefold using a clipboard-monitoring app and some simple XML-based PHP scripts that interacted with the server. All you needed to do then was select text, CTRL+C, select some more text, CTRL+C, go to the app (ALT+TAB), press 'next article', and repeat. But we, or mostly I, still spent about fifty days per year processing PDF magazines.

Now, seven years down the line, I am about to speak to my old boss again for a friendly visit. I know they are still using my apps (!). Perhaps it is a nice idea to look into their problem again and see if I can suggest a coding project that could help them.

I have never used QuarkXPress; I only know that it is something similar to MS Word, and that is as far as my knowledge of the software goes. I am also not very familiar with unencrypted, extracted PDF code/syntax.

In short: Does QuarkXPress have some specific compilation patterns that can be used in PDF scripts to extract articles? What 'intelligent' tools are there that can 'learn' from similarly structured PDF pages where the article contents are? Are there tools out there, like QuarkXPress modules of some sort, that can 'encapsulate' or 'mark' an article with an invisible reference tag, to make extraction a lot simpler for scripts?

The people creating these PDFs have been doing their job for the past 20 years and are unwilling to change their workflow, except for software updates. Any additional tool must not interfere with their workflow, or they will simply refuse it.

I don't want code, merely some descriptions of what you or others have done with regard to other PDF extraction problems. The best answer would describe several methods, or give some references to external links with case descriptions.

Answer 1:

Broad question, but at first sight my answer would be that, if you let them go as far as the PDF, you are making things very difficult already. If they are still using QuarkXPress, there are far better ways to do this kind of thing, and similar approaches are actually used by quite a few publishers out there.

1) Look into generating both PDF and XML out of QuarkXPress. It's fine that they don't want to change their ways, but they have to create a PDF out of Quark anyway, and also generating XML is not a really big additional step. In fact (warning: affiliation!) there are tools that can fold all of this into one step. You could write AppleScript, for example, to steer the process, but something like axaio MadeToPrint will automatically generate both the (correct) PDF and an XML file when people click 'export'.
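To make that concrete, the XML export could look something like the snippet below. All element names here are purely illustrative; the real structure depends entirely on how the tagging/export is configured in Quark:

    <issue date="2019-08">
      <article>
        <title>Some article title</title>
        <author>Jane Doe</author>
        <keywords>pdf, quark, extraction</keywords>
        <body>Full article text...</body>
      </article>
      <!-- one <article> element per article in the issue -->
    </issue>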

2) Once you have the PDF and the XML of the same content, use the PDF for print (just as now) and write some code to convert the XML into whatever you need on the web site. If the coding is done on the web site itself, you might not even need to tweak the XML coming out of Quark; simply make the site smart enough to pick up whatever bits and pieces it needs.
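Since your existing tooling is already PHP, the conversion step could be a small script along these lines. This is just a sketch against the hypothetical XML structure above (the file name and element names are assumptions); SimpleXML ships with standard PHP:

    <?php
    // Load the (hypothetical) XML export for one issue.
    $xml = simplexml_load_file('issue-2019-08.xml');
    if ($xml === false) {
        die("Could not parse the XML export\n");
    }

    foreach ($xml->article as $article) {
        // Map each article onto the fields the web form expects.
        $record = [
            'title'    => (string) $article->title,
            'author'   => (string) $article->author,
            'keywords' => array_map('trim', explode(',', (string) $article->keywords)),
            'body'     => (string) $article->body,
        ];
        // Instead of copy-pasting by hand, push $record to the site,
        // e.g. via curl against the site's own import endpoint.
        print_r($record);
    }

That would replace the clipboard round-trips entirely: one export from Quark, one import run on the server.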

Broad answer to a broad question; I hope that is what you were looking for...