Batch OCR Program for PDFs [closed]

2019-01-16 13:13发布

This has been asked before, but I don't really know if the answers help me. Here is my problem: I got a bunch of (10,000 or so) pdf files. Some were text files that were saved using adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.

It seems like people have recommended:

  1. adobe pdf (I don't have a licensed copy of this so ... plus if ABBYY finereader or something is better, why pay for it if I won't use it)
  2. ocropus (I can't figure out how to use this thing),
  3. Tesseract (which seems like it was great in 1995 but I'm not sure if there's something more accurate plus it doesn't do pdfs natively and I've have to convert to TIFF. that raises its own problem as I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff. plus i don't want 10,000 30 page documents turned into 30,000 individual tiff images).
  4. wowocr
  5. pdftextstream (that was from 2009)
  6. ABBYY FineReader (apparently its' $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate ocr).

Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.

BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.

Here are the other threads:

Batch OCRing PDFs that haven't already been OCR'd

Open source OCR

PDF Text Extraction Approach Using OCR

https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred

5条回答
倾城 Initia
2楼-- · 2019-01-16 13:38

Just to put some of your misconceptions straight...

" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."

You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
  input.pdf

"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"

You can have "multipage" TIFFs easily. Above command does create such TIFFs of the G4 (fax tiff) flavor. Should you even want single-page TIFFs instead, you can modify the command:

gs \
 -o input_page_%03d.tif \
 -sDEVICE=tiffg4 \
  input.pdf

The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.

Caveats:

  1. The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline.
  2. Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set widthxheight in device points. So to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter.

So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
 -r720x720 \
 -g5950x8420 \
  input.pdf
查看更多
Luminary・发光体
3楼-- · 2019-01-16 13:45

I have written a small wrapper for Abbyy OCR4LINUX CLI engine (IMHO, doesn't cost that much) and Tesseract 3.

The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory

The script uses pdffonts to determine if a PDF file has already been OCRed to skip them. Also, the script can work as system service to monitor a directory and launch an OCR action as soon as a file enters the directory.

Script can be found here:
https://github.com/deajan/pmOCR

Hopefully, this helps someone.

查看更多
Emotional °昔
4楼-- · 2019-01-16 13:47

The simplest approach would be to use a single tool such a ABBYY FineReader, Omnipage etc to process the images in one batch without having to sort them out into scanned vs not scanned images. I believe FineReader converts the PDF's to images before performing OCR anyway.

Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling etc. These are features you would have to buy an image processng library for and program yourself and it could prove difficult to find an optimal set of parameters for your 10,000 PDF's.

Using the automatic OCR approach will have other side effects depending on the input images and you would find you would get better results if you sorted the images and set optimal parameters for each type of images. For accuracy it would be much better to use a proper PDF text extraction routine to extract the PDF's that have perfect text.

At the end of the day it will come down to time and money versus the quality of the results that you need. At the end of the day, a commercial OCR program will be the quickest and easiest solution. If you have clean text only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.

I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.

查看更多
女痞
5楼-- · 2019-01-16 13:48

This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:

string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();

foreach (string path in candidatePDFs) {
    using (FileStream stm = new FileStream(path, FileMode.Open)) {
        if (decoder.IsValidFormat(stm)) {
            ProcessPdf(path, stm);
        }
    }
}

This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:

public void ProcessPdf(string path, Stream stm)
{
    using (Document doc = new Document(stm)) {
        int i=0;
        foreach (Page p in doc.Pages) {
            if (p.SingleImageOnly) {
                ProcessWithOcr(path, stm, i);
            }
            else {
                ProcessWithTextExtract(path, stm, i);
            }
            i++;
        }
    }
}

This opens the file as a Document object and asks if each page is image only. If so it will OCR the page, else it will text extract:

public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        PdfDecoder decoder = new PdfDecoder();
        using (AtalaImage image = decoder.Read(pdfStm, page)) {
            ImageCollection coll = new ImageCollection();
            coll.Add(image);
            ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
            OcrEngine engine = GetOcrEngine();
            engine.Initialize();
            engine.Translate(source, "text/plain", textStream);
            engine.Shutdown();
        }
    }
}

what this does is rasterizes the PDF page into an image and puts it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.

You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.

Finally, you would need the code to extract text from the pages that already have perfectly good text on them:

public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        StreamWriter writer = new StreamWriter(textStream);
        using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
            PdfTextPage page = doc.GetPage(i);
            writer.Write(page.GetText(0, page.CharCount));
        }
    }
}

This extracts the text from the given page and writes it to the output stream.

Finally, you need GetTextStream():

public Stream GetTextStream(string sourcePath, int pageNo)
{
    string dir = Path.GetDirectoryName(sourcePath);
    string fname = Path.GetFileNameWithoutExtension(sourcePath);
    string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
    return new FileStream(finalPath, FileMode.Create);
}

Will this be a 100% solution? No. Certainly not. You could imagine PDF pages that contain a single image with a box draw around it - this would clearly fail the image only test but return no useful text. Probably, a better approach is to just use the extracted text and if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.

查看更多
劫难
6楼-- · 2019-01-16 14:02

Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the pdf files in unix (well, osx for me), then the pdf files that have text will have the word "Font" in them (as a string, but mixed in with other text) b/c that's how the file tells Adobe what fonts to do display.

The cat command in bash seems to have the same output as reading the file in binary mode in python (using 'rb' mode when opening file instead of 'w' or 'r' or 'a'). So I'm assuming that all pdf files that contain text with have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all pdf files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the pdf files to the appropriate folder.

Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).

    import os, re

    #path is the directory with the files, other 2 are the names of the files you will store your lists in

    path = 'C:/folder_with_pdfs'
    files_with_text = open('files_with_text.txt', 'a')
    image_only_files = open('image_only_files.txt', 'a')


    #have os make a list of all files in that dir for a loop
    filelist = os.listdir(path)

    #compile regular expression that matches "Font"
    mysearch = re.compile(r'.*Font.*', re.DOTALL)

    #loop over all files in the directory, open them in binary ('rb'), search that binary for "Font"
    #if they have "Font" they have text, if not they don't
    #(pdf does something to understand the Font type and uses this word every time the pdf contains text)
    for pdf in filelist:
        openable_file = os.path.join(path, pdf)
        cat_file = open(openable_file, 'rb')
        usable_cat_file = cat_file.read()
        #print usable_cat_file
        if mysearch.match(usable_cat_file):
            files_with_text.write(pdf + '\n')
        else:
            image_only_files.write(pdf + '\n')

To move the files, I entered this command in bash shell:

cat files_with_text.txt | while read i; do mv $i Volumes/hard_drive_name/new_destination_directory_name; done 

Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.

查看更多
登录 后发表回答