How can I use Tesseract OCR (or any other free OCR)?

Posted 2019-01-21 17:24

Question:

So what I heard after research is that the only solid free OCR options are either Tesseract or CuneiForm.

Now, the Tesseract docs are plain horrible. All they give you is a bunch of Visual Studio code (for me on Windows), and from there you are on your own in an ocean of their API. All you can do is build the exe and then run it on a TIFF image.

I was expecting at least short documentation that tells you how to pull their API call to use OCR at least for a small example but no, there's nothing like that in their docs.

CuneiForm: I downloaded it and "great" everything is in Russian. :(

Is it really that hard for those guys to put together a small example? Instead they supply us with a bunch of irrelevant info that probably 90% of people will never reach. How can you get there without starting on small things, and they explain none of it!

So I have a bunch of APIs, but how the hell am I supposed to use them if it's explained nowhere? Maybe someone can offer me advice and a solution? I'm not asking for a miracle, just something small to show me how things work.

Answer 1:

You might have given up, but there may be others who are still trying. So here is what you need to get started with Tesseract:

First of all, you should read all the documentation about Tesseract. You may find something useful in the wiki.

To start using the API (v3.0.1, currently in trunk; read also the README and ChangeLog from trunk) you should check out baseapi.h. The documentation of how to use the API is right there, in a comment above each function.

For starters:

  • include baseapi.h & construct TessBaseAPI object
  • call Init()
  • Some optional steps, such as:
    • change some params with the SetVariable() func. You can see all the params and their values if you print them in a file using PrintVariables() func.
    • change the segmentation mode with SetPageSegMode(). Tell tesseract what the image you are about to OCR represents - block or line of text, word or character.
  • SetImage()
  • GetUTF8Text()

(Again, that is just for starters.)

You can check the Tesseract community for already answered questions or ask your own here.



Answer 2:

I'm digging into it. So far I've generated Doxygen docs for it, and that's helping. Still reading all the docs though.

Some links that help me:

  • The dev google group is full of broken examples from desperate devs
  • A slightly old (v2.0) hacking tesseract how to

Anyway, I downloaded the source from SVN on Google Code: http://code.google.com/p/tesseract-ocr/

and made and installed it then used doxygen to generate my own API reference docs. Very useful.

The way I did it is:

  1. I used 'make install' and it put some stuff in /usr/include/tesseract
  2. I copied that dir to my home dir
  3. doxygen -g doxygen.conf  # generates a template Doxygen config file
  4. Go through the file it generates and set the output dir, project name, or whatever. I used 'doxy-doc' as my output dir
  5. doxygen doxygen.conf  # runs Doxygen with that config
  6. chromium-browser doxy-doc/html/index.html

Hope that helps a bit.



Answer 3:

I figured it out. If you are using Visual Studio 2010 and Windows Forms / the designer, you can add it easily this way with no issues:

  1. Add the following projects to your project (I am warning you once: do not add the tesseract solution, or change any setting in the projects you add, unless you love to hate yourself):

    ccmain ccstruct ccutil classify cube cutil dict image libtesseract neural_networks textord viewer wordrec

You can add the others, but you don't really want all that built into your project, do you? Naaa, build those separately.

  2. Go to your project properties and add libtesseract as a reference; you can, now that it is visible as a project. This makes your project build fast without examining the millions of warnings within Tesseract. [common properties]->[add reference]

  3. Right-click your project in the solution explorer and click project dependencies. Make sure it is dependent on libtesseract, or even all of them; it just means they build before your project.

  4. The Tesseract 2010 Visual Studio projects contain a number of configurations, namely release, release.dll, debug, and debug.dll, and it seems that the release.dll settings produce the right files. First, set the solution output to release.dll. Open your project properties, then click configuration manager. If that is not available, click the SOLUTION's properties in the solution tree and click the configuration tab; you will see a list of projects and the associated configuration settings. You will notice your project is not set to release.dll even though the output is. If you took the second route you still need to click configuration manager. Then you can edit the settings: click new on your project's settings, call it release.dll, exactly the same as the rest of them, and copy the settings from release. Do the same thing for debug, so that you have a debug.dll configuration copied from the debug settings. Whew... almost done.

  5. Don't try to change Tesseract's settings to match yours. That won't work, and when the new release comes out you won't be able to just "throw it in" and go. Accept the fact that in this state your new modes are Release.dll and Debug.dll. Don't stress out; you can go back when it is finished and remove the projects from your solution.

  6. Guess where the libraries and DLLs come out? In your project. You may or may not need to add the library directories. Some people say to dump all the headers into a single folder so they only need to add one folder to the includes, but not me. I want to be able to delete the tesseract folder and reload it from the zips without extra work, and be fully ready to update in one move or restore it if I made a mess of the code. It's a bit of work, and you can do it with code instead of the settings (which is the way I do it), but you should include all the folders that contain header files within the 2010 tesseract project folder and leave them alone.

  7. There is no need to add any files to your project, just these lines of code. I have included some additional code that converts from one foreign data set to the TIFF-friendly version with no need to save / load a file. Aren't I nice?

  8. Now you can fully debug in debug.dll and release.dll. Once you have successfully built it into your project even once, you can remove all the added projects and it will be peeerfect. No extra compiling or errors. Fully debuggable, all natural.

  9. If I remember right, I could not get around the fact that I had to copy the files in 2008/lib/ into my project's release folder... darn it.

In my projects “functions.h” I put

#pragma comment (lib, "liblept.lib" )
#define _USE_TESSERACT_
#ifdef _USE_TESSERACT_
#pragma comment (lib, "libtesseract.lib" )
#include <baseapi.h>
#endif
#include <allheaders.h>

in my main project I put this in a class as a member:

tesseract::TessBaseAPI *readSomeNombers;

and of course I included “functions.h” somewhere

then I put this in my classes constructor:

readSomeNombers = new tesseract::TessBaseAPI();
readSomeNombers ->Init(NULL, "eng" );
readSomeNombers ->SetVariable( "tessedit_char_whitelist", "0123456789,." );

Then I created this class member function, plus a class member to serve as the output. Don't hate, I don't like returning variables; not my style. Note that the pix from pixCreate() and the text buffer from GetUTF8Text() are both heap allocations owned by the caller, so they are released at the end of the function. But by all means, you can do whatever.

void Gaara::scanTheSpot()
{
    // 200x40 image at 32 bits per pixel, allocated by Leptonica.
    Pix *someNewPix = pixCreate( 200 , 40 , 32 );
    convertEasyBmpToPix( &scanImage, someNewPix, 87, 42 );

    readSomeNombers ->SetImage(someNewPix);
    char* outText = readSomeNombers ->GetUTF8Text();
    classMemeberVariable = ( outText != NULL ) ? outText : "";
    //pixWrite( "test.bmp", someNewPix, IFF_BMP );

    delete [] outText;          // GetUTF8Text() returns a new[] buffer
    pixDestroy( &someNewPix );  // release the Leptonica image
}

The object that has the information that I want to scan is in memory and is pointed to by &scanImage. It is from the “EasyBMP” library but that is not important.

I deal with that in a function in “functions.h” / “functions.cpp”. By the way, I am doing a little extra processing here while I am in the loop, namely thinning the characters, making the image black and white, and reversing black and white, which is unnecessary. At this phase in my development I am still looking for ways to improve the recognition, though for my purposes this has not yielded bad data yet. My view is to use the default Tess data for simplicity. I am acting heuristically to solve a very complex problem.

// A pixel counts as white only if no channel is darker than 50.
// Defined before convertEasyBmpToPix() so it is visible there.
bool isWhite( const RGBApixel &pixel )
{
    return ( pixel.Red   >= 50 ) &&
           ( pixel.Blue  >= 50 ) &&
           ( pixel.Green >= 50 );
}

// Copies a window of sourceImage, starting at (startX, startY) and sized
// like outputImage, into outputImage as inverted black and white.
void convertEasyBmpToPix( BMP *sourceImage, PIX *outputImage, unsigned startX, unsigned startY )
{
    int endX = startX + pixGetWidth( outputImage );
    int endY = startY + pixGetHeight( outputImage );
    int destinationY = 0;
    for( int yLoop = startY; yLoop < endY; yLoop++ )
    {
        int destinationX = 0;
        for( int xLoop = startX; xLoop < endX; xLoop++ )
        {
            // GetPixel() returns by value, so pass the result by const
            // reference instead of taking the address of a temporary.
            if( isWhite( sourceImage->GetPixel( xLoop, yLoop ) ) )
            {
                pixSetRGBPixel( outputImage, destinationX, destinationY, 0,0,0 );
            }
            else
            {
                pixSetRGBPixel( outputImage, destinationX, destinationY, 255,255,255 );
            }
            destinationX++;
        }
        destinationY++;
    }
}
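
For anyone without EasyBMP or Leptonica at hand, the thresholding rule can be tried on its own. The following is a minimal sketch using only the standard library. The `Rgb` struct and the names `isLight` and `thresholdAndInvert` are stand-ins invented for illustration, not part of either library. It applies the same rule as isWhite() above (every channel at least 50) and the same inversion (light becomes black, dark becomes white).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for EasyBMP's RGBApixel.
struct Rgb {
    std::uint8_t red;
    std::uint8_t green;
    std::uint8_t blue;
};

// A pixel counts as light only if every channel is at least `threshold`.
bool isLight(const Rgb& p, int threshold = 50)
{
    return p.red >= threshold && p.green >= threshold && p.blue >= threshold;
}

// Returns one byte per pixel: 0 (black) where the source was light,
// 255 (white) where it was dark, i.e. thresholded and inverted.
std::vector<std::uint8_t> thresholdAndInvert(const std::vector<Rgb>& pixels,
                                             int threshold = 50)
{
    std::vector<std::uint8_t> out;
    out.reserve(pixels.size());
    for (const Rgb& p : pixels) {
        out.push_back(isLight(p, threshold) ? 0 : 255);
    }
    return out;
}
```

Keeping the threshold as a parameter makes it easy to experiment when the default of 50 does not separate text from background well.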

One thing I don't like is the way I declare the size of the pix outside the function. It seems that if I try to do it within the function I get unexpected results... if the memory is allocated while inside, it is destroyed when I leave.
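
The failing version is not shown, so this is only a guess at the cause: if the new allocation was assigned to a pointer parameter passed by value, the caller's pointer never changes, which can look as if the memory vanished on return. Here is a minimal sketch of the difference, using a made-up `Image` struct as a stand-in for Pix. All names here are hypothetical.

```cpp
// Hypothetical stand-in for a heap-allocated image such as Leptonica's Pix.
struct Image {
    int width;
    int height;
};

// Pointer passed by value: the assignment rebinds only the local copy,
// so the caller's pointer is left untouched.
void allocateByValue(Image* img, int w, int h)
{
    img = new Image{w, h};
    delete img;  // nobody else can reach it, so free it here
}

// Pointer passed by reference: the caller's pointer is updated.
void allocateByReference(Image*& img, int w, int h)
{
    img = new Image{w, h};
}

// Often simplest: return the new object and let the caller own it.
Image* createImage(int w, int h)
{
    return new Image{w, h};
}
```

With Leptonica the third form corresponds to calling pixCreate() inside the helper and returning the Pix*, leaving the caller to call pixDestroy() on it.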

g m a i l Certainly not my most elegant work but I also gutted the hell out of it for simplicity. Why I bother to share this I don't know. I should have kept it to myself. What is my name? Kage.Sabaku.No.Gaara

Before I let you go, I should mention the subtle differences between my Windows Forms app and the default settings: namely, I use the "multi-byte" character set (in the project properties and such). Give a dog a bone, maybe a vote?

P.P.S. I hate to say it, but I made one change to host.c; if you use 64-bit you can do the same. Otherwise you're on your own... but my reason was a bit insane, so you don't have to.

typedef unsigned int uinT32;
#if (_MSC_VER >= 1200)            //%%% vkr for VC 6.0
typedef _int64 inT64;
typedef unsigned _int64 uinT64;
#else
typedef long long int inT64;
typedef unsigned long long int uinT64;
#endif                           //%%% vkr for VC 6.0
typedef float FLOAT32;
typedef double FLOAT64;
typedef unsigned char BOOL8;


Answer 4:

Marko, I've tried to write a quick C++ app as well using Tesseract and ran into the same problems.

In a nutshell, I found it confusing, with few examples/docs, but I don't fault the product. Heck, it's open source and the contributors are probably more interested in improving it than marketing.

You could try poking around in the source code, and spending some time on it might give you an understanding, but I can totally relate to your frustration.

Good luck!