Perl Image::OCR::Tesseract module on Windows

Does anyone out there know of a graceful way to install the "Image::OCR::Tesseract" module on Windows? The module fails to install on Windows via CPAN because of a *NIX-only dependency called "LEOCHARRE::CLI". This dependency does not seem to be required to run "Image::OCR::Tesseract" itself.

I've managed to get the module working by first manually installing the dependency modules listed in Makefile.PL (except for "LEOCHARRE::CLI") and then copying the module file into the correct directory structure under "C:\Perl\site\lib\Image\OCR". The final step was to alter the code that calls the ImageMagick and Tesseract executables from the command line so that the program names are wrapped in quotes when the module invokes them.

This works, but I'd really feel better about doing a PPM or CPAN install on a production system from a repo that works on Windows.

Never mind, I got it, though I can't decide which is the better solution.

Getting the installer to work on Windows via the traditional "perl Makefile.PL, make, make test, make install" routine requires an edit to the Makefile.PL script, adding the missing Windows module (Devel::AssertOS::MSWin32), and a patch to AssertEXE.pm so that it uses "File::Which" rather than the shell "which" command that Windows lacks. Even then, "Image::OCR::Tesseract" itself still has to be patched to put quotes around the program names when executing "convert" and "tesseract" from the command line.
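The quoting step can be sketched on its own. This is only an illustration under my own assumptions: quote_program() is a hypothetical helper, not something the module ships, and the example paths are made up. It checks that a path was actually found before quoting it, because a quoted empty value ('""') counts as true in Perl and would slip past a later "is it installed?" test.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Image::OCR::Tesseract): wrap a program
# path in double quotes on Windows so that a location containing spaces
# survives being placed in a command line.
sub quote_program {
    my ($path, $os) = @_;
    $os = $^O unless defined $os;    # default to the running platform

    # Check for a missing path first; quoting an undefined value would
    # give '""', which is a true value and would hide the failure.
    defined $path && length $path
        or die "no program path given\n";

    return $path unless $os =~ /MSWin/;    # leave other platforms alone
    return $path if $path =~ /^".*"$/;     # already quoted
    return qq{"$path"};
}

# Illustrative paths only; the real ones come from the PATH lookup.
print quote_program('C:\Program Files\Tesseract-OCR\tesseract.exe', 'MSWin32'), "\n";
print quote_program('/usr/bin/tesseract', 'linux'), "\n";
```

The second argument exists only so the Windows branch can be exercised from any platform; in real use it would be left out.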

Given the number of steps needed to make the installer work on Windows, and the fact that the module does not build a binary component to link against, I'd say the best option for getting the Tesseract module working on Windows is to first install the following binary packages:

ImageMagick http://www.imagemagick.org/script/binary-releases.php

Tesseract http://code.google.com/p/tesseract-ocr/downloads/list

Next, locate your Perl module directory - on my system it is "C:\Perl\site\lib\". Create a folder "Image", if you don't have one. Next, open the Image folder and create a folder called "OCR". Open the OCR folder. At this point, your path should be something along the lines of "C:\Perl\site\lib\Image\OCR\". Create a new text file called "Tesseract.pm", and copy in the following content...

package Image::OCR::Tesseract;
use strict;
use Carp;
use Cwd;
use String::ShellQuote 'shell_quote';
use Exporter;
use vars qw(@EXPORT_OK @ISA $VERSION $DEBUG $WHICH_TESSERACT $WHICH_CONVERT %EXPORT_TAGS @TRASH);
@ISA = qw(Exporter);
@EXPORT_OK = qw(get_ocr get_hocr _tesseract convert_8bpp_tif tesseract);
$VERSION = sprintf "%d.%02d", q$Revision: 1.24 $ =~ /(\d+)/g;
%EXPORT_TAGS = ( all => \@EXPORT_OK );


BEGIN {
   use File::Which 'which';
   $WHICH_TESSERACT = which('tesseract');
   $WHICH_CONVERT   = which('convert');

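   $WHICH_TESSERACT or die("Is tesseract installed? Cannot find bin path to tesseract.");
   $WHICH_CONVERT or die("Is convert installed? Cannot find bin path to convert.");

   # On Windows, quote the paths so install locations containing spaces work.
   # This has to come after the checks above: quoting an undefined path
   # gives '""', which is true and would hide a missing executable.
   if($^O=~m/MSWin/) {
      $WHICH_TESSERACT='"'.$WHICH_TESSERACT.'"';
      $WHICH_CONVERT='"'.$WHICH_CONVERT.'"';
   }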
}

END {
   scalar @TRASH or return;
   if ( $DEBUG ){
      print STDERR "Debug on, these are trash files:\n".join("\n",@TRASH) ;
   }
   else {
      unlink @TRASH;
   }
}

sub DEBUG { Carp::cluck("Image::OCR::Tesseract::DEBUG() deprecated") }

sub get_hocr {
   my ($abs_image,$abs_tmp_dir,$lang)= @_;
   -f $abs_image or croak("$abs_image is not a file on disk");
   my $hocr="hocr";
   if(defined $abs_tmp_dir){

      -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");

      # accept both / and \ as path separators so Windows paths work too
      $abs_image=~/([^\/\\]+)$/ or die("cant match filename in path arg '$abs_image'");
      my $abs_copy = "$abs_tmp_dir/$1";

      # TODO, what if source and dest are same, i want it to die
      require File::Copy;
      File::Copy::copy($abs_image, $abs_copy) 
         or die("cant make copy of $abs_image to $abs_copy, $!");

      # change the image to get ocr from to be the copy
      $abs_image = $abs_copy;
      # since it's a copy. erase that on exit
      push @TRASH, $abs_image;      
   }

   my $tmp_tif = convert_8bpp_tif($abs_image);

   push @TRASH, $tmp_tif; # for later delete

   _tesseract($tmp_tif,$lang,$hocr) || '';
}

sub get_ocr {
   my ($abs_image,$abs_tmp_dir,$lang)= @_;
   -f $abs_image or croak("$abs_image is not a file on disk");
   if(defined $abs_tmp_dir){

      -d $abs_tmp_dir or die("tmp dir arg $abs_tmp_dir not a dir on disk.");

      # accept both / and \ as path separators so Windows paths work too
      $abs_image=~/([^\/\\]+)$/ or die("cant match filename in path arg '$abs_image'");
      my $abs_copy = "$abs_tmp_dir/$1";

      # TODO, what if source and dest are same, i want it to die
      require File::Copy;
      File::Copy::copy($abs_image, $abs_copy) 
         or die("cant make copy of $abs_image to $abs_copy, $!");

      # change the image to get ocr from to be the copy
      $abs_image = $abs_copy;
      # since it's a copy. erase that on exit
      push @TRASH, $abs_image;      
   }

   my $tmp_tif = convert_8bpp_tif($abs_image);

   push @TRASH, $tmp_tif; # for later delete

   _tesseract($tmp_tif,$lang) || '';
}

sub convert_8bpp_tif {
   my ($abs_img,$abs_out) = (shift,shift);
   defined $abs_img or die('missing image arg');

   $abs_out ||= $abs_img.'.tmp.'.time().(int rand(9000)).'.tif';

   my @arg = ( $WHICH_CONVERT, $abs_img, '-compress','none','+matte', $abs_out );

   #die (join(" ", @arg));

   system(@arg) == 0 or die("convert $abs_img error.. $?");

   $DEBUG and warn("made $abs_out 8bpp tiff.");
   $abs_out;
}



# people expect tesseract to automatically convert

*tesseract = \&_tesseract;
sub _tesseract {
    my ($abs_image,$lang,$hocr) = @_;
   defined $abs_image or croak('missing image path arg');

   $abs_image=~/\.tif+$/i or warn("Are you sure '$abs_image' is a tif image? This operation may fail.");

   #my @arg = (
   #   $WHICH_TESSERACT, shell_quote($abs_image), shell_quote($abs_image), 
   #   (defined $lang and ('-l', $lang) ), '2>/dev/null'
   #); 

   # File::Spec->devnull is /dev/null on *NIX and nul on Windows
   require File::Spec;
   my $cmd = 
      ( sprintf '%s %s %s', 
         $WHICH_TESSERACT, 
         shell_quote($abs_image), 
         shell_quote($abs_image) 
      ) .
      ( defined $lang ? " -l $lang" : '' ) .
      ( defined $hocr ? " hocr" : '' ) .
      "  2>" . File::Spec->devnull;
   $DEBUG and warn "command: $cmd";

    system($cmd); # hard to check ==0 

    my $txt = $abs_image.($hocr?".html":".txt");
   unless( -f $txt ){      
        Carp::cluck("no text output for image '$abs_image'. (No text file '$txt' found on disk)");
      return;
   }

    $DEBUG and warn "Found text file '$txt'";

   my $content = (_slurp($txt) || '');   
   $DEBUG and warn("content length of text in '$txt' from image '$abs_image' is ". length $content );
   push @TRASH, $txt;

   $content;
}

sub _slurp {
   my $abs = shift;
   open(my $fh,'<', $abs) or die("can't open file for reading '$abs', $!");
   local $/;
   my $txt = <$fh>;
   close $fh;
   $txt;
}  

1;


__END__

#sub _force_imgtype {
#   my $img = shift;
#   my $type = shift;
#   my $delete_original = shift;
#   $delete_original ||=0;
#   
#
#   if($img=~/\.$type$/i){
#      return $img;
#   }
#
#   my $img_out= $img;
#   $img_out=~s/\.\w{1,5}$/\.$type/ or die("cant get file ext for $img");
#
#
#
#}

Save and close the file. If you had a command prompt open from before you installed the ImageMagick and Tesseract binaries, close it and open a new one so that it picks up any PATH changes made by the installers. Then test the module with the following script:

use strict;
use warnings;
use Image::OCR::Tesseract;
my $image = 'SomeImageFileThatContainsText.jpg';

my $text = Image::OCR::Tesseract::get_ocr($image);

print "Text...\n";
print $text."\n";

print "Normal Exit\n";

exit;
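If the test script cannot find "tesseract" or "convert", a quick look at the PATH may help. Below is a rough, core-only sketch of the kind of lookup File::Which performs. It is not that module's actual code, and find_on_path() is a name I made up for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: search the PATH directories for a program name.
sub find_on_path {
    my ($name) = @_;

    # On Windows, executables carry an extension listed in PATHEXT
    # (for example .EXE); elsewhere only the bare name is tried.
    my @exts = ('');
    if ($^O =~ /MSWin/) {
        push @exts, split /;/, ($ENV{PATHEXT} || '.COM;.EXE;.BAT;.CMD');
    }

    for my $dir (File::Spec->path) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile($dir, $name . $ext);
            # Require a plain file; off Windows it must also be executable.
            return $candidate if -f $candidate && ($^O =~ /MSWin/ || -x _);
        }
    }
    return;    # not found
}

for my $program (qw(tesseract convert)) {
    my $found = find_on_path($program);
    print "$program: ", (defined $found ? $found : 'NOT FOUND on PATH'), "\n";
}
```

One thing to watch for on Windows: the system ships its own convert.exe (a disk utility) in the System32 folder. If the path reported for "convert" points there rather than at ImageMagick, the order of the PATH entries is the likely cause.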

That's it. Messy, I know, but there's no good way around it: the module installer really needs to be updated to support Windows (and other) systems, even though the module code itself runs almost without modification. In fact, if Tesseract and ImageMagick were installed to paths without spaces, the "Image::OCR::Tesseract" code would not need any changes at all; this minor tweak just lets the supporting executables be installed anywhere, including the default locations.
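For anyone repeating this on more than one machine, the manual folder setup described earlier could be scripted. This is a minimal sketch, assuming you pass the Perl library root (for example C:\Perl\site\lib) as the first argument. prepare_module_path() is a hypothetical helper of my own, and the script only creates folders; you still paste the module text into the file yourself.

```perl
use strict;
use warnings;
use Config;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: create <lib_root>/Image/OCR if needed and return
# the full path where Tesseract.pm should be saved.
sub prepare_module_path {
    my ($lib_root) = @_;
    my $dir = File::Spec->catdir($lib_root, 'Image', 'OCR');
    make_path($dir) unless -d $dir;
    -d $dir or die "could not create $dir\n";
    return File::Spec->catfile($dir, 'Tesseract.pm');
}

my $lib_root = shift @ARGV;
if (defined $lib_root) {
    print "Save the module as: ", prepare_module_path($lib_root), "\n";
}
else {
    # With no argument, only report where this perl keeps site modules.
    print "Site library of this perl: $Config{sitelib}\n";
}
```

Run without arguments, it changes nothing and just prints the site library directory of the running perl, which is a reasonable value to pass back in.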