Separate image of text into component character images

Posted 2019-02-10 04:41

I'd like to separate an image of text into its component characters, also as images. For example, using the sample below I'd end up with 14 images.

I'm only going to be using text on a single line, so the y-height is unimportant - what I need to find is the beginning and end of each letter and crop to those coordinates. That way I would also avoid problems with 'i','j', etc.

I'm new to image processing, and I'm not sure how to go about it. Some form of edge detection? Is there a way to determine contiguous regions of solid colour? Any help is great.

Trying to improve my Python skills and familiarity with some of the many libraries available, so I'm using the Python Imaging Library (PIL), but I've also had a look at OpenCV.


Sample image:

(sample image: black text reading "This is some text" on a white background)

6 Answers
放荡不羁爱自由
#2 · 2019-02-10 05:16

I know I am few years late :-) but you can do this sort of thing with ImageMagick pretty easily now, straight at the command-line without compiling anything, as it has Connected Component Analysis built-in:

Here is one way to do it:

#!/bin/bash
image="$1"
draw=$(convert "$image"                            \
   -threshold 50%                                  \
   -define connected-components:verbose=true       \
   -define connected-components:area-threshold=10  \
   -connected-components 8                         \
   -auto-level objects.png | \
   awk 'BEGIN{command=""}
        /\+0\+0/||/id:/{next}
        {
          geom=$2
          gsub(/x/," ",geom)
          gsub(/\+/," ",geom)
          split(geom,a," ")
          d=sprintf("-draw \x27rectangle %d,%d %d,%d\x27 ",a[3],a[4],a[3]+a[1],a[4]+a[2])
          command = command d
          #printf "%d,%d %d,%d\n",a[3],a[4],a[3]+a[1],a[4]+a[2]
        }
        END{print command}')

eval convert "$image" -fill none -strokewidth 2 -stroke red $draw result.png

The result looks like this:

(result image: the sample text with a red bounding box drawn around each character)

First, I threshold your image at 50% so that there are only pure blacks and whites in it, no tonal gradations. Then I tell ImageMagick to output details of the bounding boxes it finds, and that I am not interested in objects smaller than 10 pixels of total area. I then allow pixels to be 8-connected, i.e. to their diagonal neighbours (NE,SE,NW,SW) as well as their left-right and above-below neighbours. Finally I parse the bounding box output with awk to draw in red lines around the bounding boxes.

The output of the initial command that I parse with awk looks like this:

Objects (id: bounding-box centroid area mean-color):
  0: 539x53+0+0 263.7,24.3 20030 srgba(255,255,255,1)
  11: 51x38+308+14 333.1,30.2 869 srgba(0,0,0,1)
  13: 35x39+445+14 461.7,32.8 670 srgba(0,0,0,1)
  12: 35x39+365+14 381.7,32.8 670 srgba(0,0,0,1)
  2: 30x52+48+0 60.4,27.0 634 srgba(0,0,0,1)
  1: 41x52+1+0 20.9,16.6 600 srgba(0,0,0,1)
  8: 30x39+174+14 188.3,33.1 595 srgba(0,0,0,1)
  7: 30x39+102+14 116.3,33.1 595 srgba(0,0,0,1)
  9: 30x39+230+14 244.3,33.1 595 srgba(0,0,0,1)
  10: 35x39+265+14 282.2,33.0 594 srgba(0,0,0,1)
  16: 33x37+484+15 500.2,33.0 520 srgba(0,0,0,1)
  17: 22x28+272+19 282.3,32.8 503 srgba(255,255,255,1)
  5: 18x51+424+2 432.5,27.9 389 srgba(0,0,0,1)
  6: 18x51+520+2 528.5,27.9 389 srgba(0,0,0,1)
  15: 6x37+160+15 162.5,33.0 222 srgba(0,0,0,1)
  14: 6x37+88+15 90.5,33.0 222 srgba(0,0,0,1)
  18: 22x11+372+19 382.6,24.9 187 srgba(255,255,255,1)
  19: 22x11+452+19 462.6,24.9 187 srgba(255,255,255,1)
  3: 6x8+88+0 90.5,3.5 48 srgba(0,0,0,1)
  4: 6x8+160+0 162.5,3.5 48 srgba(0,0,0,1)

and the awk script turns that into this:

convert http://imgur.com/AVW7A.png -fill none -strokewidth 2 -stroke red \
-draw 'rectangle 308,14 359,52'        \
-draw 'rectangle 445,14 480,53'        \
-draw 'rectangle 365,14 400,53'        \
-draw 'rectangle 48,0 78,52'           \
-draw 'rectangle 1,0 42,52'            \
-draw 'rectangle 174,14 204,53'        \
-draw 'rectangle 102,14 132,53'        \
-draw 'rectangle 230,14 260,53'        \
-draw 'rectangle 265,14 300,53'        \
-draw 'rectangle 484,15 517,52'        \
-draw 'rectangle 272,19 294,47'        \
-draw 'rectangle 424,2 442,53'         \
-draw 'rectangle 520,2 538,53'         \
-draw 'rectangle 160,15 166,52'        \
-draw 'rectangle 88,15 94,52'          \
-draw 'rectangle 372,19 394,30'        \
-draw 'rectangle 452,19 474,30'        \
-draw 'rectangle 88,0 94,8'            \
-draw 'rectangle 160,0 166,8' result.png
Emotional °昔
#3 · 2019-02-10 05:18

Um, this is actually very easy for the sample you provided:

start at left edge
  go right 1 column at a time until the current column contains black (a letter)
  this is the start of the character
  go right again till no black at all in current column
  end of character
repeat till end of image

(Incidentally, this also works for splitting a paragraph into lines.)
If the letters overlap or share columns, it gets a little more interesting.
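The column-scan above can be sketched in Python with PIL (which the asker mentions). This is a minimal illustration on a synthetic image where the "characters" are plain black rectangles; the function name `split_columns` and the test data are mine, not part of the answer:

```python
from PIL import Image, ImageDraw

def split_columns(img, threshold=128):
    """Return (start, end) column ranges that contain dark pixels."""
    gray = img.convert("L")
    w, h = gray.size
    px = gray.load()
    # A column is "inked" if any pixel in it is darker than threshold.
    inked = [any(px[x, y] < threshold for y in range(h)) for x in range(w)]
    spans, start = [], None
    for x, on in enumerate(inked):
        if on and start is None:
            start = x                   # left edge of a character
        elif not on and start is not None:
            spans.append((start, x))    # one past the right edge
            start = None
    if start is not None:
        spans.append((start, w))
    return spans

# Synthetic test image: three black "characters" on white.
img = Image.new("L", (40, 10), 255)
d = ImageDraw.Draw(img)
for x0 in (2, 12, 25):
    d.rectangle([x0, 2, x0 + 5, 8], fill=0)

spans = split_columns(img)
chars = [img.crop((a, 0, b, img.height)) for a, b in spans]
print(spans)
```

Cropping each span to the full image height matches the question's requirement: only the start and end of each letter matter, the y-extent is kept whole.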

Edit:

@Andres, no, it works fine for 'U'; you have to look at all of each column:

 U   U
 U   U
 U   U
 U   U
  UUU
 01234

0,4:everything but bottom row
1-3:only bottom row
爱情/是我丢掉的垃圾
#4 · 2019-02-10 05:27

You could start with a simple connected components analysis (CCA) algorithm, which can be implemented quite efficiently with a scanline algorithm (you just keep track of merged regions and relabel at the end). This would give you separately numbered "blobs" for each continuous region, which would work for most (but not all) letters. Then you can simply take the bounding box of each connected blob, and that will give you the outline for each. You can even maintain the bounding box as you apply CCA for efficiency.

So in your example, the first word from the left after CCA would result in something like:

1111111  2         3
   1     2
   1     2 4444    5  666
   1     22    4   5 6
   1     2     4   5  666
   1     2     4   5     6
   1     2     4   5  666

with equivalence classes of 4=2.

Then the bounding boxes of each blob gives you the area around the letter. You will run into problems with letters such as i and j, but they can be special-cased. You could look for a region less than a certain size, which is above another region of a certain width (as a rough heuristic).
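A quick way to experiment with the CCA-plus-bounding-box idea is to use `scipy.ndimage.label` as a stand-in for a hand-rolled scanline pass (NumPy/SciPy are my assumption here, not part of the answer; the synthetic blobs are also mine):

```python
import numpy as np
from scipy import ndimage

# Tiny synthetic binary image: 1 = ink, 0 = background.
img = np.zeros((7, 20), dtype=np.uint8)
img[1:6, 2:4] = 1    # blob 1
img[1:6, 7:9] = 1    # blob 2
img[2:5, 13:17] = 1  # blob 3

# 8-connectivity (diagonal neighbours count), as discussed above.
labels, n = ndimage.label(img, structure=np.ones((3, 3), dtype=int))

# One bounding box per blob, as (x0, y0, x1, y1) with exclusive ends.
boxes = [(sl[1].start, sl[0].start, sl[1].stop, sl[0].stop)
         for sl in ndimage.find_objects(labels)]
print(n, boxes)
```

`ndimage.label` handles the merging of equivalence classes (the 4=2 case above) internally, and `find_objects` gives the bounding slices directly.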

The cvBlobsLib library in OpenCV should do most of this for you.

时光不老,我们不散
#5 · 2019-02-10 05:36

I've been playing around with OCRopus recently, an open-source text-analysis and OCR-preprocessing tool. As part of its workflow, it also creates the images you want. Maybe this helps you, although no Python magic is involved.

The star
#6 · 2019-02-10 05:38

This is not an easy task, especially if the background is not uniform. If what you have is already a binary image like the example, it is slightly simpler.

If your image is not binary, you can start by applying a thresholding algorithm (Otsu's adaptive threshold works well).

After that you can use a labelling algorithm to identify each 'island' of pixels that forms a shape (each character, in this case).

The problem arises when you have noise: shapes that were labelled but aren't of interest. In this case you can use heuristics to decide whether a shape is a character (normalized area, position of the object if your text is in a well-defined place, etc.). If that is not enough, you will need to deal with more complex stuff like shape feature extraction and some sort of pattern-recognition algorithm, such as a multilayer perceptron.
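As a rough illustration of the thresholding step, here is a minimal Otsu implementation in plain NumPy (the function name and the synthetic bimodal data are my own, not from the answer):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the intensity maximising between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)  # sum of all pixel values
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0.0
    for t in range(256):
        w_bg += hist[t]                     # background weight
        if w_bg == 0:
            continue
        w_fg = total - w_bg                 # foreground weight
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Bimodal test image: dark "ink" around 10, light paper around 240.
img = np.concatenate([np.full(100, 10), np.full(300, 240)])
img = img.astype(np.uint8).reshape(20, 20)
t = otsu_threshold(img)
binary = img <= t  # True = ink
```

On a clean, strongly bimodal histogram like this the threshold lands between the two peaks; real scans will need the noise heuristics the answer describes on top.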

To finish, this seems like an easy task, but depending on the quality of your image it can get harder. The algorithms cited here are easy to find on the internet, and are also implemented in libraries such as OpenCV.

Any more help, just ask, if I can help of course ;)

放荡不羁爱自由
#7 · 2019-02-10 05:39

The problem you have posed is really hard: it took some of the world's best image-processing researchers quite some time to solve. The solution is a major part of the DjVu image-compression and display toolset: their first step in compressing a document is to identify the foreground and split it into characters. They then use that information to aid compression, because the image of one lowercase 'e' is much like another, so the compressed document need only contain the differences. You'll find links to a number of technical papers at http://djvu.org/resources/; a good place to start is High Quality Document Image Compression with DjVu.

A good many of the tools in the DjVu suite have been open-sourced as djvulibre; unfortunately, I have not been able to figure out how to pull out the foreground (or individual characters) using the existing command-line tools. I would be very interested to see it done.
