What is the typical method to separate connected l

Posted 2019-01-25 17:37

I am very new to OCR and know almost nothing about the algorithms used to recognize words. I am just getting familiar with the field.

Could anybody please advise on the typical method used to recognize and separate individual characters in connected form (that is, in a word where all the letters are linked together)? Leaving handwriting aside and supposing the letters are joined using a known font, what is the best method to determine each individual character in a word? When characters are written separately there is no problem, but when they are joined we need to know where every character starts and ends before moving on to the next step and matching each one to a letter. Is there a known algorithm for that?

Tags: algorithm ocr

1 answer

不美不萌又怎样

#2 · 2019-01-25 17:40

The standard term for this process is "character segmentation"; segmentation is the image-processing term for breaking an image into grouped areas for recognition. Searching Google Scholar for "Arabic character segmentation" throws up a lot of hits if you want to learn more.
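To make the idea concrete, here is a minimal sketch of one classical segmentation heuristic for a clean, known font: the vertical projection profile. It counts ink pixels per column and proposes a cut wherever only a thin connecting stroke remains. The function names, the `max_ink` threshold and the toy image are assumptions made up for this illustration, and real connected scripts usually need something more robust than this.

```python
def column_profile(image):
    """Count ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def cut_columns(image, max_ink=1):
    """Return candidate cut columns between connected characters.

    A column is a candidate when it holds at most max_ink ink pixels,
    i.e. only a thin joining stroke (or nothing). Each run of adjacent
    candidate columns is collapsed to its middle column.
    """
    cuts, run = [], []
    for x, ink in enumerate(column_profile(image)):
        if ink <= max_ink:
            run.append(x)
        elif run:
            cuts.append(run[len(run) // 2])
            run = []
    if run:
        cuts.append(run[len(run) // 2])
    return cuts


# Toy word: two thick "letters" joined by a one-pixel baseline stroke.
ROWS = [
    "11000011",
    "11000011",
    "11111111",
]
IMAGE = [[int(c) for c in row] for row in ROWS]

if __name__ == "__main__":
    print(column_profile(IMAGE))
    print(cut_columns(IMAGE))
```

The threshold is the weak point of this approach: a letter with a thin stroke of its own gets cut in half, which is one reason word-level methods are attractive.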

I'd encourage you to look at Tesseract, an open-source OCR implementation, and especially at its documentation.

The glossary entry for "Feature" has a bit on this, and there is a ton more information in the project's documents.

Basically, Tesseract sidesteps the problem (per "How Tesseract Works") by looking at blobs rather than letters, then combining those blobs into words. This avoids the problem you describe, but creates new ones.
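To illustrate what "looking at blobs" can mean, here is a small sketch that finds blobs as 4-connected groups of ink pixels and returns their bounding boxes. This is not Tesseract's code; the function name `find_blobs`, the choice of 4-connectivity and the toy image are assumptions for the example.

```python
from collections import deque


def find_blobs(image):
    """Return sorted bounding boxes (left, top, right, bottom) of the
    4-connected ink regions in a binary image (1 = ink, 0 = background)."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


# Toy line: the first two strokes are joined at the baseline, the third is not.
BLOB_ROWS = [
    "1100011",
    "1100011",
    "1111011",
]
BLOB_IMAGE = [[int(c) for c in row] for row in BLOB_ROWS]

if __name__ == "__main__":
    print(find_blobs(BLOB_IMAGE))
```

Note that the joined strokes come back as a single blob, so a blob is not the same thing as a letter, which is exactly the distinction the answer draws.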

For Arabic (as you point out), Tesseract doesn't work. I don't know much about this area, but this paper seems to imply that Dynamic Time Warping (DTW) is a useful technique. It tries to stretch the words to match them against known words, so again it works in word space rather than letter space.
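As a rough sketch of the stretching idea, the textbook DTW recurrence below computes the cheapest alignment between two numeric sequences and then picks the closest known word. Using per-column ink counts as the sequence, and the template labels `"a"` and `"b"`, are assumptions for this example only; the features used in the paper are not stated here.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping cost between two numeric sequences.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j]; each step
    may advance in a, in b, or in both, which is what lets one sequence
    be stretched against the other.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def best_match(query, templates):
    """Return the label of the template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical word templates, described by ink count per column.
TEMPLATES = {"a": [3, 1, 3], "b": [1, 3, 1]}

if __name__ == "__main__":
    # A horizontally stretched rendering of template "a".
    print(best_match([3, 3, 1, 1, 3], TEMPLATES))
```

Because the whole word is matched at once, no cut points between letters are ever needed; the trade-off is that every word to be recognized needs a template.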
