Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?

Posted 2019-08-19 18:02

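In the Tesseract FAQ, they say that you can:

How do I get coordinates and confidence of each character?

There are two options. If you don't want to get into programming, you can use Tesseract's hOCR output format (read the Tesseract man page for details).

But when I create a sample hOCR output (which is an .html file), the bounding boxes and confidence levels are only available at the word level.

Am I missing something here?

I have added a sample input/output as an illustration (the input has been resized).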


This is the input image:


This is Tesseract's hOCR output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "in2.tif"; bbox 0 0 1882 354'>
<div class='ocr_carea' id='block_1_1' title="bbox 78 59 457 100">
<p class='ocr_par'>
<span class='ocr_line' id='line_1_1' title="bbox 78 61 456 97"><span class='ocr_word' id='word_1_1' title="bbox 78 62 175 97"><span class='ocrx_word' id='xword_1_1' title="x_wconf -2">Dear</span></span> <span class='ocr_word' id='word_1_2' title="bbox 205 62 271 96"><span class='ocrx_word' id='xword_1_2' title="x_wconf -14">Mr:</span></span> <span class='ocr_word' id='word_1_3' title="bbox 303 61 456 97"><span class='ocrx_word' id='xword_1_3' title="x_wconf -2">Grover:</span></span></span>
</p>
</div>
<div class='ocr_carea' id='block_1_2' title="bbox 75 154 1842 317">
<p class='ocr_par'>
<span class='ocr_line' id='line_1_2' title="bbox 78 161 1787 210"><span class='ocr_word' id='word_1_4' title="bbox 78 161 111 196"><span class='ocrx_word' id='xword_1_4' title="x_wconf -2">If</span></span> <span class='ocr_word' id='word_1_5' title="bbox 137 161 270 205"><span class='ocrx_word' id='xword_1_5' title="x_wconf -2">you&#39;ve</span></span> <span class='ocr_word' id='word_1_6' title="bbox 298 162 393 197"><span class='ocrx_word' id='xword_1_6' title="x_wconf -1">been</span></span> <span class='ocr_word' id='word_1_7' title="bbox 422 161 571 206"><span class='ocrx_word' id='xword_1_7' title="x_wconf -3">looking</span></span> <span class='ocr_word' id='word_1_8' title="bbox 598 162 657 197"><span class='ocrx_word' id='xword_1_8' title="x_wconf -2">for</span></span> <span class='ocr_word' id='word_1_9' title="bbox 685 174 707 198"><span class='ocrx_word' id='xword_1_9' title="x_wconf -1">a</span></span> <span class='ocr_word' id='word_1_10' title="bbox 734 162 929 207"><span class='ocrx_word' id='xword_1_10' title="x_wconf -4">reporting</span></span> <span class='ocr_word' id='word_1_11' title="bbox 956 163 1031 198"><span class='ocrx_word' id='xword_1_11' title="x_wconf -1">tool</span></span> <span class='ocr_word' id='word_1_12' title="bbox 1059 162 1140 199"><span class='ocrx_word' id='xword_1_12' title="x_wconf -3">that</span></span> <span class='ocr_word' id='word_1_13' title="bbox 1168 164 1294 199"><span class='ocrx_word' id='xword_1_13' title="x_wconf -4">allows</span></span> <span class='ocr_word' id='word_1_14' title="bbox 1321 175 1428 200"><span class='ocrx_word' id='xword_1_14' title="x_wconf -1">users</span></span> <span class='ocr_word' id='word_1_15' title="bbox 1456 169 1494 200"><span class='ocrx_word' id='xword_1_15' title="x_wconf -3">to</span></span> <span class='ocr_word' id='word_1_16' title="bbox 1523 169 1649 200"><span class='ocrx_word' id='xword_1_16' title="x_wconf -2">create</span></span> <span class='ocr_word' id='word_1_17' 
title="bbox 1677 170 1787 210"><span class='ocrx_word' id='xword_1_17' title="x_wconf -3">great</span></span></span>
<span class='ocr_line' id='line_1_3' title="bbox 77 210 1841 260"><span class='ocr_word' id='word_1_18' title="bbox 77 210 226 256"><span class='ocrx_word' id='xword_1_18' title="x_wconf -3">looking</span></span> <span class='ocr_word' id='word_1_19' title="bbox 253 216 399 256"><span class='ocrx_word' id='xword_1_19' title="x_wconf -4">reports</span></span> <span class='ocr_word' id='word_1_20' title="bbox 427 211 581 256"><span class='ocrx_word' id='xword_1_20' title="x_wconf -3">quickly,</span></span> <span class='ocr_word' id='word_1_21' title="bbox 613 224 654 248"><span class='ocrx_word' id='xword_1_21' title="x_wconf -2">as</span></span> <span class='ocr_word' id='word_1_22' title="bbox 682 213 763 248"><span class='ocrx_word' id='xword_1_22' title="x_wconf -1">well</span></span> <span class='ocr_word' id='word_1_23' title="bbox 792 224 832 248"><span class='ocrx_word' id='xword_1_23' title="x_wconf -1">as</span></span> <span class='ocr_word' id='word_1_24' title="bbox 859 212 1056 258"><span class='ocrx_word' id='xword_1_24' title="x_wconf -4">providing</span></span> <span class='ocr_word' id='word_1_25' title="bbox 1083 212 1144 249"><span class='ocrx_word' id='xword_1_25' title="x_wconf -2">the</span></span> <span class='ocr_word' id='word_1_26' title="bbox 1173 214 1315 249"><span class='ocrx_word' id='xword_1_26' title="x_wconf -2">control</span></span> <span class='ocr_word' id='word_1_27' title="bbox 1344 215 1417 249"><span class='ocrx_word' id='xword_1_27' title="x_wconf -2">and</span></span> <span class='ocr_word' id='word_1_28' title="bbox 1445 214 1639 250"><span class='ocrx_word' id='xword_1_28' title="x_wconf -2">industrial</span></span> <span class='ocr_word' id='word_1_29' title="bbox 1667 215 1841 260"><span class='ocrx_word' id='xword_1_29' title="x_wconf -3">strength</span></span></span>
<span class='ocr_line' id='line_1_4' title="bbox 76 260 1370 306"><span class='ocr_word' id='word_1_30' title="bbox 76 261 243 296"><span class='ocrx_word' id='xword_1_30' title="x_wconf -2">features</span></span> <span class='ocr_word' id='word_1_31' title="bbox 272 260 353 297"><span class='ocrx_word' id='xword_1_31' title="x_wconf -2">that</span></span> <span class='ocr_word' id='word_1_32' title="bbox 381 273 427 297"><span class='ocrx_word' id='xword_1_32' title="x_wconf -1">an</span></span> <span class='ocr_word' id='word_1_33' title="bbox 458 261 499 297"><span class='ocrx_word' id='xword_1_33' title="x_wconf -2">IS</span></span> <span class='ocr_word' id='word_1_34' title="bbox 527 262 776 306"><span class='ocrx_word' id='xword_1_34' title="x_wconf -2">professional</span></span> <span class='ocr_word' id='word_1_35' title="bbox 804 263 1110 299"><span class='ocrx_word' id='xword_1_35' title="x_wconf -2">demands...look</span></span> <span class='ocr_word' id='word_1_36' title="bbox 1139 275 1184 299"><span class='ocrx_word' id='xword_1_36' title="x_wconf -1">no</span></span> <span class='ocr_word' id='word_1_37' title="bbox 1212 263 1370 299"><span class='ocrx_word' id='xword_1_37' title="x_wconf -3">further!</span></span></span>
</p>
</div>
</div>
</body>
</html>
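
To make the word-level limitation concrete, here is a minimal Python sketch that pulls each word's bounding box and `x_wconf` value out of an hOCR file shaped like the one above. It assumes the nested `ocr_word` / `ocrx_word` span layout shown in this output (other Tesseract versions may lay the attributes out differently), and the names `HocrWordParser` and `parse_hocr_words` are made up for illustration. It uses only the standard library.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collects word text, bbox and x_wconf from nested ocr_word/ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []         # finished words, in document order
        self._current = None    # word being built, or None outside an ocr_word
        self._in_text = False   # True while inside the inner ocrx_word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr = dict(attrs)
        css_class = attr.get("class", "")
        title = attr.get("title") or ""
        if css_class == "ocr_word":
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            bbox = tuple(int(g) for g in m.groups()) if m else None
            self._current = {"text": "", "bbox": bbox, "conf": None}
        elif css_class == "ocrx_word" and self._current is not None:
            m = re.search(r"x_wconf (-?\d+)", title)
            if m:
                self._current["conf"] = int(m.group(1))
            self._in_text = True

    def handle_data(self, data):
        if self._in_text and self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._in_text:
            self._in_text = False             # closing the inner ocrx_word span
        else:
            self.words.append(self._current)  # closing the outer ocr_word span
            self._current = None


def parse_hocr_words(hocr_text):
    """Return a list of {'text', 'bbox', 'conf'} dicts, one per word."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Calling `parse_hocr_words(open("out.html", encoding="utf-8").read())` (with `out.html` standing in for your own file) gives one entry per word. Nothing in the markup is finer-grained than that, which is exactly the problem described above.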

Answer 1:

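You've already seen it: it does not exist.

So you can either modify the Tesseract source code to support an x_confs attribute in the hOCR output, or use its ResultIterator API class to get the confidence at the character (symbol) level (be sure to call SetVariable("save_blob_choices", "T") after the Init method).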
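
As a rough illustration of the ResultIterator route, here is a sketch in Python. It assumes the third-party `tesserocr` binding is installed (the answer itself refers to the C++ API, so this is a swapped-in equivalent, not the original author's code), and the helper names `symbol_rows` and `recognize_symbols` are hypothetical. `PyTessBaseAPI()` runs `Init` in its constructor, so the `SetVariable` call lands after initialization as advised. Whether `save_blob_choices` is still honoured depends on your Tesseract version.

```python
def symbol_rows(positions, level):
    """Read text, confidence and bbox at each iterator position.

    `positions` is anything yielding objects with GetUTF8Text(level),
    Confidence(level) and BoundingBox(level), such as the values produced
    by tesserocr.iterate_level. Values are read immediately because that
    helper yields the same underlying iterator object on every step.
    """
    rows = []
    for pos in positions:
        rows.append({
            "char": pos.GetUTF8Text(level),
            "conf": pos.Confidence(level),
            "bbox": pos.BoundingBox(level),  # (left, top, right, bottom)
        })
    return rows


def recognize_symbols(image_path):
    """Run OCR on image_path and return one row per recognized symbol."""
    # Imported here so symbol_rows stays usable without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:  # the constructor performs Init
        api.SetVariable("save_blob_choices", "T")
        api.SetImageFile(image_path)
        api.Recognize()
        level = RIL.SYMBOL
        return symbol_rows(iterate_level(api.GetIterator(), level), level)
```

For the sample above you would call `recognize_symbols("in2.tif")` and get a list with one dictionary per character instead of one per word.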



Source: Does Tesseract's hOCR output really contain bounding boxes and confidence levels for each character?