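I have extracted text using Tesseract on iPhone. Now I want to extract the text together with its position, in XML format. I used GetHOCRText, which returns HTML text.
For example: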
<span class='ocr_word' id='word_3_28' title="bbox 55 226 123 243">
<span class='ocrx_word' id='xword_3_28' title="x_wconf -5">Beverage</span>
</span>
Is there any other way to extract text in XML format from Tesseract OCR?
Thanks in advance,
Srividya
A better way to do this is to use ResultIterator. You can iterate at any of these levels: tesseract::RIL_BLOCK, tesseract::RIL_PARA, tesseract::RIL_TEXTLINE, tesseract::RIL_WORD, or tesseract::RIL_SYMBOL.
From https://code.google.com/p/tesseract-ocr/wiki/APIExample :
tesseract::TessBaseAPI api;
// api.Init here
api.SetVariable("save_blob_choices", "T");
// api.SetImage/api.SetRectangle here
api.Recognize(NULL);
tesseract::ResultIterator* ri = api.GetIterator();
tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
if (ri) {
    do {
        // GetUTF8Text returns a new[]-allocated buffer owned by the caller
        const char* word = ri->GetUTF8Text(level);
        float conf = ri->Confidence(level);
        int x1, y1, x2, y2;
        ri->BoundingBox(level, &x1, &y1, &x2, &y2);
        printf("word: '%s'; \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
        delete[] word;
    } while (ri->Next(level));
    delete ri; // the iterator must be deleted after use
}
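If XML is the end goal, the values read in the loop above can be wrapped in elements of your own. The following is only a minimal sketch: the element name `word`, the attribute names, and the helpers `XmlEscape` and `WordToXml` are hypothetical and not part of the Tesseract API.

```cpp
#include <sstream>
#include <string>

// Escape the five characters that are special in XML.
std::string XmlEscape(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;
        }
    }
    return out;
}

// Hypothetical schema (an assumption, not a Tesseract format):
// <word x1=".." y1=".." x2=".." y2=".." conf="..">text</word>
std::string WordToXml(const std::string& text, float conf,
                      int x1, int y1, int x2, int y2) {
    std::ostringstream os;
    os << "<word x1=\"" << x1 << "\" y1=\"" << y1
       << "\" x2=\"" << x2 << "\" y2=\"" << y2
       << "\" conf=\"" << conf << "\">"
       << XmlEscape(text) << "</word>";
    return os.str();
}
```

Inside the loop you could then append `WordToXml(word, conf, x1, y1, x2, y2)` to a string before the `delete[] word;` line, and wrap the result in a root element of your choosing.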
This is not XML, but here is one way to get the position of each character:
// needs <iostream>, <sstream>, <string>
tesseract::TessBaseAPI tesseract;
// tesseract.Init here
tesseract.SetVariable("save_blob_choices", "T"); // for character-level confidence
// tesseract.SetImage/tesseract.SetRectangle here
// One line per character, without spaces/newlines artificially embedded
char* results_as_text = tesseract.GetBoxText(0);
std::istringstream results_as_stream(results_as_text);
std::string result;
std::string letter; // std::string, since a symbol may be a multi-byte UTF-8 sequence
int x1, y1, x2, y2;
while (std::getline(results_as_stream, result)) {
    std::istringstream result_stream(result);
    result_stream >> letter >> x1 >> y1 >> x2 >> y2;
    std::cout << letter << " ((" << x1 << "," << y1 << "),(" << x2 << "," << y2 << "))" << std::endl;
}
delete[] results_as_text; // the returned buffer must be freed with delete[]
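One thing to watch for: as far as I know, GetBoxText uses the box-file convention, where the origin is the bottom-left corner of the image and each line reads left, bottom, right, top, whereas ResultIterator::BoundingBox uses a top-left origin. If you need the two to agree, a small conversion along these lines should work. The `Box` struct and the `BoxToTopLeft` helper are hypothetical names, and you have to supply the image height yourself.

```cpp
// Rectangle in top-left-origin pixel coordinates.
struct Box {
    int left;
    int top;
    int right;
    int bottom;
};

// Convert box-file coordinates (x1=left, y1=bottom, x2=right, y2=top,
// measured from the bottom-left corner) to a top-left origin.
// Assumption: image_height is the height in pixels of the recognized image.
Box BoxToTopLeft(int x1, int y1, int x2, int y2, int image_height) {
    Box b;
    b.left = x1;
    b.top = image_height - y2;
    b.right = x2;
    b.bottom = image_height - y1;
    return b;
}
```

It is worth checking the result against BoundingBox output for the same image before relying on it.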