Excluding super script when extracting text from p

Posted 2019-07-28 10:51

I have extracted text from a PDF line by line using PDFBox so that my algorithm can process it sentence by sentence.

I recognize the end of a sentence as a period (.) followed by a word whose first letter is capitalized. The issue is that when a sentence ends with a word that has a superscript, the extractor treats the superscript as a normal character and places it next to the period.

For example, when the expression "2 to the power of 22" is the last word of a sentence, i.e. it is followed by a period, it is extracted as 2.22, which makes it difficult to identify the end of the sentence.
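
To make the failure concrete, here is a minimal sketch of the boundary rule described above. The question does not show the actual implementation, so the regular expression and the `SentenceSplitter` class name are assumptions used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: splits text at a period that is
// followed by whitespace and then a capital letter.
class SentenceSplitter {
    // The lookahead keeps the capital letter as the start of the next sentence.
    private static final Pattern BOUNDARY = Pattern.compile("\\.\\s+(?=[A-Z])");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        int start = 0;
        while (m.find()) {
            // m.start() points at the period; keep it with the sentence it ends.
            sentences.add(text.substring(start, m.start() + 1).trim());
            start = m.end();
        }
        // Whatever remains after the last boundary is the final sentence.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }
}
```

With this rule, "The size is 4. The next sentence starts here." splits into two sentences, while "The size is 2.22 The next sentence starts here." stays as one, because the period is followed by the digits of the superscript rather than by whitespace and a capital letter.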

Please suggest a way to get rid of the superscript, or a different logic for identifying the end of a sentence.

Thanks.

Tags: parsing extract pdfbox superscript sentence

1 answer

等我变得足够好

#2 · 2019-07-28 11:13

I am answering my own question, as some readers may be directed here.

I solved this by following @mkl's suggestion. After observing the results of getYScale() in PDFStreamEngine.java, I concluded that the size of the superscripts was less than 8.9663. So I added a condition in PDFStreamEngine.java just before the TextPosition is created; that TextPosition is what PDFTextStripper.java then processes. The code is below:

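// Only pass a glyph on to the text stripper when its vertical scale is at
// least the body-text size observed in this document; smaller glyphs
// (the superscripts) are skipped.
if (textXctm.getYScale() >= 8.9663) {
    processTextPosition(
        new TextPosition(
            pageRotation,
            pageWidth,
            pageHeight,
            textMatrixStart,
            endXPosition,
            endYPosition,
            totalVerticalDisplacementDisp,
            widthText,
            spaceWidthDisp,
            c,
            codePoints,
            font,
            fontSizeText,
            (int)(fontSizeText * textMatrix.getXScale())
    ));
}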

Let me know if my approach has any flaws in eliminating only the superscripts. Thanks.
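
The filtering idea can also be sketched on its own, without PDFBox on the classpath. In the sketch below, `Glyph` is a hypothetical stand-in for PDFBox's TextPosition that carries only the text and the vertical scale, and `SuperscriptFilter` is an illustrative name. The threshold 8.9663 is the value observed for this particular document, so other documents would presumably need their own value. A check like this drops every glyph below the threshold, so subscripts and other small text such as footnotes would be removed along with the superscripts.

```java
import java.util.List;

// Hypothetical illustration of the size-threshold filter; not PDFBox code.
class SuperscriptFilter {
    // Stand-in for TextPosition: the extracted text and its vertical scale.
    static final class Glyph {
        final String text;
        final float yScale;

        Glyph(String text, float yScale) {
            this.text = text;
            this.yScale = yScale;
        }
    }

    // Body-text size reported in the answer; specific to that one document.
    static final float MIN_Y_SCALE = 8.9663f;

    // Concatenates only the glyphs whose vertical scale reaches the threshold.
    static String keepBodyText(List<Glyph> glyphs, float minYScale) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (g.yScale >= minYScale) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }
}
```

In PDFBox itself, it may be possible to apply the same check in a subclass of PDFTextStripper that overrides processTextPosition and tests TextPosition.getYScale(), which would avoid editing PDFStreamEngine.java. Whether that fits depends on the PDFBox version in use, so treat it as an assumption to verify.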
