Extracting Hebrew text from PDF using Apache PDFBox

Posted 2019-07-24 17:11

The code below extracts the Hebrew text from http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf, but the Hebrew character "ן" (final nun, U+05DF) is missing from the output. All other text appears to be extracted correctly. Any ideas?

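// Imports assume JUnit 4; the original post does not show them.
import org.junit.Assert;
import org.junit.Test;

public class TestPDFUtil {
    @Test
    public void testHebrewPDF() throws Exception {
        String url = "http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf";
        String text = PDFUtil.readPDF(url);
        System.out.println(text);
        // Fails because "ן" is missing from the extracted text
        Assert.assertTrue(text.contains("זיכרון עבודה"));
    }
}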

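import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 2.x package; in 1.8.x the class is org.apache.pdfbox.util.PDFTextStripper
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFUtil {
    // Returns the trimmed text of the PDF at the given URL, or null if it is encrypted
    public static String readPDF(String url) throws IOException {
        // try-with-resources closes the stream and the document on every path,
        // including the encrypted case, which the original version leaked
        try (InputStream in = new URL(url).openStream();
             PDDocument document = PDDocument.load(in)) {
            if (document.isEncrypted()) {
                return null;
            }
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document).trim();
        }
    }
}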

Attaching screenshots that show the missing character. On the left is how the page http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf appears in Chrome. On the right is the result of PDF text extraction using the code above.
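
To confirm from the extracted string itself (rather than from screenshots) which characters are affected, a small diagnostic along these lines might help. This is only a sketch using the plain JDK: the class name HebrewFinalLetterCheck and the method countFinalForms are hypothetical names introduced here, not part of PDFBox. It counts the five Hebrew final-form letters in a string, so it would show whether only "ן" is dropped or whether other final forms are missing too.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: counts Hebrew final-form letters in a string.
public class HebrewFinalLetterCheck {

    // Final kaf, final mem, final nun, final pe, final tsadi
    private static final int[] FINAL_FORMS = {0x05DA, 0x05DD, 0x05DF, 0x05E3, 0x05E5};

    // Returns a map from "U+XXXX" to the number of times that code point occurs in text.
    public static Map<String, Integer> countFinalForms(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int cp : FINAL_FORMS) {
            counts.put(String.format("U+%04X", cp), 0);
        }
        text.codePoints().forEach(cp -> {
            String key = String.format("U+%04X", cp);
            if (counts.containsKey(key)) {
                counts.merge(key, 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // The phrase from the failing assertion, written with Unicode escapes
        // so the result does not depend on the source file encoding.
        String expected = "\u05D6\u05D9\u05DB\u05E8\u05D5\u05DF \u05E2\u05D1\u05D5\u05D3\u05D4";
        System.out.println(countFinalForms(expected));
    }
}
```

Passing the string returned by PDFUtil.readPDF(url) to countFinalForms and comparing the counts with what is visible in the PDF would narrow the problem down to a single code point or to final forms in general.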

Tags: java pdf pdfbox

0 answers
