Extracting Hebrew text from PDF using Apache PDFBox

Published 2019-07-24 17:22

Question:

The code below extracts the Hebrew text from http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf, but the Hebrew character "ן" is missing from the output. All other text seems to be extracted correctly. Any ideas?

import org.junit.Assert;
import org.junit.Test;

public class TestPDFUtil {
    @Test
    public void testHebrewPDF() throws Exception {
        String url = "http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf";
        String text = PDFUtil.readPDF(url);
        System.out.println(text);
        // Fails as long as "ן" is missing from the extracted text.
        Assert.assertTrue(text.contains("זיכרון עבודה"));
    }
}

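import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
// Assumes PDFBox 2.x; in 1.8.x the class is org.apache.pdfbox.util.PDFTextStripper.
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFUtil {
    public static String readPDF(String url) throws IOException {
        // try-with-resources closes the stream and the document on every path,
        // including the encrypted case and when an exception is thrown.
        try (InputStream in = new URL(url).openStream();
             PDDocument document = PDDocument.load(in)) {
            if (document.isEncrypted()) {
                return null; // encrypted documents are not handled
            }
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(document).trim();
        }
    }
}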

Attached are screenshots that show the missing character. On the left is how the page http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf appears in Chrome. On the right is the result of PDF text extraction using the code above.
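
To narrow this down beyond eyeballing screenshots, it may help to check whether "ן" (U+05DF) is dropped entirely or comes out as some other code point. Below is a minimal diagnostic sketch in plain Java, not part of PDFBox: the class name HebrewCharAudit and its methods are hypothetical helpers, and the sample strings in main stand in for the expected phrase and the extractor output. It only audits the basic Hebrew block (U+0590 to U+05FF), which is an assumption; a PDF font could also map glyphs to presentation forms outside that range.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for diagnosing missing Hebrew characters in extracted text.
public class HebrewCharAudit {

    // Assumption: only the basic Hebrew block (U+0590..U+05FF) is audited.
    static boolean isHebrew(int cp) {
        return cp >= 0x0590 && cp <= 0x05FF;
    }

    /** Counts how often each Hebrew code point occurs in the text. */
    public static Map<Integer, Integer> hebrewHistogram(String text) {
        Map<Integer, Integer> counts = new TreeMap<>();
        text.codePoints()
            .filter(HebrewCharAudit::isHebrew)
            .forEach(cp -> counts.merge(cp, 1, Integer::sum));
        return counts;
    }

    /** Returns, per code point, how many occurrences in expected are absent from actual. */
    public static Map<Integer, Integer> shortfall(String expected, String actual) {
        Map<Integer, Integer> want = hebrewHistogram(expected);
        Map<Integer, Integer> got = hebrewHistogram(actual);
        Map<Integer, Integer> diff = new TreeMap<>();
        want.forEach((cp, n) -> {
            int have = got.getOrDefault(cp, 0);
            if (have < n) {
                diff.put(cp, n - have);
            }
        });
        return diff;
    }

    public static void main(String[] args) {
        // The phrase asserted in the test, written with Unicode escapes
        // so the file compiles regardless of source encoding.
        String expected = "\u05D6\u05D9\u05DB\u05E8\u05D5\u05DF \u05E2\u05D1\u05D5\u05D3\u05D4";
        // Stand-in for the extractor output: same phrase without U+05DF.
        String extracted = "\u05D6\u05D9\u05DB\u05E8\u05D5 \u05E2\u05D1\u05D5\u05D3\u05D4";

        shortfall(expected, extracted).forEach((cp, n) ->
            System.out.println(String.format("U+%04X missing %d time(s)", cp, n)));
    }
}
```

In practice the second argument would be the string returned by PDFUtil.readPDF, and the first a passage copied by hand from the rendered page. If U+05DF shows up as missing while no unexpected code point appears in its place, the glyph is probably being dropped rather than remapped.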