The code below extracts the Hebrew text from http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf, but the Hebrew character "ן" (final nun) is missing from the output. All other text seems to be extracted correctly. Any ideas?
import org.junit.Assert; // assuming JUnit 4
import org.junit.Test;

public class TestPDFUtil {
    @Test
    public void testHebrewPDF() throws Exception {
        String url = "http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf";
        String text = PDFUtil.readPDF(url);
        System.out.println(text);
        // This phrase contains the final nun that goes missing, so the assertion fails
        Assert.assertTrue(text.contains("זיכרון עבודה"));
    }
}
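
import java.io.IOException;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.x package (assumed); in 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFUtil {
    public static String readPDF(String url) throws IOException {
        PDDocument document = PDDocument.load(new URL(url).openStream());
        try {
            if (document.isEncrypted()) {
                return null; // encrypted documents are not handled here
            }
            return new PDFTextStripper().getText(document).trim();
        } finally {
            document.close(); // always release the document, even when it is encrypted
        }
    }
}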
I am attaching screenshots that show the missing character. On the left is how the page http://www.language-brain.com/journal/docs/Gvion_Friedmann_LanguageBrain7_frigvi.pdf appears in Chrome. On the right is the result of PDF text extraction using the code above.
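
To double-check that "ן" really is the only letter affected, here is a minimal sketch of a diagnostic I could run on the extracted string. `HebrewLetterCheck` and `missingHebrewLetters` are hypothetical names of my own, not part of PDFBox; the helper only scans the Unicode range U+05D0 to U+05EA (the 22 Hebrew letters plus the 5 final forms) and says nothing about why a letter is absent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper (not a PDFBox class): reports which Hebrew
// letters in the range U+05D0 (alef) to U+05EA (tav) never occur in a string.
final class HebrewLetterCheck {

    private HebrewLetterCheck() {
        // static utility, no instances
    }

    static List<Character> missingHebrewLetters(String text) {
        List<Character> missing = new ArrayList<>();
        // The range covers the 22 base letters and the 5 final forms,
        // including final nun at U+05DF.
        for (char c = '\u05D0'; c <= '\u05EA'; c++) {
            if (text.indexOf(c) < 0) {
                missing.add(c);
            }
        }
        return missing;
    }
}
```

Calling `HebrewLetterCheck.missingHebrewLetters(text)` on the result of `PDFUtil.readPDF(url)` should return a list containing only "ן" if the extraction drops just that character, assuming the document uses every other letter at least once.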