English text extracted using itextpdf is not understandable

I'm trying to extract English text from a PDF and print it to the console. The extraction is done with the itextpdf API, using the PdfTextExtractor class. The text I'm getting is not understandable; maybe I'm facing some language issue. My intent is to find a particular text within a PDF and replace it with another string. I started by parsing the file to find the string. The following code snippet is my string extractor:

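import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import java.io.FileOutputStream;

Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
    new FileOutputStream(OUTPUTFILE));
document.open();
PdfReader reader = new PdfReader(input);
int n = reader.getNumberOfPages();
// Go through all pages (page numbers start at 1)
for (int i = 1; i <= n; i++) {
    // No strategy argument is passed, so the library's default is used
    String str = PdfTextExtractor.getTextFromPage(reader, i);
    System.out.println(str);
}
document.close();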

But the output I get on the console is not understandable, even though the text in the PDF is in English.

Output:

t cotenn dna o mntoafinir yales r ni et h layhcsip Amgteu end y Retila m eysts w tih eth erss p wlli e erefcern emsyst o f et h se. ru I n tioi, dnda etseh orpvedi eddda e ulav o t taw h s i oelbssip hwti se vdcie ollaw na s tiouquibu cacess o t latoutenxc e rpap dna t ilagid ottennc olae n ewnh ey th krwo tofoi. nmirna ni soitaoli n mor f chea e. roth s iTh s i a cel ra csea ewerh " eth lweoh is ermo nath eth ms u fo sti

rtasp ".

Can anybody please help me find a way to extract the English text exactly as it appears in the source PDF? Any help will be highly appreciated.

Tags: java parsing pdf pdf-generation itext

1 answer

Summer. ? 凉城

#2 · 2019-02-25 13:27

If you want the text to be ordered based on its position on the page, you need to introduce a specific strategy, such as the LocationTextExtractionStrategy:

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Order the extracted text by its position on the page
    String str = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
    System.out.println(str);
}

The LocationTextExtractionStrategy sometimes results in odd sentences, more specifically if the letters 'dance' on the page (the baseline of the glyphs differs for text on the same line). In that case, you can try the SimpleTextExtractionStrategy which will return the text in the order in which it appears in the PDF syntax content stream.
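To make the difference between the two orderings concrete, here is a minimal, library-free sketch. It is not iText's implementation: the Chunk type, the method names, and the tolerance parameter are all hypothetical and exist only for this illustration. It shows text read in content stream order, text sorted by position, and how a slightly shifted baseline can split one visual line when no tolerance is allowed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: none of these types belong to iText.
class LocationOrderSketch {

    /** A piece of text plus the start of its baseline (PDF coordinates: y grows upward). */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates chunks in the order they were drawn (content stream order). */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /**
     * Orders chunks top to bottom, groups chunks whose baselines lie within
     * the tolerance of the first chunk of a line, then sorts each line left to right.
     */
    static String byLocation(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - c.y) <= tolerance) {
                last.add(c);
            } else {
                List<Chunk> line = new ArrayList<>();
                line.add(c);
                lines.add(line);
            }
        }

        StringBuilder sb = new StringBuilder();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            if (sb.length() > 0) {
                sb.append('\n');
            }
            for (Chunk c : line) {
                sb.append(c.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Drawn out of reading order, but on the same baseline.
        List<Chunk> shuffled = Arrays.asList(
            new Chunk("world", 40, 700),
            new Chunk("Hello ", 10, 700));
        System.out.println(inStreamOrder(shuffled));
        System.out.println(byLocation(shuffled, 0.0));

        // "Dancing" text: the middle chunk sits half a unit higher.
        List<Chunk> dancing = Arrays.asList(
            new Chunk("Hel", 10, 700),
            new Chunk("lo ", 25, 700.5),
            new Chunk("world", 40, 700));
        System.out.println(byLocation(dancing, 0.0));
        System.out.println(byLocation(dancing, 1.0));
    }
}
```

In this sketch, sorting by position repairs the shuffled chunks, while the dancing chunks only come out as one line when the tolerance is larger than the baseline shift. For such a page, reading in stream order may be the better choice, provided the producer wrote the text in reading order.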
