How to get Unicode of the characters from PDF usin

Published 2019-02-25 03:52


在下西门庆


Question:

I am using Apache PDFBox with Java to parse PDFs and extract all the information from them. Text extraction works fine for English only. For other languages I get only special characters. For example, extracting the Arabic character ش gives the string :"? when printed. It works correctly when I change my computer's "Region and language" setting from English to Arabic, so I think extracting the Unicode values of the characters would solve this problem. How can I get the Unicode values of the characters from a PDF, or is there another way to solve this?

Answer 1:

Try changing the JVM's default locale. Done from within your Java program, this should be equivalent to changing the OS setting.
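A minimal sketch of that suggestion, assuming Arabic is the target language. The class name LocaleSketch and the helper useArabicLocale are made up for illustration. Note that the character encoding used by System.out is a separate setting from the locale, so the sketch also wraps System.out in a UTF-8 PrintStream; whether the character then displays correctly still depends on the console itself.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Locale;

public class LocaleSketch {

    // Hypothetical helper: switches the JVM-wide default locale to Arabic.
    static void useArabicLocale() {
        Locale.setDefault(Locale.forLanguageTag("ar"));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        useArabicLocale();

        // The locale does not control the output encoding, so encode explicitly as UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");

        // \u0634 is the Arabic letter from the question, written as an escape
        // so the source file encoding does not matter.
        out.println("\u0634");
    }
}
```

If the output is still garbled, the console's own encoding is the likely cause rather than the extracted String.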

Answer 2:

http://grepcode.com/file/repo1.maven.org/maven2/org.apache.pdfbox/pdfbox/1.6.0/org/apache/pdfbox/util/PDFText2HTML.java

The private method String escape(String chars) in that class converts characters to their Unicode values.
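The same idea can be sketched without PDFBox: once the text has been extracted into a Java String, each character's Unicode code point can be read directly from it. The class and method names below (UnicodeDump, toCodePoints) are made up for illustration and are not part of PDFBox. The input is assumed to be whatever String the text extraction returned.

```java
public class UnicodeDump {

    // Returns the code points of the text in "U+XXXX" notation, separated by spaces.
    static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", codePoint));
            // Advance by 2 chars for characters stored as surrogate pairs.
            i += Character.charCount(codePoint);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Arabic letter from the question, written as an escape.
        System.out.println(toCodePoints("\u0634")); // prints U+0634
    }
}
```

Because the result contains only ASCII, it prints the same way regardless of the console's encoding, which makes it useful for checking whether the extracted String is correct before it is displayed.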

Tags: java pdf unicode pdfbox
