Java code reads UTF-8 text incorrectly

Posted 2019-07-11 08:02

I'm having a problem reading UTF-8 characters in my code (running on Eclipse).

I have a text file with a few lines in it, for example:

אך  1234

NOTE: There is a \t before the word, and the word should appear on the left, the number on the right... I don't know how to reverse them here, sorry.

That is, a Hebrew word and then a number.

I need to separate the word from the number somehow. I tried this:

        BufferedReader br = new BufferedReader(new FileReader(text));
        String content;

        while ((content = br.readLine()) != null) 
        {
            String delims = "[ ]+";
            String[] tokens = content.split(delims);
        }

The problem is that for some reason, the code reads content (the first line in the file) as follows:

אך\t1234

...meaning that the space isn't in its correct place.

I suppose I could tokenize the text using the \t, but I'm not sure I should do it, as the file isn't being read correctly...

Does anyone have any idea why this happens?

Thanks so much :-)

Tags: java utf-8 tokenize hebrew

1 Answer

Root（大扎）

#2 · 2019-07-11 08:19

I think your pattern is matching a space when the file actually has a tab there?
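One way to check which whitespace character is really in the line is to print each character's code point instead of trusting how the console renders mixed right-to-left and left-to-right text. This is only a rough sketch; the class name `WhitespaceDump` and the method `dump` are made up for illustration and are not from the question.

```java
class WhitespaceDump {
    // Returns every char of the line as a U+XXXX code, separated by spaces,
    // so a tab (U+0009) can be told apart from a space (U+0020).
    static String dump(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) line.charAt(i)));
        }
        return sb.toString();
    }
}
```

If the output for a line contains U+0009 between the Hebrew letters and the digits, the separator is a tab and the file was read in the right logical order.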

Can you try this:

BufferedReader br = new BufferedReader(new FileReader(text));
String content;

while ((content = br.readLine()) != null) 
{
    String delims = "\\s";
    String[] tokens = content.split(delims);
}
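
For what it's worth, here is a slightly fuller sketch of the same idea. It assumes the file really is UTF-8 and that every non-blank line is a word followed by a number. The class name `TokenizeLines` and the methods `splitLine` and `readTokens` are hypothetical. Two things differ from the snippet above, and both are my own additions: the charset is passed explicitly (the `FileReader` constructors used in the question take no charset argument and have traditionally decoded with the platform default), and the line is trimmed and split on `\\s+`, so a leading tab or a run of several whitespace characters does not produce empty tokens.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class TokenizeLines {
    // Trims leading/trailing whitespace, then splits on runs of spaces or tabs.
    static String[] splitLine(String line) {
        return line.trim().split("\\s+");
    }

    // Reads the file as UTF-8 and tokenizes every non-blank line.
    static List<String[]> readTokens(Path file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String content;
            while ((content = br.readLine()) != null) {
                if (!content.trim().isEmpty()) {
                    rows.add(splitLine(content));
                }
            }
        }
        return rows;
    }
}
```

With a line such as a tab, the Hebrew word, another tab, and then 1234, `splitLine` should give two tokens: the word and the number.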
