I'm trying to match a search string against file names during a recursive directory search on Android. The problem is that the characters are Japanese, and the match fails in some cases. For example, the search string I want to match at the start of the file name is “呼ぶ”. When I print the file names from file.getName(), they look right, e.g. the file name printed to the console starts with “呼ぶ”. But when I test against the search string, e.g. fileName.startsWith("呼ぶ"), it doesn't match.
It turns out that when I print the substring of the file name being compared, the second character is different: the word is “呼ふ” instead of “呼ぶ”. If I extract the bytes and print them in hex, the last byte is off by 1, presumably the difference between “ぶ” and “ふ”.
Here is the code used to show the difference:
String name = soundFile.getName();
String string1 = question.kanji;
Log.d(TAG, "searching for : s1:" + question.kanji + " + " + question.hiragana + " + " + question.english);
Log.d(TAG, "name is: " + name);
Log.d(TAG, "question.kanaji.length(): " + question.kanji.length());
Log.d(TAG, "question.hiragana.length(): " + question.hiragana.length());
String compareStart = name.substring(0, string1.length() );
Log.d(TAG, "string1.length(): " + string1.length());
Log.d(TAG, "compareStart.length(): " + compareStart.length());
byte[] nameUTF8 = null;
byte[] s1UTF8 = null;
byte[] csUTF8 = null;
nameUTF8 = name.getBytes();
s1UTF8 = string1.getBytes();
csUTF8 = compareStart.getBytes();
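// Note: this logs s1UTF8.length under the nameUTF8 label, so the "6" in the output below is not the file name's byte count.
Log.d(TAG, "nameUTF8.length: " + s1UTF8.length);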
Log.d(TAG, "s1UTF8.length: " + s1UTF8.length);
Log.d(TAG, "csUTF8.length: " + csUTF8.length);
for (int i = 0; i < s1UTF8.length; i++) {
Log.d(TAG, "s1UTF8[i]: " + Integer.toString(s1UTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < csUTF8.length; i++) {
Log.d(TAG, "csUTF8[i]: " + Integer.toString(csUTF8[i] & 0xff, 16).toUpperCase());
}
for (int i = 0; i < nameUTF8.length; i++) {
Log.d(TAG, "nameUTF8[i]: " + Integer.toString(nameUTF8[i] & 0xff, 16).toUpperCase());
}
The partial output is as follows:
D/AnswerView(12078): searching for : s1:呼ぶ + よぶ + to call out,to invite
D/AnswerView(12078): name is: 呼ぶ よぶ to call out,to invite.mp3
D/AnswerView(12078): question.kanaji.length(): 2
D/AnswerView(12078): question.hiragana.length(): 2
D/AnswerView(12078): string1: 呼ぶ
D/AnswerView(12078): compareStart: 呼ふ
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): string1.length(): 2
D/AnswerView(12078): compareStart.length(): 2
D/AnswerView(12078): nameUTF8.length: 6
D/AnswerView(12078): s1UTF8.length: 6
D/AnswerView(12078): csUTF8.length: 6
D/AnswerView(12078): s1UTF8[i]: E5
D/AnswerView(12078): s1UTF8[i]: 91
D/AnswerView(12078): s1UTF8[i]: BC
D/AnswerView(12078): s1UTF8[i]: E3
D/AnswerView(12078): s1UTF8[i]: 81
D/AnswerView(12078): s1UTF8[i]: B6
D/AnswerView(12078): csUTF8[i]: E5
D/AnswerView(12078): csUTF8[i]: 91
D/AnswerView(12078): csUTF8[i]: BC
D/AnswerView(12078): csUTF8[i]: E3
D/AnswerView(12078): csUTF8[i]: 81
D/AnswerView(12078): csUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E5
D/AnswerView(12078): nameUTF8[i]: 91
D/AnswerView(12078): nameUTF8[i]: BC
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 81
D/AnswerView(12078): nameUTF8[i]: B5
D/AnswerView(12078): nameUTF8[i]: E3
D/AnswerView(12078): nameUTF8[i]: 82
D/AnswerView(12078): nameUTF8[i]: 99
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
D/AnswerView(12078): nameUTF8[i]: 20
This shows that the sixth byte of the substring extracted from the file name, and of the file name itself, is "B5" instead of "B6" as it is in the search string. However, the printed file name is displayed correctly. I'm stumped. Why is the file name displayed correctly on the console when the underlying characters are different? And why are there three additional non-blank bytes (E3 82 99) near the beginning of the file name, right after the B5, which somehow aren't needed in the search string to represent the "ぶ" character?
Here you are using a length taken from string1 to slice name. As Tom has pointed out, the strings are in different normalization forms, so their lengths need not coincide.
The problem looks to be one of normalization forms. I know that on a Mac, for example, the filesystem is always in NFD. But the string you posted is in NFC. Watch:
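A minimal Java sketch of the difference, assuming java.text.Normalizer is available on your platform. The class name NormalizationDemo and the helper utf8Hex are made up for illustration and are not from the question's code. It decomposes the search string and prints the UTF-8 bytes of both forms, which should line up with the s1UTF8 and nameUTF8 dumps in the question:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class NormalizationDemo {
    // Hypothetical helper: space-separated hex dump of the UTF-8 encoding of s.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 呼ぶ written with the precomposed U+3076 (NFC), as in the search string.
        String nfc = "\u547C\u3076";
        // NFD splits U+3076 into U+3075 (ふ) + U+3099 (combining voiced sound mark).
        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);

        System.out.println(nfc.length() + ": " + utf8Hex(nfc)); // 2: E5 91 BC E3 81 B6
        System.out.println(nfd.length() + ": " + utf8Hex(nfd)); // 3: E5 91 BC E3 81 B5 E3 82 99
        System.out.println(nfd.startsWith(nfc));                // false
    }
}
```

The extra E3 82 99 is the combining mark U+3099. The console renders ふ followed by that mark as ぶ, which would explain why the printed name looks right while substring(0, 2) cuts the mark off and leaves 呼ふ.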
So I think you are going to have to think about converting to NFD.
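One way this could look, as a sketch rather than a drop-in fix: normalize both the file name and the search string to the same form before comparing. The class NormalizedMatch and the method startsWithNormalized are hypothetical names, not part of the question's code.

```java
import java.text.Normalizer;

final class NormalizedMatch {
    // Hypothetical helper: prefix test after bringing both strings to NFD.
    static boolean startsWithNormalized(String fileName, String prefix) {
        String normalizedName = Normalizer.normalize(fileName, Normalizer.Form.NFD);
        String normalizedPrefix = Normalizer.normalize(prefix, Normalizer.Form.NFD);
        return normalizedName.startsWith(normalizedPrefix);
    }

    public static void main(String[] args) {
        // File name as it appears to come back from the filesystem (decomposed ぶ).
        String fileName = "\u547C\u3075\u3099 \u3088\u3075\u3099 to call out,to invite.mp3";
        // Search string as typed in the source or data file (precomposed ぶ).
        String search = "\u547C\u3076";

        System.out.println(fileName.startsWith(search));             // false
        System.out.println(startsWithNormalized(fileName, search));  // true
    }
}
```

Normalizing both sides to NFC instead should work equally well for this comparison, and it avoids a bare 呼ふ prefix matching a name that starts with 呼ぶ, which can happen once both are decomposed.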
BTW, that U+547C CJK code point happens to be this from the Unihan database: