The introductory guide to Julia, Learn Julia in Y Minutes, discourages users from indexing UTF8 strings:
# Some strings can be indexed like an array of characters
"This is a string"[1] # => 'T' # Julia indexes from 1
# However, this will not work well for UTF8 strings,
# so iterating over strings is recommended (map, for loops, etc).
Why is indexing such strings discouraged? What specifically about the structure of UTF-8 strings makes indexing error-prone? Is this a Julia-specific pitfall, or does it extend to all languages with UTF-8 string support?
Because in UTF-8 a character is not always encoded in a single byte.
Take for example the German-language string böse (evil). The bytes of this string in UTF-8 encoding are 0x62 0xc3 0xb6 0x73 0x65. As you can see, the umlaut ö requires 2 bytes (0xc3 0xb6).
Now if you directly index this UTF-8 encoded string, "böse"[4] will give you s and not e, because string indices are byte offsets rather than character positions.
However, you can use the string as an iterable object in Julia, which yields whole characters one at a time.
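The behaviour described above can be sketched as follows. This is a minimal illustration assuming Julia 1.x (functions such as codeunits and the three-argument nextind are not available in the 0.4 release linked below); the variable names are only for illustration.

```julia
# "böse", written with an escape so the example does not depend on how the
# source file encodes or normalizes the umlaut
s = "b\u00f6se"

# Raw UTF-8 bytes: the umlaut occupies two of them (0xc3 0xb6)
bytes = collect(codeunits(s))      # UInt8[0x62, 0xc3, 0xb6, 0x73, 0x65]

nbytes = ncodeunits(s)             # 5 bytes
nchars = length(s)                 # 4 characters

# Valid indices are the byte offsets at which a character starts
valid = collect(eachindex(s))      # [1, 2, 4, 5]

# Index 4 is the start of 's'; index 3 falls inside 'ö' and throws
fourth_byte_index = s[4]           # 's'
threw = try
    s[3]
    false
catch err
    err isa StringIndexError
end

# Iteration yields whole characters regardless of their byte width
chars = [c for c in s]             # ['b', 'ö', 's', 'e']

# If the n-th character is really needed, ask for its index first
fourth_char = s[nextind(s, 0, 4)]  # 'e'
```

Note that nextind has to walk the string from the start, so looking up the n-th character this way takes time proportional to n; plain iteration is the cheaper choice when every character is visited anyway.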
And since you asked: no, byte-indexing issues with UTF-8 strings are not specific to Julia. They arise in any language that exposes UTF-8 strings by byte offset.
Recommendation for further reading:
http://docs.julialang.org/en/release-0.4/manual/strings/#unicode-and-utf-8
Just to expand upon Scott Jones' comment: Julia also offers fixed-width strings, similar to std::wstring from C++, which allow convenient indexing. They now live in https://github.com/JuliaStrings/LegacyStrings.jl
One needs to install the package first with Pkg.add("LegacyStrings").
UTF32String would be the best choice for most use cases. To construct a UTF32String from a normal string: s2 = utf32(s).
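If installing a package is not an option, a rough stand-in is to collect the string into a Vector{Char}. This is my own suggestion, not something LegacyStrings provides, and it is not equivalent to UTF32String. It gives the same fixed-width, position-based indexing using only Base. A minimal sketch, assuming Julia 1.x:

```julia
# "böse", written with an escape to avoid depending on source-file encoding
s = "b\u00f6se"

# Each Char is a fixed-width value, so indexing is by character position
chars = collect(s)          # Vector{Char} with 4 elements

fourth = chars[4]           # 'e' (whereas s[4] is 's')
count = length(chars)       # 4

# Convert back to an ordinary String when done
roundtrip = String(chars)
```

The trade-off is memory: every character takes four bytes in the vector, even plain ASCII ones.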