How can I get the number of characters of a string in Go?
For example, if I have a string "hello", the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters, so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package. It returns the number of runes in the string. For example, the byte length of "World" is 6 when written in Chinese ("世界"), but its rune count is 2.
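To make the difference between bytes and runes concrete, here is a minimal sketch using only the standard library. The helper name byteAndRuneCount is made up for this illustration; len and utf8.RuneCountInString are the real calls.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount (hypothetical helper) returns the byte length
// and the rune count of s.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"hello", "£", "世界"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
	// "hello": 5 bytes, 5 runes
	// "£": 2 bytes, 1 runes
	// "世界": 6 bytes, 2 runes
}
```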
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting. len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized (fixes issue 24923). The compiler detects the len([]rune(string)) pattern automatically and replaces it with an equivalent for r := range s loop.
Stefan Steiger points to the blog post "Text normalization in Go":
What is a character?
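Using that package (golang.org/x/text/unicode/norm) and its Iter type, the actual number of "characters" can be obtained by iterating over the string and counting one per normalization segment. The answer does this with the Unicode Normalization form NFKD ("Compatibility Decomposition").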
It depends a lot on your definition of what a "character" is. If "a rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while traversing the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
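As a rough sketch of "counting while traversing": a range loop over a string decodes one rune per iteration, so the count can be kept alongside whatever other work the loop does. The function scan and the choice of counting letters as the "other work" are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// scan (hypothetical helper) walks s once, counting all runes and,
// as a stand-in for real per-rune processing, the runes that are letters.
func scan(s string) (nRunes, nLetters int) {
	for _, r := range s { // each iteration decodes one UTF-8 sequence
		nRunes++
		if unicode.IsLetter(r) {
			nLetters++
		}
	}
	return nRunes, nLetters
}

func main() {
	runes, letters := scan("£5, 世界")
	fmt.Println(runes, letters) // 6 2
}
```

The rune count here matches len([]rune(s)), but the string is decoded only once for both results.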
There is a way to get the count of runes without any packages, by converting the string to []rune: len([]rune(YOUR_STRING)).
If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unbounded. If you want to eliminate extremely long sequences, check whether they conform to the stream-safe text format.
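A minimal sketch of both ideas, assuming that skipping combining marks is good enough for your data. This is only an approximation of grapheme clusters: it does not implement the full Unicode segmentation rules (emoji ZWJ sequences, Hangul jamo and similar cases are not handled), and it does not check the stream-safe format. The names approxClusterCount and checkLength, and the limits passed to them, are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"unicode"
	"unicode/utf8"
)

// approxClusterCount approximates a grapheme-cluster count by counting
// every rune that is not a combining mark (Unicode category M), plus a
// leading mark if the string starts with one.
func approxClusterCount(s string) int {
	n := 0
	first := true
	for _, r := range s {
		if first || !unicode.IsMark(r) {
			n++
		}
		first = false
	}
	return n
}

// checkLength rejects input that is not valid UTF-8 or that exceeds the
// given byte or rune limits (both limits are illustrative parameters).
func checkLength(s string, maxBytes, maxRunes int) error {
	if !utf8.ValidString(s) {
		return errors.New("input is not valid UTF-8")
	}
	if len(s) > maxBytes {
		return fmt.Errorf("input is %d bytes, limit is %d", len(s), maxBytes)
	}
	if n := utf8.RuneCountInString(s); n > maxRunes {
		return fmt.Errorf("input is %d runes, limit is %d", n, maxRunes)
	}
	return nil
}

func main() {
	s := "e\u0301" // "e" followed by U+0301 COMBINING ACUTE ACCENT
	fmt.Println(len(s), utf8.RuneCountInString(s), approxClusterCount(s)) // 3 2 1
	fmt.Println(checkLength(s, 16, 8))        // <nil>
	fmt.Println(checkLength(s, 16, 1) != nil) // true
}
```

Checking bytes and runes as well as clusters matters because a single cluster can carry an arbitrary number of combining marks.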