How to find the “real” number of characters in a U

Posted 2020-07-22 18:28

I know how to find the length of a non-Unicode string in R:

nchar("ABC")

(Thanks to everyone who answered the question here: How to find the length of a string in R?)

But what about Unicode strings?

How do I find both the length in bytes and the number of characters (runes, symbols) of a Unicode string in R?

Tags: r string unicode string-length

1 answer

This account has been banned

Reply #2 · 2020-07-22 18:59

You can use nchar for both: set the type argument to "chars" for the number of characters and to "bytes" for the number of bytes:

nchar("bi\u00dfchen", type="chars")
#[1] 7
nchar("bi\u00dfchen", type="bytes")
#[1] 8
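To see where the extra byte comes from, the string can be split into code points and into raw bytes. This is a minimal sketch using only base R; it assumes the string is stored as UTF-8, which the enc2utf8 call is there to ensure.

```r
x <- enc2utf8("bi\u00dfchen")   # make sure the string is stored as UTF-8

# One integer per Unicode code point: 7 values, matching type = "chars"
code_points <- utf8ToInt(x)
length(code_points)

# One raw value per stored byte: 8 values, matching type = "bytes"
raw_bytes <- charToRaw(x)
length(raw_bytes)

# The sharp s (U+00DF) is the only character here that needs two bytes (c3 9f)
sharp_s <- charToRaw(enc2utf8("\u00df"))
sharp_s
```

Every other character in the example is ASCII and takes one byte, so the byte count is one higher than the character count.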

The help page (?nchar) explains how the string size is computed:

The ‘size’ of a character string can be measured in one of three ways (corresponding to the type argument):
bytes
The number of bytes needed to store the string (plus in C a final terminator which is not counted).
chars
The number of human-readable characters.
width
The number of columns cat will use to print the string in a monospaced font. The same as chars if this cannot be calculated.
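The three measures can be compared side by side. The sketch below uses illustrative strings that are not from the question (plain ASCII, the example above, and one CJK character); the width column may depend on the R version and locale, so no output is shown for it.

```r
# Illustrative strings: plain ASCII, the example above, and a CJK character
x <- c("ABC", "bi\u00dfchen", "\u4e2d")

# nchar is vectorised, so each call returns one value per string
sizes <- data.frame(
  chars = nchar(x, type = "chars"),
  bytes = nchar(x, type = "bytes"),
  width = nchar(x, type = "width")
)
sizes
```

For the CJK character, chars is 1 while bytes is 3, because its UTF-8 encoding takes three bytes.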

If you instead want the number of "symbols" in the escaped form of a string that may or may not contain Unicode (i.e. without interpreting the Unicode escape, so that \u00df counts as six symbols), you can use the function stri_escape_unicode from the stringi package:

library(stringi)
nchar(stri_escape_unicode("bi\u00dfchen")) # same as stri_length(stri_escape_unicode("bi\u00dfchen"))
# [1] 12
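If the "real" number of characters means what a reader sees on screen, combining marks matter: a count of code points treats a letter and its combining accent as two characters. The following is a sketch, assuming the stringi package is installed; the example string is illustrative and not taken from the question.

```r
library(stringi)

# "e" followed by a combining acute accent (U+0301): renders as one symbol
x <- "e\u0301"

n_code_points <- stri_length(x)                               # 2 code points
n_bytes       <- stri_numbytes(x)                             # 3 bytes in UTF-8
n_graphemes   <- stri_count_boundaries(x, type = "character") # 1 visible symbol

# Normalising to NFC composes the pair into the single code point U+00E9
n_after_nfc <- stri_length(stri_trans_nfc(x))                 # 1

c(code_points = n_code_points, bytes = n_bytes,
  graphemes = n_graphemes, after_nfc = n_after_nfc)
```

Which of these counts is the right one depends on the use: bytes for storage, code points for indexing, grapheme clusters for what a person would call a character.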
