All the Whitespace Characters? Is it language inde

Posted 2019-01-25 07:55

I was wondering whether all languages treat the same set of characters as whitespace characters, or whether there is any variation.

Can anyone provide a complete list of whitespace characters, separating out the ones that can be entered from a keyboard? If the set differs between languages, the differences and the reasons for them would be even more helpful. An answer for any language is welcome, as long as you don't bring up Whitespace or its variants (if any). I certainly don't want a complete list for a language like Whitespace :)

3 answers
Fickle 薄情
#2 · 2019-01-25 08:19

If you're looking for an efficient method, I use the following code:

(c <= 32 && c >= 0) || c == 127;

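0 to 31 are the control characters, 32 is the SPACE character, and 127 is the DEL character (ESC is 27). Note that this treats every ASCII control character as whitespace, not just tabs and line breaks. This works for all the character sets I know, including UTF-8, although it does not catch non-ASCII spaces such as U+00A0.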
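As a rough illustration, the range check above can be written out in Python and run over the ASCII range to see what it accepts. The function name `is_ascii_blank` is made up for this sketch; it is not from any library.

```python
def is_ascii_blank(c: int) -> bool:
    """Range check from the answer: codes 0..32 and code 127."""
    return (0 <= c <= 32) or c == 127

# Collect every ASCII value the check accepts.
accepted = [c for c in range(128) if is_ascii_blank(c)]

print(len(accepted))              # 34 values: 0..32 plus 127
print(is_ascii_blank(ord("\t")))  # True
print(is_ascii_blank(0x1B))       # True: ESC (27) is accepted as well
print(is_ascii_blank(0xA0))       # False: NO-BREAK SPACE is outside the range
```

The check is cheap, but it is only as precise as "control character or space": whether NUL or BEL should count as whitespace depends on what you do with the result.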

Viruses.
#3 · 2019-01-25 08:26

Whether a particular character counts as whitespace depends on the character set being used. That said, a programming language can still make its own definition of what constitutes whitespace.

Most modern languages use the Unicode character set, which does have a definition for space separator characters. Any character in the Zs category is a space separator.

You can see the complete list here. In addition you can grep for ;Zs; in the official Unicode Character Database to see those characters. Note that the number of characters in this category may grow as new Unicode versions come into existence, so I will not say how many such characters exist, nor even attempt to list them.
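If you would rather not grep the database by hand, a minimal sketch along the same lines is to ask Python's standard `unicodedata` module for every code point whose general category is Zs. The helper name `space_separators` is hypothetical, and the result reflects whatever Unicode version your Python build ships with, so the list may differ between installations.

```python
import sys
import unicodedata

def space_separators():
    """Return all code points whose general category is Zs."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]

zs = space_separators()

# The Unicode version this interpreter was built against.
print("Unicode version:", unicodedata.unidata_version)
for cp in zs:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```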

In addition to the Zs general category, Unicode also defines character properties, among them a White_Space property. As of Unicode 7.0, characters with this property include all of the characters with category Zs plus a few others: the control characters U+0009, U+000A, U+000B, U+000C, U+000D, and U+0085, and the line and paragraph separators U+2028 and U+2029. You can find all of the characters with the White_Space property at Unicode.org here.

Many languages, even modern ones, have special symbols for regular expressions such as \s or [:space:], but beware: in many implementations these match only certain characters from the ASCII set by default. Generally these are restricted to

  • SPACE (codepoint 32, U+0020)
  • TAB (codepoint 9, U+0009)
  • LINE FEED (codepoint 10, U+000A)
  • LINE TABULATION (codepoint 11, U+000B)
  • FORM FEED (codepoint 12, U+000C)
  • CARRIAGE RETURN (codepoint 13, U+000D)

Now this list is interesting because it contains not only a space separator (Zs), but also characters from the "Control, Other" category (Cc). This is what a programming language generally means when it uses the term "whitespace."
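How far \s reaches depends on the engine and its flags, so it is worth checking in the language you actually use. As one example, in Python 3 the `re` module matches Unicode whitespace with \s on string patterns by default, and the `re.ASCII` flag restricts it to the six classic characters. A small sketch (the name `CLASSIC` is just a label chosen here):

```python
import re

# The six "classic" whitespace characters from the list above.
CLASSIC = [" ", "\t", "\n", "\v", "\f", "\r"]

ascii_ws = re.compile(r"\s", re.ASCII)  # ASCII-only whitespace
unicode_ws = re.compile(r"\s")          # Unicode-aware default for str patterns

print(all(ascii_ws.fullmatch(ch) for ch in CLASSIC))  # True
print(bool(ascii_ws.fullmatch("\u00a0")))             # False: NO-BREAK SPACE
print(bool(unicode_ws.fullmatch("\u00a0")))           # True
```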

So probably the best way to answer your question for a "complete list" of whitespace characters is to say "it depends on what you mean." If you mean "classic whitespace" it is probably the six characters listed above. If you want something more "modern" then it is the union of those six with all the characters from the Unicode category Zs. Then again, you might need to look within other blocks, too (e.g., U+1361 as mentioned in a comment to your question by Jerry Coffin). It also depends on what you intend to do with these space characters.
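The "modern" definition described above (the six classics plus everything in Zs) can be sketched in a few lines of Python. This is only one possible reading of "whitespace", and `is_whitespace` is a name invented for the example.

```python
import unicodedata

# The six classic characters: SPACE, TAB, LF, VT, FF, CR.
CLASSIC = {0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D}

def is_whitespace(ch: str) -> bool:
    """True for the six classics or any space separator (category Zs)."""
    return ord(ch) in CLASSIC or unicodedata.category(ch) == "Zs"

print(is_whitespace("\t"))      # True: classic control character
print(is_whitespace("\u3000"))  # True: IDEOGRAPHIC SPACE is in Zs
print(is_whitespace("\u1361"))  # False: ETHIOPIC WORDSPACE is punctuation (Po)
print(is_whitespace("\u0085"))  # False: NEXT LINE is a control character, not a classic
```

Characters like U+1361 and U+0085 show why the answer depends on intent: a word splitter and a source-code tokenizer may reasonably disagree about them.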

Now one last thing: Unicode doesn't have every character in the world yet; it keeps growing. It is possible that someday new space characters will be added. For now, category Zs + the classics are your best bet.

劳资没心,怎么记你
#4 · 2019-01-25 08:31

Ray's answer gives great information, but unfortunately it is lacking 3 whitespace characters. :(

Update: Ray has since updated his already good answer to be even more thorough and complete. I didn't know it was as complicated as it is. :) For a 'simple' answer, I provide the following. But it's very useful to understand the extra complications that he explains very nicely.

There are currently 25 Unicode whitespace characters with the following hexadecimal 'code points':

9, A, B, C, D, 20, 85, A0,
1680, 2000, 2001, 2002, 2003, 2004, 2005, 2006,
2007, 2008, 2009, 200A, 2028, 2029, 202F, 205F,
3000

Corresponding decimal values are:

9, 10, 11, 12, 13, 32, 133, 160,
5760, 8192, 8193, 8194, 8195, 8196, 8197, 8198,
8199, 8200, 8201, 8202, 8232, 8233, 8239, 8287,
12288
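As a quick sanity check, the hexadecimal list can be converted to decimal and broken down by Unicode general category with Python's standard library. The variable names are only for this sketch.

```python
import unicodedata
from collections import Counter

# The 25 hexadecimal code points listed above.
HEX = """
9 A B C D 20 85 A0
1680 2000 2001 2002 2003 2004 2005 2006
2007 2008 2009 200A 2028 2029 202F 205F
3000
""".split()

code_points = [int(h, 16) for h in HEX]
print(len(code_points))  # 25
print(code_points)       # the decimal values

# Count how many of the 25 fall into each general category.
by_category = Counter(unicodedata.category(chr(cp)) for cp in code_points)
print(dict(by_category))
```

The breakdown comes out as 17 space separators (Zs), 6 control characters (Cc), one line separator (Zl), and one paragraph separator (Zp).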

My reference is the official Unicode website itself, where I searched for "whitespace". So as the expression goes, I got it "from the horse's mouth". If you go to http://unicode.org/charts/uca/ you get two frames, with a navigation frame on the left, where you can click the third link under the 'Help' link, which is the 'Whitespace' link. Unfortunately, the displayed frame is not what I'd call very 'user-friendly'. But the frame that does display gives a raw list of the hexadecimal values of every Unicode whitespace character. I believe that page is the most 'official' answer one can get.
