What is the exact difference between Windows-1252(

Posted 2019-01-31 18:45

We are hosting PHP apps on a Debian-based LAMP installation. Everything is quite OK in terms of performance, administration and management. However, being somewhat new devs (we're still in high school), we've run into some problems with character encoding for Western charsets.

After a lot of research I have come to the conclusion that the information online is somewhat confusing. It talks about Windows-1252 being ANSI and totally ISO-8859-1 compatible.

So anyway, what is the difference between Windows-1252(1/3/4) and ISO-8859-1? And where does ANSI come into this?

What encoding should we use on our Debian servers (and workstations) in order to ensure that clients get all information in the intended way and that we don't lose any chars on the way?

4 answers
Aperson
#2 · 2019-01-31 19:13

I'd like to answer this in a more web-like manner, and to do so we need a little history. Joel Spolsky has written a very good introductory article on the absolute minimum every dev should know about Unicode and character encoding. Bear with me here because this is going to be a somewhat looong answer. :)

For the history I'll point to some quotes from there: (Thank you very much Joel! :) )

The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes.

And all was good, assuming you were an English speaker. Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255.

So now "OEM character sets" were distributed with PCs, and these were still all different and incompatible. And, to our contemporary amazement, it was all fine! They didn't have the Internet back then, and people rarely exchanged files between systems with different locales.

Joel goes on to say:

In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages.

And this is how the "Windows code pages" were born, eventually. They were actually "parented" by the DOS code pages. And then Unicode was born! :) UTF-8 is "another system for storing your string of Unicode code points", and in it "every code point from 0-127 is stored in a single byte", which makes it identical to ASCII in that range. I will not go into any more specifics of Unicode and UTF-8, but you should read up on the BOM, endianness and character encoding in general.
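To make the quoted claim about UTF-8 concrete, here is a minimal Python sketch (the sample characters and variable names are only illustrative). It shows that code points 0-127 take one byte in UTF-8, while characters outside ASCII take more.

```python
# Sample characters, written as escapes so the source file encoding does not matter:
# "A" (U+0041), "é" (U+00E9), "€" (U+20AC).
samples = ["A", "\u00e9", "\u20ac"]

# Number of bytes each character occupies once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} needs {n} byte(s) in UTF-8")

# Pure ASCII text is byte-for-byte identical in ASCII and in UTF-8.
ascii_text = "Hello"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```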

On "the ANSI conspiracy", Microsoft actually admits the mislabeling of Windows-1252 in a glossary of terms:

The so-called Windows character set (WinLatin1, or Windows code page 1252, to be exact) uses some of those positions for printable characters. Thus, the Windows character set is NOT identical with ISO 8859-1. The Windows character set is often called "ANSI character set", but this is SERIOUSLY MISLEADING. It has NOT been approved by ANSI.

So, ANSI, when referring to Windows character sets, is not ANSI-certified! :)

As Jukka pointed out (credits go to you for the nice answer):

Windows-1252 differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (so-called C1 Controls), whereas in Windows-1252, some of the codes there are assigned to printable characters (mostly punctuation characters), others are left undefined.
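The difference in the 0x80 to 0x9F range can be observed directly. The following is a small Python sketch, assuming Python's built-in "iso-8859-1" and "cp1252" codecs as stand-ins for the two encodings; the helper name `describe` is made up for this example.

```python
def describe(byte_value):
    """Decode one byte as ISO-8859-1 and as Windows-1252.

    Returns (latin1_char, cp1252_char); cp1252_char is None when
    Python's cp1252 codec leaves that byte undefined.
    """
    raw = bytes([byte_value])
    latin1 = raw.decode("iso-8859-1")  # never fails: byte N maps to U+00NN
    try:
        cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None
    return latin1, cp1252


# Bytes in 0x80..0x9F that Python's cp1252 codec treats as undefined.
undefined = [b for b in range(0x80, 0xA0) if describe(b)[1] is None]

# Outside that range (0xA0..0xFF) the two encodings agree.
same_above = all(describe(b)[0] == describe(b)[1] for b in range(0xA0, 0x100))

latin1_char, cp1252_char = describe(0x80)
print(f"0x80 as ISO-8859-1:   U+{ord(latin1_char):04X} (a C1 control)")
print(f"0x80 as Windows-1252: U+{ord(cp1252_char):04X} (the euro sign)")
print("Undefined in cp1252:", [hex(b) for b in undefined])
```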

However my personal opinion and technical understanding is that both Windows-1252 and ISO-8859-1 ARE NOT WEB ENCODINGS! :) So:

  • For web pages, please use UTF-8 as the encoding for the content. So store data as UTF-8 and "spit it out" with the HTTP header: Content-Type: text/html; charset=utf-8.

    There is also a thing called the HTML content-type meta tag, which goes inside <head>: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. When browsers encounter this tag, they start from the beginning of the HTML document again so that they can reinterpret the document in the declared encoding. This should happen only if there is no 'Content-Type' header.

  • Use other specific encodings if the users of your system need files generated from it. For example, some Western users may need Excel-generated files, or CSVs in Windows-1252. If this is the case, encode the text in that encoding, store it on the file system and serve it as a downloadable file.

  • There is another thing to be aware of in the design of HTTP: The content-encoding distribution mechanism should work like this.

    I. The client requests a web page in specific content types and encodings via the 'Accept' and 'Accept-Charset' request headers.

    II. Then the server (or web application) returns the content trans-coded to that encoding and character set.

This is NOT THE CASE in most modern web apps. What actually happens is that web applications serve (force on the client) content as UTF-8. And this works because browsers interpret received documents based on the response headers and not on what they actually expected.
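As a rough illustration of serving UTF-8 with the matching header, here is a minimal WSGI application in Python. This is a sketch, not the setup from the question (which uses PHP); the names `PAGE` and `app` and the page content are invented for the example.

```python
# Hypothetical page content; the meta tag repeats what the HTTP header declares.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>UTF-8 demo</title>
</head>
<body><p>Caf\u00e9 cr\u00e8me: 3 \u20ac</p></body>
</html>
"""


def app(environ, start_response):
    """WSGI callable that always answers with a UTF-8 encoded HTML page."""
    body = PAGE.encode("utf-8")  # the bytes on the wire are UTF-8 ...
    headers = [
        # ... and the header says so, which is what browsers act on first.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should be enough; note that the application ignores any 'Accept-Charset' request header, which matches the behaviour described above.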

We should all go Unicode, so please, please, please use UTF-8 to distribute your content wherever possible and most of all applicable. Or else the elders of the Internet will haunt you! :)
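Where UTF-8 is not applicable, such as the Windows-1252 CSV export mentioned in the list above, the conversion can be done at the point of export. The following Python sketch assumes a semicolon delimiter and replaces unrepresentable characters with "?"; both choices, and the function name `rows_to_cp1252_csv`, are assumptions made for the example.

```python
import csv
import io


def rows_to_cp1252_csv(rows):
    """Serialize rows to CSV text, then encode the result as Windows-1252."""
    text_buffer = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(text_buffer, delimiter=";").writerows(rows)
    # errors="replace" turns characters that Windows-1252 lacks into "?".
    return text_buffer.getvalue().encode("cp1252", errors="replace")


data = rows_to_cp1252_csv([
    ["name", "price"],
    ["Caf\u00e9 cr\u00e8me", "3 \u20ac"],   # representable in Windows-1252
    ["\u65e5\u672c\u8336", "5 \u20ac"],      # CJK text is not representable
])
print(data)
```

Whether silently replacing characters is acceptable depends on the data; `errors="strict"` would raise an exception instead, which makes the loss visible.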

P.S. Some more nice articles on using MS Windows characters in Web Pages can be found here and here.

乱世女痞
#3 · 2019-01-31 19:18

This table gives an overview of the differences. It shows all characters that Windows-1252 defines in the range 0x80…0x9F, where ISO-8859-1 and ISO-8859-15 have no printable characters:

        │  …0  │  …1  │  …2  │  …3  │  …4  │  …5  │  …6  │  …7  │  …8  │  …9  │  …A  │  …B  │  …C  │  …D  │  …E  │  …F  │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     8… │   €  │      │   ‚  │   ƒ  │   „  │   …  │   †  │   ‡  │   ˆ  │   ‰  │   Š  │   ‹  │   Œ  │      │   Ž  │      │
Unicode │ 20AC │      │ 201A │ 0192 │ 201E │ 2026 │ 2020 │ 2021 │ 02C6 │ 2030 │ 0160 │ 2039 │ 0152 │      │ 017D │      │
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     9… │      │  ‘   │   ’  │   “  │   ”  │   •  │   –  │   —  │   ˜  │   ™  │   š  │   ›  │   œ  │      │   ž  │   Ÿ  │
Unicode │      │ 2018 │ 2019 │ 201C │ 201D │ 2022 │ 2013 │ 2014 │ 02DC │ 2122 │ 0161 │ 203A │ 0153 │      │ 017E │ 0178 │

Unlike in Windows-1252, the range 0x80…0x9F is used for control codes in ISO-8859-1.

This table shows the differences between Windows-1252, ISO-8859-1 and ISO-8859-15:

Character    │    € │   Š │   š │   Ž │   ž │   Œ │   œ │   Ÿ │  ¤ │  ¦ │  ¨ │  ´ │  ¸ │  ¼ │  ½ │  ¾ │
───────────────────────────────────────────────────────────────────────────────────────────────────────
ISO 8859-1   │    – │   – │   – │   – │   – │   – │   – │   – │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
ISO 8859-15  │   A4 │  A6 │  A8 │  B4 │  B8 │  BC │  BD │  BE │  – │  – │  – │  – │  – │  – │  – │  – │
Windows-1252 │   80 │  8A │  9A │  8E │  9E │  8C │  9C │  9F │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
Unicode      │ 20AC │ 160 │ 161 │ 17D │ 17E │ 152 │ 153 │ 178 │ A4 │ A6 │ A8 │ B4 │ B8 │ BC │ BD │ BE │
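The second table can be reproduced with a few lines of Python, assuming the built-in "iso-8859-1", "iso-8859-15" and "cp1252" codecs match the encodings discussed here. The helper `code_in` is a name invented for this sketch; it returns None where the table shows a dash.

```python
# The sixteen characters from the table, in the same order.
CHARS = "\u20ac\u0160\u0161\u017d\u017e\u0152\u0153\u0178\u00a4\u00a6\u00a8\u00b4\u00b8\u00bc\u00bd\u00be"
ENCODINGS = ["iso-8859-1", "iso-8859-15", "cp1252"]


def code_in(ch, encoding):
    """Return the single-byte code of ch as two hex digits, or None if absent."""
    try:
        return f"{ch.encode(encoding)[0]:02X}"
    except UnicodeEncodeError:
        return None


for ch in CHARS:
    codes = ", ".join(f"{enc}={code_in(ch, enc)}" for enc in ENCODINGS)
    print(f"U+{ord(ch):04X}: {codes}")
```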
我欲成王,谁敢阻挡
#4 · 2019-01-31 19:22

The most authoritative reference to meanings of character encoding names is the IANA registry Character Sets.

Windows-1252 is commonly known as Windows Latin 1 or as Windows West European or something like that. It differs from ISO Latin 1, also known as ISO-8859-1 as a character encoding, in that the code range 0x80 to 0x9F is reserved for control characters in ISO-8859-1 (so-called C1 Controls), whereas in Windows-1252, some of the codes there are assigned to printable characters (mostly punctuation characters), others are left undefined.

ANSI comes in here as a misnomer. Microsoft once submitted Windows-1252 to the American National Standards Institute (ANSI) to be adopted as a standard; the proposal was rejected, but Microsoft still calls their code “ANSI”. For further confusion, they may use “ANSI” for different encodings (basically, the “native 8-bit encoding” of a Windows installation).

In the web context, declaring ISO-8859-1 will be taken as if you declared Windows-1252. The reason is that C1 Controls are not used, or useful, on the web, whereas the added characters are often used, even on pages mislabelled as ISO-8859-1. So in practical terms it does not matter which one you declare.

There might still be some browsers that actually interpret data as ISO-8859-1 if declared so, but they must be very rare (the last I remember seeing was a version of Opera about ten years ago).

You do not describe what problems you have encountered. The most common cause of problems seems to be that data is actually UTF-8 encoded but declared as ISO-8859-1 (or Windows-1252), or vice versa. This becomes a real problem to web page authors if a server forces a Content-Type header declaring a character encoding and it is one that they cannot deal with in their authoring environment (or don’t know how to do that).
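The mismatch described above is easy to reproduce. In this Python sketch (the sample string and variable names are arbitrary), UTF-8 bytes read as Windows-1252 produce the familiar garbage, while Windows-1252 bytes read as UTF-8 fail outright.

```python
original = "caf\u00e9 \u20ac"  # "café €"

# Case 1: data is UTF-8 but is interpreted as Windows-1252.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print("UTF-8 read as Windows-1252:", misread)

# As long as nothing was lost, the damage can be undone by reversing the steps.
repaired = misread.encode("cp1252").decode("utf-8")

# Case 2: data is Windows-1252 but is interpreted as UTF-8.
cp1252_bytes = original.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")
    reverse_failed = False
except UnicodeDecodeError:
    # A lone 0xE9 or 0x80 is not a valid UTF-8 sequence.
    reverse_failed = True
print("Windows-1252 read as UTF-8 fails:", reverse_failed)
```

A browser would not raise an error in the second case; it would typically show replacement characters instead.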

SAY GOODBYE
#5 · 2019-01-31 19:35

In countries with an English/Latin alphabet, e.g. UK/US/France/Germany and others, "ANSI" refers to the Windows-1252 encoding. https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx

Windows-1252 and ISO-8859-1 are very similar. They differ only in the 32 code positions from 128 to 159.

In Windows-1252, the characters from 128 to 159 are used for some useful characters such as the Euro symbol.

In ISO-8859-1 these characters are mapped to control characters which are useless in HTML.

So one suggestion is to check whether 128 is the euro symbol; if it is, it's Windows-1252.

The codes from 128 to 159 are not used for printable characters in ISO-8859-1, but many browsers will display the characters from the Windows-1252 character set instead of nothing.
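The suggestion above can be turned into a crude heuristic: if a byte string contains anything in the range 128 to 159, it is more plausibly Windows-1252 than ISO-8859-1. The Python sketch below assumes the input is already known to be one of these two encodings; it cannot tell them apart from UTF-8 or other code pages, and the function name is invented for the example.

```python
C1_RANGE = range(0x80, 0xA0)  # 128..159: printable in Windows-1252, controls in ISO-8859-1


def guess_single_byte_encoding(data: bytes) -> str:
    """Guess between Windows-1252 and ISO-8859-1 for data known to be one of them."""
    if any(b in C1_RANGE for b in data):
        # C1 control characters are very unlikely in ordinary text.
        return "windows-1252"
    # Without bytes in that range, both encodings decode to the same text.
    return "iso-8859-1 or windows-1252"


print(guess_single_byte_encoding(b"10 \x80"))    # contains byte 128
print(guess_single_byte_encoding(b"caf\xe9"))    # only bytes outside 128..159
```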

These 2 links list them both.

http://www.w3schools.com/charsets/ref_html_ansi.asp

http://www.w3schools.com/charsets/ref_html_8859.asp

Some comments were very useful and I amended my post accordingly based on them.

Chenfeng points out: On Windows, "ANSI" refers to the system codepage specified by the locale, whatever that is (Arabic/Chinese/Cyrillic/Vietnamese/...). It does not [necessarily] refer to Windows-1252. You can test this by changing your locale and then using notepad.exe to save a text file in "ANSI". According to this MS documentation, there are 14 different "ANSI" code pages: https://docs.microsoft.com/en-us/windows/desktop/intl/code-page-identifiers

Wernfriend points to https://web.archive.org/web/20170916200715/http://www.microsoft.com:80/resources/msdn/goglobal/default.mspx and notes that in the USA, codepage 437 is the 'OEM codepage' (see the OEM column), and that the OEM codepage is the one used by the cmd prompt. He also points out, based on that webpage, that in many countries that do not use an English/Latin alphabet, ANSI is not Windows-1252. I notice, for example, that Hebrew ANSI uses 1255 (the Hebrew OEM codepage is 862).
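To see why "ANSI" is ambiguous, the same byte can be decoded with several of the code pages mentioned in these comments. This Python sketch uses Python's built-in codecs for those code pages; the choice of byte 0xE0 and of these five code pages is just for illustration.

```python
byte = b"\xe0"

# A few "ANSI" code pages plus the US OEM code page 437.
code_pages = ["cp1252", "cp1251", "cp1253", "cp1255", "cp437"]

# The same byte yields a different character under each code page.
decoded = {cp: byte.decode(cp) for cp in code_pages}

for cp, ch in decoded.items():
    print(f"{cp}: U+{ord(ch):04X} {ch}")
```

So a file saved as "ANSI" on one machine can display different text on a machine with another locale, even though the bytes are unchanged.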
