Google Calculator Thousands Separator Special Char

Posted 2019-07-10 02:11

Question:

NOTE: For more answers related to this, please see Special Characters in Google Calculator

I noticed when grabbing the return value for a Google Calculator calculation, the thousands place is separated by a rather odd character. It is not simply a space.

Let's take the example of converting $4,000 USD to GBP.

If you visit the following Google link:

http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2 497.81441 British pounds",error: "",icc: true}

This looks reasonable, and the thousands place appears to be separated by a whitespace character.

However, if you enter the following into your command line:

curl -s "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"

You'll note that the response is:

{lhs: "4000 U.S. dollars",rhs: "2?498.28243 British pounds",error: "",icc: true}

That question mark (?) is a replacement character. What is going on?

AppleScript returns a different replacement character:

{lhs: "4000 U.S. dollars",rhs: "2†498.28243 British pounds",error: "",icc: true}

I am also getting from other sources:

{lhs: "4000 U.S. dollars",rhs: "2�498.28243 British pounds",error: "",icc: true}

It turns out that � is the proper Unicode replacement character, U+FFFD (decimal 65533).

Can anyone give me insight into what Google is passing me?

Answer 1:

It's a non-breaking space, U+00A0. It's to ensure that the number won't get broken at the end of a line.

Google returns the correct encoding (UTF-8) however:

Content-Type: text/html; charset=UTF-8

so ...

  • if it comes out as a normal space (U+0020) instead (Firefox does that when copying, stupidly enough), then the application converts certain characters to lookalikes, perhaps to fit a restricted code page such as ASCII.
  • if there is a question mark, then the response was read correctly as Unicode, but some later processing step uses a legacy character set that lacks the character, so it gets replaced.
  • if there is a replacement character � (U+FFFD), then the response was likely read as UTF-8, converted into a legacy character set that contains the character (e.g. Latin 1, where it is the single byte A0) and then re-interpreted as UTF-8, in which a lone A0 byte is invalid.
  • if there is a totally different character, such as your dagger (†), then I'd guess the response is read correctly as Unicode, converted to a character set that contains the character, and then re-interpreted in another character set. A quick look at the Mac Roman code page reveals that A0 indeed maps to †.
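
The four cases above can be reproduced offline. Here is a minimal Python sketch, assuming the separator really is U+00A0; the variable names are illustrative, and NFKC normalization stands in for "conversion to a lookalike" only as one example, not necessarily what Firefox does:

```python
import unicodedata

NBSP = "\u00a0"  # the separator identified in this answer

# Case 1: a compatibility normalization maps NBSP to a plain space.
as_space = unicodedata.normalize("NFKC", NBSP)

# Case 2: encoding to a charset without NBSP substitutes "?".
as_question = NBSP.encode("ascii", errors="replace").decode("ascii")

# In Latin-1, NBSP is the single byte 0xA0.
latin1_bytes = NBSP.encode("latin-1")

# Case 3: a lone 0xA0 is not valid UTF-8, so a lenient decoder emits U+FFFD.
as_replacement = latin1_bytes.decode("utf-8", errors="replace")

# Case 4: the same byte read as Mac Roman is the dagger.
as_dagger = latin1_bytes.decode("mac_roman")

for label, ch in [("space", as_space), ("question mark", as_question),
                  ("replacement", as_replacement), ("dagger", as_dagger)]:
    print(f"{label}: U+{ord(ch):04X}")
```

Whichever of these code points shows up in your output hints at where in the pipeline the mis-decoding happens.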

Needless to say, some part of whatever you use to process that response seems to be horribly broken with regard to Unicode. That's something I'd hope wouldn't happen that often in this millennium, but apparently it still does.


I figured out what it was by fiddling around in PowerShell a bit:

PS Home:\> $wc = new-object net.webclient
PS Home:\> $x = $wc.downloadstring('http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp')
PS Home:\> [char[]]$x|%{"$_ - " + +$_}
...
" - 34
2 - 50
  - 160
4 - 52
9 - 57
8 - 56
. - 46
2 - 50
8 - 56
2 - 50
4 - 52
...

Also a quick look at the response headers revealed that the encoding is set correctly.
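
The same inspection can be done in other languages. A rough Python equivalent of the PowerShell one-liner, run on a hard-coded sample rather than a live download (the `rhs` string and the `describe` helper are assumptions for illustration), might look like this:

```python
import unicodedata

# Sample value copied from the output above, with the separator
# written as an explicit escape so it survives copy and paste.
rhs = "2\u00a0498.28243 British pounds"

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN")) for ch in text]

for ch, code, name in describe(rhs[:5]):
    print(f"{ch!r} - {code} - {name}")
```

Printing the Unicode name alongside the code point makes an invisible character such as the no-break space easy to spot.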



Answer 2:

According to my tests with curl in Terminal on OS X, switching the character encoding in the Terminal preferences shows that the response is encoded as ISO Latin 1 (ISO-8859-1).

When I set the encoding to UTF-8, I get "2?498.28243".

When I set the encoding to MacRoman, I get "2†498.28243".

First solution: send a user agent from any browser (Safari on OS X 10.6.8 in this example)

curl -s -A 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.48 (KHTML, like Gecko) Version/5.1 Safari/534.48' 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp'

Second solution: use iconv

curl -s 'http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp' |  iconv -t utf8 -f  iso-8859-1
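
If the response is processed in a script rather than in the shell, the same conversion can be done there. This is a minimal Python sketch, assuming the body arrives as ISO-8859-1 bytes as observed above; the `raw` literal is a hard-coded stand-in for the downloaded response, and the string splitting is only an illustrative way to pull out the number:

```python
# Assumed raw response body: ISO-8859-1 bytes, with 0xA0 as the separator.
raw = b'{lhs: "4000 U.S. dollars",rhs: "2\xa0498.28243 British pounds",error: "",icc: true}'

# Equivalent of: iconv -f iso-8859-1 -t utf8
text = raw.decode("iso-8859-1")
utf8_bytes = text.encode("utf-8")

# Once decoded, the separator can be dropped to recover the number.
# Splitting on an explicit " " leaves the no-break space in place.
amount_text = text.split('rhs: "')[1].split(" ")[0]
amount = float(amount_text.replace("\u00a0", ""))
print(amount)
```

Decoding with the charset the bytes are actually in, before touching the separator, avoids having to match whatever replacement character a wrong decoding would produce.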


Answer 3:

Try this AppleScript, which replaces the dagger with a comma:

set myUrl to quoted form of "http://www.google.com/ig/calculator?hl=en&q=4000%20usd%20to%20gbp"
set xxx to do shell script "curl " & myUrl & " | sed 's/[†]/,/'"