BOM in server response screws up json parsing

I'm trying to write a Python script that posts some JSON to a web server and gets some JSON back. I patched together a few different examples on StackOverflow, and I think I have something that's mostly working.

import urllib2
import json

url = "http://foo.com/API.svc/SomeMethod"
payload = json.dumps( {'inputs': ['red', 'blue', 'green']} )
headers = {"Content-type": "application/json;"}

req = urllib2.Request(url, payload, headers)
f = urllib2.urlopen(req)
response = f.read()
f.close()

data = json.loads(response) # <-- Crashes

The last line throws an exception:

ValueError: No JSON object could be decoded

When I look at response, I see valid JSON, but the first few characters are a BOM:

>>> response
'\xef\xbb\xbf[\r\n  {\r\n    ... Valid JSON here

So, if I manually strip out the first three bytes:

data = json.loads(response[3::])

Everything works, and the response is parsed into a Python object.

My Question:

It seems kinda silly that json barfs when you give it a BOM. Is there anything different I can do with urllib or the json library to let it know this is a UTF8 string and to handle it as such? I don't want to manually strip out the first 3 bytes.

Tags: python json urllib2 urllib

3 Answers

时光不老，我们不散

#2 · 2019-02-06 21:00

In case I'm not the only one who hit this problem while using the requests module instead of urllib2, here is a solution that works in Python 2.6 as well as 3.3:

import requests
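# url, my_dict, user and password are assumed to be defined elsewhere;
# 'pass' is a reserved word and cannot be used as a variable name
r = requests.get(url, params=my_dict, auth=(user, password))
print(r.headers['content-type'])  # 'application/json; charset=utf8'
if r.text.startswith(u'\ufeff'):  # BOM: bytes \xef\xbb\xbf in UTF-8
    r.encoding = 'utf-8-sig'  # r.text is now decoded without the BOM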
print(r.json())

可以哭但决不认输i

#3 · 2019-02-06 21:03

You should probably yell at whoever's running this service, because a BOM on UTF-8 text serves no purpose. The BOM exists to disambiguate byte order, and UTF-8 has no byte order to disambiguate: it is a stream of single bytes in a fixed sequence, so there is no little-endian or big-endian variant.
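
To illustrate the byte-order point, here is a small sketch (Python 3 syntax, standard library only) that encodes the BOM character U+FEFF in UTF-16 and in UTF-8. The variable names are just for illustration.

```python
import codecs

BOM = u"\ufeff"  # the byte order mark is the single code point U+FEFF

# In UTF-16 the two bytes swap depending on endianness, so a leading
# BOM tells the reader which order the rest of the text uses.
utf16_le = BOM.encode("utf-16-le")
utf16_be = BOM.encode("utf-16-be")

# In UTF-8 there is only one possible byte sequence for U+FEFF,
# so the mark carries no byte-order information.
utf8 = BOM.encode("utf-8")

print(utf16_le, utf16_be, utf8)
```

The UTF-8 result is the same three bytes that show up at the start of the response in the question.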

That said, ideally you should decode bytes before doing anything else with them. Luckily, Python has a codec that recognizes and removes the BOM: utf-8-sig.

>>> '\xef\xbb\xbffoo'.decode('utf-8-sig')
u'foo'

So you just need:

data = json.loads(response.decode('utf-8-sig'))
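
As a rough sketch of why this is safe to apply unconditionally: 'utf-8-sig' strips a leading BOM when there is one and otherwise decodes like plain 'utf-8'. The helper name loads_utf8 and the sample payloads below are made up for illustration, and the code uses Python 3 syntax.

```python
import json


def loads_utf8(raw):
    """Parse JSON from UTF-8 bytes, with or without a leading BOM."""
    # 'utf-8-sig' removes a leading BOM if present; input without
    # a BOM is decoded exactly as with plain 'utf-8'.
    return json.loads(raw.decode("utf-8-sig"))


# Hypothetical payloads shaped like the response in the question.
with_bom = b'\xef\xbb\xbf[\r\n  {"color": "red"}\r\n]'
without_bom = b'[\r\n  {"color": "red"}\r\n]'

print(loads_utf8(with_bom))
print(loads_utf8(without_bom))
```

In Python 3, where urllib2 became urllib.request, f.read() also returns bytes, so the same decode call applies there.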

放荡不羁爱自由

#4 · 2019-02-06 21:06

Since I lack enough reputation for a comment, I'll write an answer instead.

I usually encounter this problem on the server side, when I need to leave the underlying Stream of a StreamWriter open. The overload that lets you leave the underlying Stream open also requires an encoding (which will be UTF-8 in most cases). Here's how to do it without emitting the BOM.

/* Since Encoding.UTF8 (the one you'd normally use in those cases) **emits**
 * the BOM, use what's below instead!
 */

// The UTF8Encoding constructor takes a flag that enables / disables the BOM in the output
UTF8Encoding encoding = new UTF8Encoding(false);

using (MemoryStream ms = new MemoryStream())
using (StreamWriter sw = new StreamWriter(ms, encoding, 4096, true))
using (JsonTextWriter jtw = new JsonTextWriter(sw))
{
    serializer.Serialize(jtw, myObject);
}
