I'm dealing with an API that unfortunately is returning malformed (or "weirdly formed," rather -- thanks @fjarri) JSON, but on the positive side I think it may be an opportunity for me to learn something about recursion as well as JSON. It's for an app I use to log my workouts; I'm trying to make a backup script.
I can receive the JSON fine, but even after requests.get(api_url).json() (or json.loads(requests.get(api_url).text)), one of the values is still a JSON-encoded string. Luckily, I can just json.loads() the string and it properly decodes to a dict. The specific key is predictable: timezone_id, whereas its value varies (because data has been logged in multiple timezones). For example, after decoding, it might be dumped to file as "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}", or loaded into Python as 'timezone_id': '{"name":"America/Denver","seconds":"-21600"}'.
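To make the double encoding concrete, here is a small sketch. The raw payload below is a trimmed, made-up sample rather than a real response from the API, and the variable names are only for illustration.

```python
import json

# Hypothetical, trimmed response body: timezone_id holds a JSON document
# that has itself been encoded as a JSON string.
raw = r'{"timestamp": 1353193745000, "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}"}'

first_pass = json.loads(raw)
# After one decode the value is still a str, not a dict.
print(type(first_pass["timezone_id"]).__name__)

# Decoding just that value a second time yields the nested dict.
timezone = json.loads(first_pass["timezone_id"])
print(timezone["name"], timezone["seconds"])
```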
The problem is that I'm using this API to retrieve a fair amount of data, which has several layers of dicts and lists, and the double-encoded timezone_id values occur at multiple levels.
Here's my work so far with some example data, but it seems like I'm pretty far off base.
#! /usr/bin/env python3
import json
from pprint import pprint
my_input = r"""{
"hasMore": false,
"checkins": [
{
"timestamp": 1353193745000,
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"privacy_groups": [
"private"
],
"meta": {
"client_version": "3.0",
"uuid": "fake_UUID"
},
"client_id": "fake_client_id",
"workout_name": "Workout (Nov 17, 2012)",
"fitness_workout_json": {
"exercise_logs": [
{
"timestamp": 1353195716000,
"type": "exercise_log",
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"workout_log_uuid": "fake_UUID"
},
{
"timestamp": 1353195340000,
"type": "exercise_log",
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"workout_log_uuid": "fake_UUID"
}
]
},
"workout_uuid": ""
},
{
"timestamp": 1354485615000,
"user_id": "fake_ID",
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"privacy_groups": [
"private"
],
"meta": {
"uuid": "fake_UUID"
},
"created": 1372023457376,
"workout_name": "Workout (Dec 02, 2012)",
"fitness_workout_json": {
"exercise_logs": [
{
"timestamp": 1354485615000,
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"workout_log_uuid": "fake_UUID"
},
{
"timestamp": 1354485584000,
"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
"workout_log_uuid": "fake_UUID"
}
]
},
"workout_uuid": ""
}]}"""

def recurse(obj):
    if isinstance(obj, list):
        for item in obj:
            return recurse(item)
    if isinstance(obj, dict):
        for k, v in obj.items():
            if isinstance(v, str):
                try:
                    v = json.loads(v)
                except ValueError:
                    pass
                obj.update({k: v})
            elif isinstance(v, (dict, list)):
                return recurse(v)

pprint(json.loads(my_input, object_hook=recurse))
Any suggestions for a good way to json.loads() all those double-encoded values without changing the rest of the object? Many thanks in advance!
This post seems to be a good reference: Modifying Deeply-Nested Structures
Edit: This was flagged as a possible duplicate of this question -- I think it's fairly different, as I've already demonstrated that using json.loads() was not working. The solution ended up requiring an object_hook, which I've never had to use when decoding JSON and is not addressed in the prior question.
So, the object_hook in the JSON loader is going to be called each time the loader finishes constructing a dictionary. That is, the first thing it is called on is the innermost dictionary, working outwards.

The dictionary that the object_hook callback is given is replaced by what that function returns.

So, you don't need to recurse yourself. The loader is giving you access to the innermost things first by its nature.
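The call order can be checked with a small experiment. This is only an illustrative sketch: recording_hook, seen, and the sample document are made-up names and data, not part of the question's payload.

```python
import json

seen = []  # keys of each dict the hook receives, in call order

def recording_hook(obj):
    # json.loads only passes dicts to object_hook, never lists or strings.
    seen.append(sorted(obj))
    # Whatever is returned here replaces the dict in the decoded result.
    return obj

doc = '{"outer": {"middle": {"inner": 1}}, "items": [{"a": 1}, {"b": 2}]}'
result = json.loads(doc, object_hook=recording_hook)

# Nested dicts are reported before the dicts that contain them.
for keys in seen:
    print(keys)
```

The deepest dictionary shows up first and the top-level one last, which is why the hook itself never has to descend into its values.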
I think this will work for you:
It seems to have the effect I think you're looking for when I test it.
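A hook of this kind can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the name decode_nested_strings and the trimmed sample payload are made up for illustration, and it assumes that any string value that parses as JSON should be replaced by its decoded value.

```python
import json

def decode_nested_strings(obj):
    # Hypothetical hook: json.loads calls it once per decoded dict.
    for key, value in list(obj.items()):
        if isinstance(value, str):
            try:
                obj[key] = json.loads(value)
            except ValueError:
                pass  # not valid JSON, so keep the original string
    return obj  # the returned dict replaces the one json.loads built

# Trimmed stand-in for the question's data, with timezone_id at two levels.
sample = r"""{
  "checkins": [
    {
      "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
      "workout_name": "Workout (Nov 17, 2012)",
      "fitness_workout_json": {
        "exercise_logs": [
          {"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}"}
        ]
      }
    }
  ]
}"""

decoded = json.loads(sample, object_hook=decode_nested_strings)
checkin = decoded["checkins"][0]
print(checkin["timezone_id"]["name"])
print(checkin["fitness_workout_json"]["exercise_logs"][0]["timezone_id"])
```

There is no explicit recursion: the loader hands every dictionary to the hook, so both levels of timezone_id are covered.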
I probably wouldn't try to decode every string value -- I would strategically call json.loads() only where you expect a double-encoded JSON object to exist. If you try to decode every string, you might accidentally decode something that is supposed to be a string (like the string "12345" when that is intended to be a string returned by the API).

Also, your existing function is more complicated than it needs to be; it might work as-is if you always returned obj (whether you update its contents or not).

Your main issue is that your object_hook function should not be recursing. json.loads() takes care of the recursing itself and calls your function every time it finds a dictionary (aka obj will always be a dictionary). So instead you just want to modify the problematic keys and return the dict -- this should do what you are looking for:

However, if you know the problematic (double-encoded) entries always take on a specific form (e.g. key == 'timezone_id') it is probably safer to just call json.loads() on those keys only, as Matt Anderson suggests in his answer.
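The difference between the two strategies can be sketched side by side. The hook names, the DOUBLE_ENCODED_KEYS set, and the "code" field below are assumptions made for illustration; only timezone_id comes from the question.

```python
import json

DOUBLE_ENCODED_KEYS = {"timezone_id"}  # assumption: the only affected key

def decode_known_keys(obj):
    # Targeted hook: only touch keys known to be double encoded.
    for key in DOUBLE_ENCODED_KEYS:
        value = obj.get(key)
        if isinstance(value, str):
            try:
                obj[key] = json.loads(value)
            except ValueError:
                pass  # leave the value alone if it is not JSON after all
    return obj

def decode_everything(obj):
    # Blanket hook: try every string value, shown here for comparison.
    for key, value in list(obj.items()):
        if isinstance(value, str):
            try:
                obj[key] = json.loads(value)
            except ValueError:
                pass
    return obj

# "code" is a made-up field whose value is meant to stay a string.
sample = r'{"code": "12345", "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}"}'

targeted = json.loads(sample, object_hook=decode_known_keys)
blanket = json.loads(sample, object_hook=decode_everything)

print(repr(targeted["code"]))  # still the string '12345'
print(repr(blanket["code"]))   # silently became the integer 12345
```

Both hooks decode timezone_id into a dict, but only the targeted one leaves unrelated string values untouched.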