Python decode nested JSON in JSON

Posted 2020-05-29 05:59

I'm dealing with an API that unfortunately returns malformed (or "weirdly formed," rather -- thanks @fjarri) JSON, but on the positive side I think it may be an opportunity for me to learn something about recursion as well as JSON. It's for an app I use to log my workouts; I'm trying to make a backup script.

I can receive the JSON fine, but even after requests.get(api_url).json() (or json.loads(requests.get(api_url).text)), one of the values is still a JSON-encoded string. Luckily, I can just json.loads() the string and it properly decodes to a dict. The specific key is predictable: timezone_id, whereas its value varies (because data has been logged in multiple timezones). For example, after decoding, it might be dumped to file as "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}", or loaded into Python as 'timezone_id': '{"name":"America/Denver","seconds":"-21600"}'.
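To make the double encoding concrete, here is a minimal sketch. The one-key `raw` payload is made up for illustration and only mirrors the shape of the timezone_id values described above; it is not actual API output.

```python
import json

# Hypothetical one-key payload, modeled on the timezone_id values described above.
raw = r'{"timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}"}'

outer = json.loads(raw)
# After the first decode, the value is still a JSON-encoded string.
print(type(outer["timezone_id"]).__name__)  # str

inner = json.loads(outer["timezone_id"])
# The second decode turns that string into a dict.
print(type(inner).__name__)  # dict
print(inner["name"])         # America/Denver
```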

The problem is that I'm using this API to retrieve a fair amount of data, which has several layers of dicts and lists, and the double-encoded timezone_id values occur at multiple levels.

Here's my work so far with some example data, but it seems like I'm pretty far off base.

#! /usr/bin/env python3

import json
from pprint import pprint

my_input = r"""{
    "hasMore": false,
    "checkins": [
        {
            "timestamp": 1353193745000,
            "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
            "privacy_groups": [
                "private"
            ],
            "meta": {
                "client_version": "3.0",
                "uuid": "fake_UUID"
            },
            "client_id": "fake_client_id",
            "workout_name": "Workout (Nov 17, 2012)",
            "fitness_workout_json": {
                "exercise_logs": [
                    {
                        "timestamp": 1353195716000,
                        "type": "exercise_log",
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    },
                    {
                        "timestamp": 1353195340000,
                        "type": "exercise_log",
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    }
                ]
            },
            "workout_uuid": ""
        },
        {
            "timestamp": 1354485615000,
            "user_id": "fake_ID",
            "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
            "privacy_groups": [
                "private"
            ],
            "meta": {
                "uuid": "fake_UUID"
            },
            "created": 1372023457376,
            "workout_name": "Workout (Dec 02, 2012)",
            "fitness_workout_json": {
                "exercise_logs": [
                    {
                        "timestamp": 1354485615000,
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    },
                    {
                        "timestamp": 1354485584000,
                        "timezone_id": "{\"name\":\"America/Denver\",\"seconds\":\"-21600\"}",
                        "workout_log_uuid": "fake_UUID"
                    }
                ]
            },
            "workout_uuid": ""
        }]}"""

def recurse(obj):
    if isinstance(obj, list):
        for item in obj:
            return recurse(item)
    if isinstance(obj, dict):
        for k, v in obj.items():
            if isinstance(v, str):
                try:
                    v = json.loads(v)
                except ValueError:
                    pass
                obj.update({k: v})
            elif isinstance(v, (dict, list)):
                return recurse(v)

pprint(json.loads(my_input, object_hook=recurse))

Any suggestions for a good way to json.loads() all those double-encoded values without changing the rest of the object? Many thanks in advance!

This post seems to be a good reference: Modifying Deeply-Nested Structures

Edit: This was flagged as a possible duplicate of this question -- I think it's fairly different, as I've already demonstrated that using json.loads() was not working. The solution ended up requiring an object_hook, which I've never had to use when decoding JSON and is not addressed in the prior question.

2 Answers
ゆ 、 Hurt°
#2 · 2020-05-29 06:20

So, the object_hook in the json loader is going to be called each time the json loader is finished constructing a dictionary. That is, the first thing it is called on is the inner-most dictionary, working outwards.

The dictionary that the object_hook callback is given is replaced by what that function returns.

So, you don't need to recurse yourself. The loader is giving you access to the inner-most things first by its nature.
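A small sketch of that call order, assuming a made-up three-level document; `tracing_hook` and `seen` are names introduced here purely for illustration:

```python
import json

seen = []

def tracing_hook(obj):
    # Record the keys of each dict in the order the decoder hands them to the hook.
    seen.append(sorted(obj))
    # Whatever is returned here replaces the dict in the final result.
    return obj

json.loads('{"outer": {"middle": {"inner": 1}}}', object_hook=tracing_hook)
print(seen)  # [['inner'], ['middle'], ['outer']]
```

The inner-most dict shows up first and the top-level dict last, which is why the hook never has to walk the structure itself.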

I think this will work for you:

def hook(obj):
    value = obj.get("timezone_id")
    # this is python 3 specific; I would check isinstance against 
    # basestring in python 2
    if value and isinstance(value, str):
        obj["timezone_id"] = json.loads(value, object_hook=hook)
    return obj
data = json.loads(my_input, object_hook=hook)

It seems to have the effect I think you're looking for when I test it.

I probably wouldn't try to decode every string value -- I would call json.loads() only where you expect a double-encoded JSON object. If you try to decode every string, you might accidentally decode something that is supposed to stay a string (like "12345" when the API intends it to be a string).
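To illustrate the risk, here is a short sketch of what json.loads() does with a few plain strings. The "3.0" and "fake_UUID" values are borrowed from the example data in the question; "12345" is the hypothetical case mentioned above.

```python
import json

# Strings that happen to be valid JSON silently change type when decoded.
decoded = {s: json.loads(s) for s in ["12345", "3.0"]}
print(type(decoded["12345"]).__name__)  # int
print(type(decoded["3.0"]).__name__)    # float

# Strings that are not valid JSON raise ValueError (json.JSONDecodeError is a subclass).
try:
    json.loads("fake_UUID")
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

So a decode-everything hook would leave "fake_UUID" alone but turn a "client_version" of "3.0" into the float 3.0.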

Also, your existing function is more complicated than it needs to be; it might work as-is if you always returned obj (whether you update its contents or not).

The star\"
#3 · 2020-05-29 06:37

Your main issue is that your object_hook function should not be recursing. json.loads() takes care of the recursion itself and calls your function every time it finishes decoding a dictionary (i.e. obj will always be a dictionary). So instead you just want to modify the problematic keys and return the dict -- this should do what you are looking for:

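def flatten_hook(obj):
    # Python 3 spelling; on Python 2 use obj.iteritems() and basestring instead.
    for key, value in obj.items():
        if isinstance(value, str):
            try:
                # Replace any string that is itself valid JSON with its decoded form.
                obj[key] = json.loads(value, object_hook=flatten_hook)
            except ValueError:
                pass  # not JSON; leave the string untouched
    return obj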

pprint(json.loads(my_input, object_hook=flatten_hook))

However, if you know the problematic (double-encoded) entry always takes a specific form (e.g. key == 'timezone_id'), it is probably safer to call json.loads() on those keys only, as Matt Anderson suggests in his answer.
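If the data has already been decoded once (for example by requests' .json()), the same key-restricted idea can be applied after the fact with a small recursive walk instead of an object_hook. This is only a sketch: `decode_nested` and its `keys` parameter are hypothetical names, and the sample `data` is a trimmed-down imitation of the question's structure.

```python
import json

def decode_nested(obj, keys=("timezone_id",)):
    """Return a copy of obj with double-encoded values under `keys` decoded."""
    if isinstance(obj, dict):
        result = {}
        for k, v in obj.items():
            if k in keys and isinstance(v, str):
                try:
                    v = json.loads(v)
                except ValueError:
                    pass  # not JSON after all; keep the original string
            result[k] = decode_nested(v, keys)
        return result
    if isinstance(obj, list):
        return [decode_nested(item, keys) for item in obj]
    return obj  # numbers, strings, booleans, None pass through unchanged

# Trimmed-down imitation of the already-decoded API response.
data = {
    "checkins": [
        {
            "timezone_id": '{"name":"America/Denver","seconds":"-21600"}',
            "meta": {"client_version": "3.0"},
        }
    ]
}

clean = decode_nested(data)
print(clean["checkins"][0]["timezone_id"]["name"])  # America/Denver
```

Because only the listed keys are touched, values such as "client_version" stay strings, and the input object is left unmodified.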
