When screen-scraping some website, I extract data from <script> tags.
The data I get is not in standard JSON format. I cannot use json.loads().
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = {'x':1, 'y':2, 'z':3}
Currently, I use regex to transform the raw data into JSON format, but this becomes painful when I encounter complicated data structures.
Is there a better solution?
demjson.decode()
import demjson
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = demjson.decode(js_obj)
jsonnet.evaluate_snippet()
import json, _jsonnet
# from
js_obj = '{x:1, y:2, z:3}'
# to
py_obj = json.loads(_jsonnet.evaluate_snippet('snippet', js_obj))
ast.literal_eval()
import ast
# from
js_obj = "{'x':1, 'y':2, 'z':3}"
# to
py_obj = ast.literal_eval(js_obj)
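Note that ast.literal_eval() only accepts Python literal syntax, so it fits the example above because the keys are already quoted. A minimal sketch of how one might check whether a given string qualifies (is_python_literal is a hypothetical helper name, not part of ast):

```python
import ast


def is_python_literal(text):
    """Return True if ast.literal_eval() can parse text (hypothetical helper)."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False


print(is_python_literal("{'x':1, 'y':2, 'z':3}"))  # quoted keys
print(is_python_literal("{x:1, y:2, z:3}"))        # bare JavaScript-style keys
print(is_python_literal("{'ok': true}"))           # JavaScript true, not Python True
```

Only the first string is accepted; bare keys and JavaScript's true/false/null are names rather than literals in Python, so literal_eval() rejects them.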
This will likely not work everywhere, but as a start, here's a simple regex that converts the keys into quoted strings so you can pass the result to json.loads. Or is this what you're already doing?
In[70] : quote_keys_regex = r'([\{\s,])(\w+)(:)'
In[71] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj)
Out[71]: '{"x":1, "y":2, "z":3}'
In[72] : js_obj_2 = '{x:1, y:2, z:{k:3,j:2}}'
In[73] : re.sub(quote_keys_regex, r'\1"\2"\3', js_obj_2)
Out[73]: '{"x":1, "y":2, "z":{"k":3,"j":2}}'
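Wrapped up as a function, the regex approach might look like the sketch below. The name loads_unquoted_keys is made up for illustration, and it assumes the only thing wrong with the input is unquoted keys.

```python
import json
import re

# Same pattern as in the session above: a bare word preceded by '{',
# whitespace, or ',' and followed by ':' is treated as an object key.
QUOTE_KEYS_REGEX = re.compile(r'([\{\s,])(\w+)(:)')


def loads_unquoted_keys(js_obj):
    """Quote bare keys, then parse the result as JSON (hypothetical helper)."""
    quoted = QUOTE_KEYS_REGEX.sub(r'\1"\2"\3', js_obj)
    return json.loads(quoted)


print(loads_unquoted_keys('{x:1, y:2, z:{k:3,j:2}}'))
```

Be aware that the pattern does not know about string boundaries: a value such as "a, b: c" contains a word followed by a colon, so it would be rewritten too and the resulting text would no longer parse.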
Simply:
import json
py_obj = json.loads(js_obj_stringified)
Above is the Python portion of the code. In the JavaScript portion of the code:
js_obj_stringified = JSON.stringify(data);
JSON.stringify turns a JavaScript object into JSON text and stores that text in a string. It is a safe way to pass a JavaScript object (via POST/GET) to Python for processing.
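For illustration, here is roughly what the Python side sees, assuming you control the JavaScript that produces the data. The string is hard-coded as a stand-in for the text JSON.stringify would produce for {x: 1, y: 2, z: {k: 3}}; in practice it would arrive with the request.

```python
import json

# Stand-in for the request payload (an assumption for this sketch):
# the JSON text JSON.stringify emits for {x: 1, y: 2, z: {k: 3}}.
js_obj_stringified = '{"x":1,"y":2,"z":{"k":3}}'

# Keys are quoted in the stringified form, so json.loads() accepts it directly.
py_obj = json.loads(js_obj_stringified)
print(py_obj["z"]["k"])
```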