How to execute OpenRefine JSON on CSV in Python?

Posted 2019-04-12 23:33

I am trying to find a Python solution that can execute the following OpenRefine operations, expressed as JSON, without a running OpenRefine server. My OpenRefine JSON contains mappings and custom Jython/Python commands for each field of a properly formatted CSV file, so this goes beyond simply reading JSON. Here is one example of OpenRefine JSON code that contains only regex mappings:

[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Sleep using expression jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Sleep",
    "expression": "jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
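To illustrate, the Jython expression above first collapses a leading zero after the "h" separator, then replaces "h" with "*60+" and evaluates the result, so a value such as "7h30" (an assumed sample; my real data may differ slightly) becomes 7*60+30 = 450 minutes. A plain-Python equivalent of just that expression:

import re

def sleep_to_minutes(value):
    # Mirrors the Jython expression above: "7h05" -> "7h5" -> "7*60+5" -> 425
    value = re.sub("h0", "h", value)
    value = re.sub("h", "*60+", value)
    return eval(value)

print(sleep_to_minutes("7h30"))  # 450, assuming values look like "<hours>h<minutes>"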

One solution is to process the JSON element by element, handling each operation type separately, but there may be easier solutions based on existing packages.
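For illustration, here is a rough sketch of that manual approach (file names are placeholders, only the "core/text-transform" + Jython case from the example above is handled, engineConfig/onError are ignored, and running Jython snippets under CPython like this only works for simple expressions):

import json
import re
import pandas as pd

def run_jython_expression(expression, value):
    # Wrap the Jython snippet (which uses a top-level `return`) in a function
    # body and execute it with CPython.
    body = expression.replace("jython:", "")
    indented = "\n".join("    " + line for line in body.splitlines())
    namespace = {}
    exec("def _transform(value):\n" + indented, namespace)
    return namespace["_transform"](value)

def apply_operations(df, operations):
    for op in operations:
        if op.get("op") == "core/text-transform" and op["expression"].startswith("jython:"):
            column = op["columnName"]
            df[column] = df[column].apply(
                lambda v: run_jython_expression(op["expression"], v))
        else:
            raise NotImplementedError("Unsupported operation: %s" % op.get("op"))
    return df

# Placeholder file names
df = apply_operations(pd.read_csv("input.csv"), json.load(open("transform.json")))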

Python: 3.5.2
OS: Debian 9

4 answers
贼婆χ
Answer 2 · 2019-04-13 00:02

The pyrefine project aims to do exactly that, but it is still a work in progress and only a few operations are supported so far. Contributors are welcome!

叛逆
Answer 3 · 2019-04-13 00:18

Here are some additional projects:

可以哭但决不认输i
Answer 4 · 2019-04-13 00:24

I do not know of any ready-made solution, and I am not sure how best to do this. A workaround might be to transform the Jython scripts into functions and then apply them to your CSV using pandas.

import json
import re

# Load the OpenRefine operation history exported as JSON
with open("your open refine json", "r") as infile:
    data = json.load(infile)

with open("result.py", "w") as outfile:
    count = 1  # initialise once, outside the loop, so each function gets a unique name
    for el in data:
        column = el['columnName']
        # Strip the "jython:" prefix and re-indent the script as a function body
        expression = el['expression'].replace("jython:", "")
        expression = re.sub(r"\n\n\s?\n?", "\n    ", expression)
        expression = re.sub(r";$", "", expression)
        result = "def function%s(value):\n    %s" % (count, expression)
        outfile.write(result +
                      "\n\n" +
                      "mycsv['%s'] = mycsv['%s'].apply(lambda x: function%s(x))" % (column, column, count) +
                      "\n\n")
        count += 1

Result for your JSON:

def function1(value):
    import re
    value = re.sub("h0", "h",value)
    value = re.sub("h","*60+", value)
    return eval(value)
    return eval(value.replace("h", "*60+")) # is this second return an error ?

mycsv['Sleep'] = mycsv['Sleep'].apply(lambda x: function1(x))

Finally, you can add these lines at the top of result.py, launch the script, and pray...

import pandas as pd 

mycsv = pd.read_csv("your csv", encoding = "utf8")
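If it runs through, you will presumably also want to write the transformed table back out; a one-line, assumed completion of the workflow:

# Hypothetical output path
mycsv.to_csv("output.csv", index=False, encoding="utf8")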
【Aperson】
Answer 5 · 2019-04-13 00:25

I have found a variety of different alternatives and tried to weigh them against each other. Pyrefine is so far the only genuine Python solution.

Alternatives

I. A partial solution here uses Python to generate a dictionary-style lookup function in R to perform the conversions. It does not implement GREL edits, Jython/Python edits, or Clojure edits.

#!/usr/bin/env python2
#
# Description
#      This builds a dictionary-style structure in R with Python
#      to apply the JSON edits to other data, with only `cluster-edit` support.
#
#      The original source is (1), on which I have made some remarks.
#
# Further reading
#
# (1) Original source of the code: https://medium.com/optima-blog/semi-automated-text-cleaning-in-r-68054a9491da
import json
import sys
import os

if len(sys.argv) < 2:
    print "USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R"
    exit(1)

json_file = sys.argv[-1]
# conversions = json.load(open("state_clustering.json"))
conversions = json.load(open(json_file))
function_name = os.path.splitext(os.path.basename(json_file))[0]

print "%s = function(df) {" % function_name
for conv in conversions:

    # THIS WILL fire ERRORS WITHOUT try-catch, e.g. with regexes
    edits = conv['edits']

    columnName = str(conv['columnName'])
    for edit in edits:
        froms = edit['from']
        to = edit['to']
        for source in froms:
            source = str(source)
            to = str(to)
            print " df[df[, %s] == %s, %s] = %s" % (repr(columnName),
                repr(source), repr(columnName), repr(to))
print " df"
print "}"

The output is an R function; it could then be adapted to Python, as sketched below.
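If you would rather stay in Python altogether, the same edits structure can be applied directly with pandas instead of generating R code. A minimal sketch, assuming the operations are OpenRefine "core/mass-edit" entries and using placeholder file names:

import json
import pandas as pd

df = pd.read_csv("input.csv", encoding="utf8")   # placeholder input file

with open("edits.json") as fh:                   # placeholder OpenRefine JSON
    conversions = json.load(fh)

for conv in conversions:
    if conv.get("op") != "core/mass-edit":
        continue                                  # only cluster/mass edits are handled
    column = conv["columnName"]
    mapping = {}
    for edit in conv["edits"]:
        for source in edit["from"]:
            mapping[source] = edit["to"]
    df[column] = df[column].replace(mapping)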

II. P3-batchrefine is mostly written in Java, with some Python. It lets you run the transformations as follows (not a genuine Python solution unless you are fine with calling external Java tools; see the subprocess sketch below the command).

./bin/batchrefine remote input.csv transform.json > output.csv
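If you want to drive that command from Python rather than the shell, it can be wrapped with subprocess (a sketch reusing the same file names):

import subprocess

# Run the batchrefine CLI shown above and capture its CSV output
with open("output.csv", "w") as out:
    subprocess.run(
        ["./bin/batchrefine", "remote", "input.csv", "transform.json"],
        stdout=out,
        check=True,
    )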

III. Pyrefine is a genuine Python solution, and it aims to work in the following way (copied from its docs):

import pyrefine
import pandas as pd

with open('script.json') as script_file:
    script = pyrefine.parse(script_file)

input_data = pd.read_csv('input.csv')
output_data = script.execute(input_data)

Further Information on parsing OpenRefine JSON

  1. Trying to parse a Json with Open Refine GREL

  2. Parse JSON in Google Refine

  3. Best way to parse a big and intricated Json file with OpenRefine (or R)
