How to execute OpenRefine JSON on CSV in Python?

Posted 2019-04-12 23:33

I am trying to find a Python solution that can execute the following OpenRefine operations, expressed as JSON, without a running OpenRefine server. My OpenRefine JSON contains mappings and custom Jython/Python commands for each field of a properly formatted CSV file, so this goes beyond simply reading JSON. Here is one example of OpenRefine JSON code that contains only regex mappings:

[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Sleep using expression jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Sleep",
    "expression": "jython:import re\n\nvalue = re.sub(\"h0\", \"h\",value)\n\nvalue = re.sub(\"h\",\"*60+\", value)\n\nreturn eval(value)\n\n \nreturn eval(value.replace(\"h\", \"*60+\"));",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]
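To illustrate, the Jython expression above first collapses a leading zero after the "h" separator, then replaces "h" with "*60+" and evaluates the result, so a value such as "7h30" (an assumed sample; my real data may differ slightly) becomes 7*60+30 = 450 minutes. A plain-Python equivalent of just that expression:

import re

def sleep_to_minutes(value):
    # Mirrors the Jython expression above: "7h05" -> "7h5" -> "7*60+5" -> 425
    value = re.sub("h0", "h", value)
    value = re.sub("h", "*60+", value)
    return eval(value)

print(sleep_to_minutes("7h30"))  # 450, assuming values look like "<hours>h<minutes>"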

One solution is to process the JSON element by element, handling each operation type separately, but there may be easier solutions based on existing packages.
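For illustration, here is a rough sketch of that manual approach (file names are placeholders, only the "core/text-transform" + Jython case from the example above is handled, engineConfig/onError are ignored, and running Jython snippets under CPython like this only works for simple expressions):

import json
import re
import pandas as pd

def run_jython_expression(expression, value):
    # Wrap the Jython snippet (which uses a top-level `return`) in a function
    # body and execute it with CPython.
    body = expression.replace("jython:", "")
    indented = "\n".join("    " + line for line in body.splitlines())
    namespace = {}
    exec("def _transform(value):\n" + indented, namespace)
    return namespace["_transform"](value)

def apply_operations(df, operations):
    for op in operations:
        if op.get("op") == "core/text-transform" and op["expression"].startswith("jython:"):
            column = op["columnName"]
            df[column] = df[column].apply(
                lambda v: run_jython_expression(op["expression"], v))
        else:
            raise NotImplementedError("Unsupported operation: %s" % op.get("op"))
    return df

# Placeholder file names
df = apply_operations(pd.read_csv("input.csv"), json.load(open("transform.json")))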

Python: 3.5.2
OS: Debian 9

4 answers
贼婆χ
Answer 2 · 2019-04-13 00:02

The pyrefine project aims to do exactly that, but it is still a work in progress and only a few operations are supported so far. Contributors are welcome!

叛逆
Answer 3 · 2019-04-13 00:18

Here are some additional projects:

可以哭但决不认输i
Answer 4 · 2019-04-13 00:24

I do not know of any ready-made solution, and I am not sure how best to do this. A workaround might be to transform the Jython scripts into functions and then apply them to your CSV using pandas.

import json
import re

# Load the OpenRefine operation history exported as JSON
with open("your open refine json", "r") as infile:
    data = json.load(infile)

with open("result.py", "w") as outfile:
    count = 1  # initialise once, outside the loop, so each function gets a unique name
    for el in data:
        column = el['columnName']
        # Strip the "jython:" prefix and re-indent the script as a function body
        expression = el['expression'].replace("jython:", "")
        expression = re.sub(r"\n\n\s?\n?", "\n    ", expression)
        expression = re.sub(r";$", "", expression)
        result = "def function%s(value):\n    %s" % (count, expression)
        outfile.write(result +
                      "\n\n" +
                      "mycsv['%s'] = mycsv['%s'].apply(lambda x: function%s(x))" % (column, column, count) +
                      "\n\n")
        count += 1

Result for your JSON:

def function1(value):
    import re
    value = re.sub("h0", "h",value)
    value = re.sub("h","*60+", value)
    return eval(value)
    return eval(value.replace("h", "*60+")) # is this second return an error ?

mycsv['Sleep'] = mycsv['Sleep'].apply(lambda x: function1(x))

Finally, you can add these lines at the top of result.py, launch the script, and pray...

import pandas as pd 

mycsv = pd.read_csv("your csv", encoding = "utf8")
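If it runs through, you will presumably also want to write the transformed table back out; a one-line, assumed completion of the workflow:

# Hypothetical output path
mycsv.to_csv("output.csv", index=False, encoding="utf8")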
【Aperson】
Answer 5 · 2019-04-13 00:25

I have found a variety of different alternatives and tried to weigh them against each other. Pyrefine is so far the only genuine Python solution.

Alternatives

I. A partial solution here uses Python to generate a dictionary-style lookup function in R to perform the conversions. It does not implement GREL edits, Jython/Python edits, or Clojure edits.

#!/usr/bin/env python2
#
# Description
#      This builds a dictionary-style structure in R with Python
#      to apply the JSON edits to other data, with only `cluster-edit` support.
#
#      The original source is (1), on which I have made some remarks.
#
# Further reading
#
# (1) Original source of the code: https://medium.com/optima-blog/semi-automated-text-cleaning-in-r-68054a9491da
import json
import sys
import os

if len(sys.argv) < 2:
    print "USAGE: ./utils/open_refine_to_R.py [edits.json] > r_file.R"
    exit(1)

json_file = sys.argv[-1]
# conversions = json.load(open("state_clustering.json"))
conversions = json.load(open(json_file))
function_name = os.path.splitext(os.path.basename(json_file))[0]

print "%s = function(df) {" % function_name
for conv in conversions:

    # THIS WILL fire ERRORS WITHOUT try-catch, e.g. with regexes
    edits = conv['edits']

    columnName = str(conv['columnName'])
    for edit in edits:
        froms = edit['from']
        to = edit['to']
        for source in froms:
            source = str(source)
            to = str(to)
            print " df[df[, %s] == %s, %s] = %s" % (repr(columnName),
                repr(source), repr(columnName), repr(to))
print " df"
print "}"

The output is an R function; it could then be adapted to Python, as sketched below.
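If you would rather stay in Python altogether, the same edits structure can be applied directly with pandas instead of generating R code. A minimal sketch, assuming the operations are OpenRefine "core/mass-edit" entries and using placeholder file names:

import json
import pandas as pd

df = pd.read_csv("input.csv", encoding="utf8")   # placeholder input file

with open("edits.json") as fh:                   # placeholder OpenRefine JSON
    conversions = json.load(fh)

for conv in conversions:
    if conv.get("op") != "core/mass-edit":
        continue                                  # only cluster/mass edits are handled
    column = conv["columnName"]
    mapping = {}
    for edit in conv["edits"]:
        for source in edit["from"]:
            mapping[source] = edit["to"]
    df[column] = df[column].replace(mapping)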

II. P3-batchrefine is mostly written in Java, with some Python. It lets you run the transformations as follows (not a genuine Python solution unless you are fine with calling external Java tools; see the subprocess sketch below the command).

./bin/batchrefine remote input.csv transform.json > output.csv
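If you want to drive that command from Python rather than the shell, it can be wrapped with subprocess (a sketch reusing the same file names):

import subprocess

# Run the batchrefine CLI shown above and capture its CSV output
with open("output.csv", "w") as out:
    subprocess.run(
        ["./bin/batchrefine", "remote", "input.csv", "transform.json"],
        stdout=out,
        check=True,
    )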

III. Pyrefine is a genuine Python solution, and it aims to work in the following way (copied from its docs):

import pyrefine
import pandas as pd

with open('script.json') as script_file:
    script = pyrefine.parse(script_file)

input_data = pd.read_csv('input.csv')
output_data = script.execute(input_data)

Further Information on parsing OpenRefine JSON

  1. Trying to parse a Json with Open Refine GREL

  2. Parse JSON in Google Refine

  3. Best way to parse a big and intricated Json file with OpenRefine (or R)
