Process huge GeoJSON file with jq

Published 2020-05-07 07:00

Question:

Given a GeoJSON file as follows:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

I want to end up with the following:

{
  "type": "FeatureCollection",
  "features": [
    {
      "tippecanoe": {"minzoom": 13},
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....

i.e. I have added the tippecanoe object to each feature in the features array.

I can make this work with:

 jq '.features[].tippecanoe.minzoom = 13' <GEOJSON FILE> > <OUTPUT FILE>

This is fine for small files, but processing a large file of 414MB seems to take forever, with the processor maxing out and nothing being written to the output file.

Reading further into jq, it appears that the --stream command-line option may help, but I am completely confused as to how to use it for my purposes.

I would be grateful for an example command line that serves my purposes along with an explanation as to what --stream is doing.

Answer 1:

A one-pass jq-only approach may require more RAM than is available. If that is the case, then a simple all-jq approach is shown below, together with a more economical approach based on using jq along with awk.

The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.

In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.

jq-only

jq -c  '.features[]' input.json |
    jq -c '.tippecanoe.minzoom = 13' |
    jq -c -s '{type: "FeatureCollection", features: .}'

jq and awk

jq -c '.features[]' input.json |
   jq -c '.tippecanoe.minzoom = 13' | awk '
     BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
     NR==1 { print; next }
           {print ","; print}
     END   {print "] }";}'
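The awk step above only wraps the streamed objects in the FeatureCollection envelope and comma-joins them, without parsing any JSON. For illustration, the same reconstitution can be sketched in Python reading one object per line from stdin (a stand-in for the awk script, not taken from the answer; reconstitute is a name made up here):

```python
import sys

def reconstitute(lines, out):
    # Wrap a stream of JSON objects (one per line) back into a
    # {"type": "FeatureCollection", "features": [...]} document,
    # emitting commas between features but never parsing the objects.
    out.write('{"type": "FeatureCollection", "features": [\n')
    first = True
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if not first:
            out.write(",\n")
        out.write(line)
        first = False
    out.write("\n]}\n")

if __name__ == "__main__":
    reconstitute(sys.stdin, sys.stdout)
```

Like the awk version, this runs in constant memory because each feature is treated as an opaque line of text.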

Performance comparison

For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.

u+s (user + system CPU time):

jq-only:              15m 15s
jq-awk:                7m 40s
jq one-pass using map: 6m 53s


Answer 2:

An alternative solution could be for example:

jq '.features |= map_values(.tippecanoe.minzoom = 13)'

To test this, I created a sample JSON in Python as

d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}

and inspected the execution time as a function of N. Interestingly, while the map_values approach seems to have linear complexity in N, .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior (already for N=50000, the former method finishes in about 0.8 seconds, while the latter needs around 47 seconds).
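For reference, here is the fragment above expanded into a runnable generator script (make_sample is a name invented here; raise n toward 50000 to reproduce the timing comparison):

```python
import json

def make_sample(n):
    # Build the test document from the fragment above:
    # n identical features with a FEATCODE property.
    return {"features": [
        {"type": "Feature", "properties": {"FEATCODE": 15014}}
        for _ in range(n)
    ]}

if __name__ == "__main__":
    # Print a small sample document; redirect to a file for timing jq.
    print(json.dumps(make_sample(5)))
```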

Alternatively, one might just do it manually with, e.g., Python:

import json
import sys

data = {}
with open(sys.argv[1], 'r') as F:
    data = json.load(F)

extra_item = {"minzoom" : 13}
for feature in data['features']:
    feature["tippecanoe"] = extra_item

with open(sys.argv[2], 'w') as F:
    F.write(json.dumps(data))
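The script above parses the entire document into one object tree. As a hedged sketch of a lower-overhead variant, the standard library's json.JSONDecoder.raw_decode can rewrite one feature at a time, so the full tree never exists in memory at once (the file text still does; add_minzoom is a hypothetical name):

```python
import json

def add_minzoom(text):
    # Decode one feature at a time with raw_decode, inject the
    # tippecanoe object, and re-emit. Assumes the document has the
    # shape {"type": "FeatureCollection", "features": [ ... ]}.
    decoder = json.JSONDecoder()
    pos = text.index("[", text.index('"features"')) + 1
    parts = ['{"type": "FeatureCollection", "features": [']
    first = True
    while True:
        # Skip whitespace and commas between features.
        while text[pos] in " \t\r\n,":
            pos += 1
        if text[pos] == "]":
            break
        feature, pos = decoder.raw_decode(text, pos)
        feature["tippecanoe"] = {"minzoom": 13}
        parts.append(("" if first else ",") + json.dumps(feature))
        first = False
    parts.append("]}")
    return "".join(parts)
```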


Answer 3:

In this case, map rather than map_values is far faster (*):

.features |= map(.tippecanoe.minzoom = 13)

However, this approach will still require enough RAM to hold the whole document.

P.S. If you want to use jq to generate a large file for timing, consider:

def N: 1000000;

def data:
   {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

(*) Using map, 20s for 100MB, and approximately linear.



Answer 4:

Here, based on work by @nicowilliams on GitHub, is a solution that uses jq's streaming parser. The solution is very economical with memory, but is currently quite slow if the input is large.

The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.
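For intuition about what the streaming parser emits: jq's tostream (and the --stream option) turns a document into a sequence of [path, leaf] events, with a bare [path] event closing each non-empty array or object. A small Python model of that encoding (an illustration, not jq itself):

```python
def to_stream(value, path=()):
    # Yield [path, leaf] events in the manner of jq's `tostream`,
    # plus a closing [path] event after each non-empty container.
    if isinstance(value, dict) and value:
        for key in value:
            yield from to_stream(value[key], path + (key,))
        yield [list(path + (key,))]             # close the object
    elif isinstance(value, list) and value:
        for i, item in enumerate(value):
            yield from to_stream(item, path + (i,))
        yield [list(path + (len(value) - 1,))]  # close the array
    else:
        # Scalars and empty containers are emitted as leaves.
        yield [list(path), value]
```

Leaf events have length 2 and closing events length 1; the inject function below keys off exactly that (length == 1, with a truncated path of length 1) to find the end of each feature and splice in the new object's events.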

Invocation:

jq -cnr --stream -f program.jq input.json

program.jq

# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end ;

# Input: the object to be added
# Output: text
def output:
  . as $object
  | ( "[",
      foreach fromstream( inject($object) ) as $o
        (0;
         if .==0 then 1 else 2 end;
         if .==1 then $o else ",", $o end),
      "]" ) ;

{}
| .tippecanoe.minzoom = 13
| output

Generation of test data

def data(N):
  {"features":
   [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };

Example output

With N=2:

[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]