How can I concatenate a list of JSON files into one huge JSON array? I have 5000 files and 550,000 list items.
My first try was to use jq, but it looks like jq -s is not optimized for large input.
jq -s -r '[.[][]]' *.js
This command works, but takes way too long to complete and I really would like to solve this with Python.
Here is my current code:
import json

def concatFiles(outName, inFileNames):
    def listGenerator():
        for inName in inFileNames:
            with open(inName, 'r') as f:
                for item in json.load(f):
                    yield item

    with open(outName, 'w') as f:
        json.dump(listGenerator(), f)
I'm getting:
TypeError: <generator object listGenerator at 0x7f94dc2eb3c0> is not JSON serializable
Any attempt to load all files into RAM triggers the Linux OOM killer. Do you have any ideas?
You should derive from list and override the __iter__ method. The result is:
[1, [1, 2, 3], [20, 30, 40]]
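A minimal sketch of that approach, since the answer's code is not shown here. The names StreamArray and numbers are illustrative, and the __len__ override is an assumption: the pure-Python encoder used by json.dump writes "[]" for a falsy list without iterating it, so the class reports a non-zero length.

```python
import io
import json


def numbers():
    # Stand-in for a large stream of items.
    yield 20
    yield 30
    yield 40


class StreamArray(list):
    """A list subclass whose items come from a generator function."""

    def __init__(self, make_iter):
        super().__init__()
        self._make_iter = make_iter

    def __iter__(self):
        # The encoder iterates the object, so it receives the generator's items.
        return self._make_iter()

    def __len__(self):
        # Not the real length: only signals "not empty" to the encoder.
        return 1


out = io.StringIO()
# json.dump writes the output chunk by chunk instead of building one string.
json.dump([1, [1, 2, 3], StreamArray(numbers)], out)
result = out.getvalue()
print(result)  # [1, [1, 2, 3], [20, 30, 40]]
```

Because the class is a list subclass, the encoder's isinstance check accepts it, while the items are produced lazily during iteration.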
As of simplejson 3.8.0, you can use the iterable_as_array option to make any iterable serializable into an array. The result is:
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Based on the accepted answer, here is the StreamArray I eventually went for. It contains two lies:
- self.__tail__ might be immutable
- len(StreamArray(some_gen)) is either 0 or 1
Single use only!
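The code of this answer is not shown here, so the following is only a rough sketch of a wrapper with the same two properties: a length that is 0 or 1 regardless of the real item count, and single-use iteration. The attribute names _head, _tail and _length are assumptions, not the author's.

```python
import json


class StreamArray(list):
    """List look-alike that streams items from a one-shot iterator."""

    def __init__(self, iterable):
        super().__init__()
        self._tail = iter(iterable)
        try:
            # Peek at the first item to learn whether the stream is empty.
            self._head = [next(self._tail)]
        except StopIteration:
            self._head = []
        # The "lie": 0 for an empty stream, 1 for any non-empty stream.
        self._length = len(self._head)

    def __iter__(self):
        # Hand out the peeked item first, then the rest of the stream.
        head, self._head = self._head, []
        yield from head
        yield from self._tail

    def __len__(self):
        return self._length


encoder = json.JSONEncoder()
squares = StreamArray(i * i for i in range(5))
first = "".join(encoder.iterencode(squares))
leftover = [x for x in squares]  # the stream is already consumed
empty = "".join(encoder.iterencode(StreamArray([])))
print(first, leftover, empty)
```

After the first pass the wrapper still reports a length of 1 but yields nothing, which is why it must not be serialized twice.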
A complete, simple, readable solution that can serialize a generator from a normal or an empty iterable and works with both .encode() and .iterencode(). Tests are written. Tested with Python 2.7, 3.0, 3.3, 3.6.
Used solutions: Vadim Pushtaev (incomplete), user1158559 (unnecessarily complicated) and Claude (in another question, also complicated).
Useful simplifications are:
- [...] __init__, because we can expect that the SerializableGenerator is called immediately before json.dumps (unlike user1158559's solution).
- [...] __repr__. It is better to also store the generator in the list, to provide meaningful results like [<generator object ...>] (unlike Claude's solution). The default __len__ and __bool__ methods now work correctly to recognize an empty and a non-empty object.
An advantage of this solution is that a standard JSON serializer can be used without params. If nested generators should be supported, or if encapsulation by SerializableGenerator(iterator) is undesirable, then I recommend the IterEncoder answer.
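A sketch of how such a SerializableGenerator could look and how it might be applied to the file concatenation from the question. The class body is reconstructed from the description above rather than copied from the answer, and concat_files and the temporary file names are hypothetical.

```python
import itertools
import json
import os
import tempfile


class SerializableGenerator(list):
    """Wrap an iterable so the standard json module writes it as an array."""

    def __init__(self, iterable):
        super().__init__()
        body = iter(iterable)
        try:
            # Read the first item eagerly so emptiness is known right away.
            self._head = iter([next(body)])
            # Keep the iterator in the list: len() is 1 and repr() shows it.
            self.append(body)
        except StopIteration:
            self._head = iter([])

    def __iter__(self):
        # First the item read in __init__, then the rest of the iterator.
        return itertools.chain(self._head, *self[:1])


def concat_files(out_name, in_file_names):
    """Stream the items of many JSON list files into one JSON array."""

    def items():
        for in_name in in_file_names:
            with open(in_name) as f:
                # Only one input file is loaded at a time.
                yield from json.load(f)

    with open(out_name, "w") as f:
        # json.dump writes chunk by chunk, so the full array is never built.
        json.dump(SerializableGenerator(items()), f)


# Demo with throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, content in enumerate([[1, 2], [], [3, 4, 5]]):
        path = os.path.join(tmp, "part%d.json" % i)
        with open(path, "w") as f:
            json.dump(content, f)
        parts.append(path)
    out_path = os.path.join(tmp, "all.json")
    concat_files(out_path, parts)
    with open(out_path) as f:
        merged = json.load(f)

print(merged)  # [1, 2, 3, 4, 5]
```

Memory use then depends on the largest single input file, not on the combined size of all files.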