I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found the jsonlite package's stream_in/stream_out functions really helpful. With these functions, I can subset the data in chunks without loading it all, write the subset to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediate JSON file is getting truncated (if that's the right term) while being written with stream_out. I'll explain in more detail below.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in, and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems until I try to read the file back with myData <- stream_in(file(tmp)), at which point I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line, after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print call to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediate data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
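On the connection clue above: as far as I can tell from the jsonlite documentation, stream_in and stream_out only open and close a connection automatically when it was not already open. A connection opened explicitly with file(..., open = "wb") stays open until close() is called, so buffered output may not have reached the disk when the file is read back. A minimal sketch of the same pipeline with an explicit close(con_out), run on hypothetical mock data (the src file and the mock data.frame are assumptions for illustration, not the real input):

```r
library(jsonlite)

# Hypothetical mock input: 1000 ndjson records with a numeric Var column.
src <- tempfile(fileext = ".json")
mock <- data.frame(Var = rep(c(-1, 1), 500), Var2 = 1:1000)
con_src <- file(src, open = "wb")
stream_out(mock, con_src, verbose = FALSE)
close(con_src)

# Same pipeline as in the question, with a smaller pagesize for the mock data.
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file(src), handler = function(df) {
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000, verbose = FALSE)
}, pagesize = 200, verbose = FALSE)

# con_out was opened explicitly, so it has to be closed explicitly;
# closing flushes any buffered lines to the temporary file.
close(con_out)

myData <- stream_in(file(tmp), verbose = FALSE)
```

Whether this is the cause of the truncation in the real script is an assumption; it is the one difference from the question's code that this sketch isolates.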
System info
I'm running R 3.3.3 x64 on Windows 7 x64. 6 GB of RAM, AMD Athlon II 4-Core 2.6 GHz.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want; the extra stream_out/stream_in round trip is not necessary.
(I created some mock data in Mockaroo: 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used was even IDs because I was too lazy to create a Var column.)
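A minimal sketch of that idea, assuming mock data with an id column: the handler filters each chunk and keeps it in memory, and the pieces are combined at the end, so no intermediate file is written. The names mock_path, mock, and chunks are hypothetical, and the mock data is generated in R here rather than with Mockaroo.

```r
library(jsonlite)

# Hypothetical mock data: 1000 records with an integer id, written as ndjson.
mock_path <- tempfile(fileext = ".json")
mock <- data.frame(id = 1:1000, name = paste0("row", 1:1000),
                   stringsAsFactors = FALSE)
con <- file(mock_path, open = "wb")
stream_out(mock, con, verbose = FALSE)
close(con)

# Collect the filtered chunks in a list instead of writing a second file.
chunks <- list()
stream_in(file(mock_path), handler = function(df) {
  # Keep even ids here; swap in the real filter, e.g. df$Var > 0.
  chunks[[length(chunks) + 1]] <<- df[which(df$id %% 2 == 0), ]
}, pagesize = 200, verbose = FALSE)

# Combine the filtered chunks into a single data.frame.
myData <- do.call(rbind, chunks)
```

This only works if the filtered subset fits in memory, which is already required for loading the smaller file as a data.frame.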