R: jsonlite's stream_out function producing incomplete/truncated JSON file

Posted 2019-08-24 06:53

Question:

I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data in chunks without loading the whole file, write the subsetted data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediate JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain in further detail.

What I'm attempting:

I have written my code like this (following an example from documentation):

library(jsonlite)

# open a writable connection to a temporary file
con_out <- file(tmp <- tempfile(), open = "wb")

# stream the big file in chunks; subset each chunk and write it to con_out
stream_in(file("C:/User/myFile.json"), handler = function(df){
      df <- df[which(df$Var > 0), ]
      stream_out(df, con_out, pagesize = 1000)
    }, pagesize = 5000)
# read the subsetted file back in
myData <- stream_in(file(tmp))

As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.

The problem

This procedure runs without any problems until I try to read the temporary file with myData <- stream_in(file(tmp)), at which point I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:

{"Var1":"some data","Var2":3,"Var3":"some othe

I then have to manually remove that last line, after which the file loads without issue.

Solutions I've tried

  1. I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing? (I sketch what I mean just after this list.)

  2. I inserted a call to print tail(df) at every chunk inside the handler function to rule out problems with the intermediate data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows are getting truncated while being written to file (i.e., they're simply not being written). Notice that it's the very end of the entire output (after stream_out has written everything) that is getting chopped.

  3. I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.

  4. I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming, and it is in minified/ndjson (newline-delimited JSON) format.
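To illustrate the connection idea from point 1: since I open con_out myself, one thing that occurs to me (this is just a sketch of my guess, not something the documentation confirms) is that buffered output may not be flushed before I re-read the file, so closing the connection explicitly might help:

con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
      df <- df[which(df$Var > 0), ]
      stream_out(df, con_out, pagesize = 1000)
    }, pagesize = 5000)

# explicitly close (and flush) the output connection before reading the file back
close(con_out)

myData <- stream_in(file(tmp))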

System info

I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II quad-core at 2.6 GHz.

Workaround

I can still deal with this issue by manually opening the JSON files and correcting them, but it leads to some data loss and prevents my script from being automated, which is an inconvenience as I have to run it repeatedly throughout my project.
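The closest I've come to automating the cleanup is something along these lines (an untested sketch; it assumes the truncation only ever affects the final line, that the subsetted file is small enough to pass through readLines(), and it only automates the trimming, it doesn't recover the lost rows):

# read the intermediate file, drop an incomplete trailing record, rewrite it
lines <- readLines(tmp, warn = FALSE)
# a complete ndjson record should end with a closing brace
if (!grepl("\\}\\s*$", lines[length(lines)]))
  lines <- lines[-length(lines)]
writeLines(lines, tmp)

myData <- stream_in(file(tmp))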

I really appreciate any help with this; thank you.

Answer 1:

I believe this does what you want; it is not necessary to do the extra stream_out/stream_in round trip.

library(jsonlite)
library(dplyr)

# accumulate the filtered chunks in an environment instead of writing them out
myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000

# combine the accumulated chunks into a single data.frame
myData <- myData %>% as.list() %>% bind_rows()

(I created some mock data in Mockaroo: I generated 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used keeps even IDs because I was too lazy to create a Var column.)
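Collecting the filtered chunks in an environment (or a list) and calling bind_rows() once at the end also avoids repeatedly copying a growing data.frame inside the handler, and it sidesteps the temporary file, and whatever is truncating it, entirely.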