Problems reading JSON file in R

2019-07-06 21:45发布

问题:

I have a JSON file (an export from mongoDB) that I'd like to load into R. The document is about 890 MB in size with roughly 63,000 rows of 12 fields. The fields are numeric, character and date. I'd like to end up with a 63000 x 12 data frame.

lines <-  readLines("fb2013.json")

result: jFile has all 63,000 elements in char class and all fields are lumped into one field.

Each file looks something like this:

"{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }"

Using rjson,

jFile <- fromJSON(paste(readLines("fb2013.json"), collapse=""))

only the first row is read into jFile but there are 12 fields.

Using RJSONIO:

jFile <- fromJSON(lines)

results in the following:

Warning messages:
1: In if (is.na(encoding)) return(0L) :
  the condition has length > 1 and only the first element will be used

Again, only the first row is read into jFile and there are 12 fields.

The output from rjson and RJSONIO looks something like this:

$`_id`
[1] "1018535"

$comments_count
[1] 0

$created_at
       $date 
1.357027e+12 

$icon
[1] "http://blah.gif"

$likes_count
[1] 20

$link
[1] "http://www.chachacha"

$message
[1] "I'd love to figure this out."

$page_category
[1] "Internet/software"

$page_id
[1] "3924395872345878534"

$page_name
[1] "Not Entirely Hopeless"

$type
[1] "photo"

$updated_at
       $date 
1.357027e+12 

回答1:

try

library(rjson)
path <- "WHERE/YOUR/JSON/IS/SAVED"
c <- file(path, "r")
l <- readLines(c, -1L)
json <- lapply(X=l, fromJSON)


回答2:

Since you want a data.frame, try this:

# three copies of your sample...
line.1<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }" 
line.2<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }" 
line.3<- "{ \"_id\" : \"10151271769737669\", \"comments_count\" : 36, \"created_at\" : { \"$date\" : 1357941938000 }, \"icon\" : \"http://blahblah.gif\", \"likes_count\" : 450, \"link\" : \"http://www.blahblahblah.php\", \"message\" : \"I wish I could figure this out!\", \"page_category\" : \"Computers\", \"page_id\" : \"30968999999\", \"page_name\" : \"NothingButTrouble\", \"type\" : \"photo\", \"updated_at\" : { \"$date\" : 1358210153000 } }" 
x <- paste(line.1, line.2, line.3, sep="\n")
lines <-  readLines(textConnection(x))
library(rjson)

# this is the important bit
df <- data.frame(do.call(rbind,lapply(lines,fromJSON)))
ncol(df)
# [1] 12

# finally, there's some cleaning up to do...
df$created_at
# [[1]]
# [[1]]$`$date`
# [1] 1.357942e+12
# ...
df$created_at <- as.POSIXlt(unname(unlist(df$created_at)/1000),origin="1970-01-01")
df$created_at
# [1] "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST" "2013-01-11 17:05:38 EST"

df$updated_at <- as.POSIXlt(unname(unlist(df$updated_at)/1000),origin="1970-01-01")

Note that this conversion assumes that the dates were stored as milliseconds since the epoch.