I am trying to load a JSON file into a data.frame in R. I have had some luck with the fromJSON function in the jsonlite package, but I am getting nested lists and am not sure how to flatten the input into a two-dimensional data.frame. jsonlite reads the file in as a data.frame, but leaves nested lists in some of the variables.
Does anyone have any tips on loading a JSON file into a data.frame when it reads in with nested lists?
#*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*# HERE IS MY EXAMPLE #*#*#*#*#*#*#*#*#*##*#*#*#*#*#*#*#*#*#
# loads the packages
library("httr")
library( "jsonlite")
# downloads an example file
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
# the flatten function breaks the name variable into three vars ( first name, middle name, last name)
providers <- flatten( providers )
# but many of the columns are still lists:
sapply( providers , class)
# Some of these lists have a single level
head( providers$facility_type )
# Some have a lot more than two - for example, nine
providers[ , 6][[1]]
I want one row per npi, and then separate columns for each of the slices of the individual lists - so that the data frame has cols for "plan_id_type", "plan_id", "network_tier" nine times, maybe with colnames from 0 to 8. I have been able to use this site: http://www.convertcsv.com/json-to-csv.htm to get this file in two dimensions, but since I am doing hundreds of these I would love to be able to do it dynamically. This is the file: http://s000.tinyupload.com/download.php?file_id=10808537503095762868&t=1080853750309576286812811 - I would like to get a file with this structure loaded as a data.frame using the fromJSON function.
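Just to illustrate the shape I am after (these column names are only an example):

# purely illustrative - one row per npi, with the list slices spread into columns
desired_cols <- c( "npi",
                   paste0( "plan_id_type_", 0:8 ),
                   paste0( "plan_id_", 0:8 ),
                   paste0( "network_tier_", 0:8 ) )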
Here are a few of the things I have tried. I have thought of two approaches. First: use a different function to read in the JSON file. I have looked at
rjson, but that reads the file in as a list:
library( rjson )
library( RCurl )   # getURL() comes from RCurl
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
class( providers )
and I have tried RJSONIO - I tried the approach from "Getting imported json data into a data frame in R":
library( RJSONIO )
providers <- fromJSON( getURL( "https://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json") )
json_file <- lapply(providers, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
# but when converting the lists to a data.frame I get an error:
a <- do.call("rbind", json_file)
So, the second approach I have tried is to convert all the lists into variables in my data.frame
detach("package:RJSONIO", unload = TRUE )
detach("package:rjson", unload = TRUE )
library( "jsonlite")
providers <- fromJSON( "http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json" , simplifyDataFrame=TRUE )
providers <- flatten( providers )
I am able to pull out one of the lists - but because of missing values I can't merge it back onto my data.frame:
a <- data.frame(Reduce(rbind, providers$facility_type))
length( a ) == nrow( providers )
I also tried these suggestions: "Converting nested list to dataframe", as well as some other stuff, but haven't had any luck:
a <- sapply( providers$facility_type, unlist )
as.data.frame(t(sapply( providers$providers, unlist )) )
Any help much appreciated
Update: 21 February 2016

col_fixer updated to include a vec2col argument that lets you flatten a list column into either a single string or a set of columns.

In the data.frame you've downloaded, I see several different column types. There are normal columns comprising vectors of the same type. There are list columns where the items may be NULL or may themselves be a flat vector. There are list columns where there are data.frames as the list elements. There are list columns that contain a data.frame of the same number of rows as the main data.frame. Here's a sample dataset that recreates those conditions:
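For example (the values here are made up; this is a stand-in rather than the original sample):

# a normal data.frame with plain vector columns
sample_df <- data.frame( id = 1:3, npi = c("a", "b", "c"), stringsAsFactors = FALSE )
# a data.frame column with the same number of rows as the main data.frame
sample_df$person <- data.frame( first = c("Al", "Bo", "Cy"),
                                last  = c("Ax", "Bx", "Cx"),
                                stringsAsFactors = FALSE )
# a list column whose items are NULL or flat vectors
sample_df$facility_type <- list( NULL, "FACILITY", c("FACILITY", "INDIVIDUAL") )
# a list column whose items are data.frames with varying numbers of rows
sample_df$plans <- list(
  data.frame( plan_id = "p1", network_tier = "T1", stringsAsFactors = FALSE ),
  data.frame( plan_id = c("p2", "p3"), network_tier = c("T1", "T2"), stringsAsFactors = FALSE ),
  data.frame( plan_id = "p4", network_tier = "T3", stringsAsFactors = FALSE ) )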
The str of this sample data.frame looks like:

One way you can "flatten" this is to "fix" the list columns. There are three fixes:
- flatten (from "jsonlite") will take care of columns like the "person" column.
- toString, which would convert each element to a comma separated item, or which can be converted into multiple columns.
- data.frames, some with multiple rows, first need to be flattened into a single row (by transforming to a "wide" format) and then need to be bound together as a single data.table. (I'm using "data.table" for reshaping and for binding the rows together.)

We can take care of the second and third points with a function like the following:
We'll integrate that and the flatten function in another function that would do most of the processing. Running the function gives us:
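A wrapper in that spirit might look like this (again only a sketch), called on the sample data from above:

Flattener <- function(indf, vec2col = FALSE) {
  indf <- jsonlite::flatten(indf)                    # expands data.frame columns like "person"
  listcols <- vapply(indf, is.list, logical(1L))     # which columns are still lists
  newcols <- do.call(cbind, lapply(indf[listcols], col_fixer, vec2col))
  cbind(indf[!listcols], newcols)
}

Flattener(sample_df)                    # vectors collapsed to comma-separated strings
Flattener(sample_df, vec2col = TRUE)    # vectors spread into separate columns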
Or, with the vectors going into separate columns:
Here's the str:

On your "providers" object, this runs very quickly and consistently:
So this isn't really eligible as a solution since it doesn't directly answer the question, but here is how I would analyze this data.
First, I had to understand your data set. It appears to be information about health providers.
- FACILITY entries have the "ID" fields facility_name and facility_type.
- INDIVIDUAL entries have the "ID" fields name, speciality, accepting, languages, and gender.
- All entries have npi and last_updated_on.
- All entries also have addresses and plans. For example, addresses is a list that contains city, state, etc.

Since there are multiple addresses for each npi, I'd prefer to convert them to a data frame with columns for the city, state, etc. I'll also make a similar data frame for the plans. Then I'll join the addresses and plans into a single data frame. Hence, if there are 4 addresses and 8 plans, there will be 4*8 = 32 rows in the joined data frame. Finally, I'll tack on a similarly denormalized data frame with "ID" information using another merge. Then do some cleanup.
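Roughly like this (the helper and table names are only for illustration):

library(jsonlite)
library(dplyr)

providers <- fromJSON("http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json")

# one row per npi x address, and one row per npi x plan
unpack <- function(npi, x) if (is.data.frame(x) && nrow(x)) cbind(npi = npi, x, stringsAsFactors = FALSE) else NULL
addresses <- bind_rows(Map(unpack, providers$npi, providers$addresses))
plans     <- bind_rows(Map(unpack, providers$npi, providers$plans))

# denormalized "ID" information, one row per npi
ids <- jsonlite::flatten(providers[setdiff(names(providers), c("addresses", "plans"))])
ids[] <- lapply(ids, function(x) if (is.list(x)) vapply(x, toString, character(1L)) else x)

# 4 addresses and 8 plans for one npi -> 4 * 8 = 32 rows for that npi
joined <- merge(merge(addresses, plans, by = "npi"), ids, by = "npi")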
And now you can ask some interesting questions. For example, how many addresses does each health care provider have?
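With the addresses table from the sketch above, that's roughly:

count( addresses, npi, sort = TRUE )   # number of address rows per provider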
At addresses with more than five people, what is the percent of male healthcare providers?
And on and on...
My first step was to load the data via RCurl::getURL() and rjson::fromJSON(), as per your second code sample:

Next, to get a deep understanding of the structure and cleanness of the data, I wrote a set of helper functions:
The key to understanding the above is the keyList parameter. Let's say you have a list like this:

That would select all city strings underneath the second and third address elements underneath the addresses list underneath all elements of the main list.
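As a made-up illustration of that kind of structure and selection (this is not the actual data, and the exact keyList syntax may differ):

# a made-up nested list shaped like the providers data
demo <- list(
  list(npi = "1", addresses = list(list(city = "New York"), list(city = "Boston"), list(city = "Chicago"))),
  list(npi = "2", addresses = list(list(city = "Austin"),   list(city = "Dallas"), list(city = "Houston")))
)
# a keyList along the lines of list(NULL, "addresses", 2:3, "city") -- i.e.
# "all elements -> addresses -> elements 2 and 3 -> city" -- would pick out
# "Boston", "Chicago", "Dallas", and "Houston".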
There are no built-in apply functions in R that can operate on such "parallel" node selections (rapply() is close, but no cigar), which is why I wrote my own. levelApply() finds each of the matching nodes and runs the given func() on it (default identity(), thus returning the node itself), returning the results to the caller, either joined as per joinFunc(), or in the same recursive list structure in which those nodes existed in the input list. Quick demo:

Here are the remaining helper functions I wrote in the process of working on this problem:
I've tried to capture the sequence of commands I ran against the data as I first examined it. Below are the results, showing the commands I ran, the command output, and leading comments describing what my intention was, and my conclusion from the output:
Here's my summary of the data:
- addresses is a list of variable length, plans is a list always of length 9, and name is a hash.
- Each addresses list element is a hash with 5 or 6 keys to scalar strings, address_2 being the inconsistent one.
- Each plans list element is a hash with 3 keys to scalar strings, no inconsistencies.
- The name hash has first and last but not always middle scalar strings.

The most important observation here is that there are no type-inconsistencies between parallel nodes (aside from omissions and length differences). That means we can combine all parallel nodes into vectors with no considerations of type coercion. We can flatten all the data into a two-dimensional structure provided we associate columns with deep-enough nodes, such that all columns correspond to a single scalar string node in the input list.
Below is my solution. Note that it depends on the helper functions tl(), keyListToStr(), and mkcsv() I defined earlier.

The extractLevelColumns() function traverses the input list and extracts all node values at each leaf node position, combining them into a vector with NA where the value was missing, and then transforming to a one-column data.frame. The column name is set immediately, leveraging a parameterized mkname() function to define the stringification of the keyList to the string column name. Multiple columns are returned as a list of data.frames from each recursive call and likewise from the top-level call.

It also validates that there are no type-inconsistencies between parallel nodes. Although I manually verified the consistency of the data earlier, I tried to write as generic and reusable a solution as possible, because it's always a good idea to do so, so this validation step is appropriate.
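To give a feel for the recursive idea, here is a much-reduced sketch (not the actual extractLevelColumns()/flattenList(); it drops the naming helpers, multi-element leaves, and type validation, and operates on the plain nested list that rjson::fromJSON() returns):

extractColsSketch <- function(nodes, name = "x") {
  # `nodes` holds one "parallel" node per record
  leaf <- all(vapply(nodes, function(n) is.null(n) || !is.list(n), logical(1L)))
  if (leaf) {
    # leaf position: one column, NA where the value is missing
    vals <- vapply(nodes,
                   function(n) if (length(n) == 0L) NA_character_ else paste(n, collapse = ","),
                   character(1L))
    col <- data.frame(vals, stringsAsFactors = FALSE)
    names(col) <- name
    return(list(col))
  }
  # interior position: recurse over the union of child keys at this level
  keys <- unique(unlist(lapply(nodes, names)))
  if (is.null(keys)) keys <- seq_len(max(lengths(nodes)))
  unlist(lapply(keys, function(k) {
    children <- lapply(nodes, function(n) {
      if (!is.list(n)) return(NULL)
      if (is.character(k) && !k %in% names(n)) return(NULL)
      if (is.numeric(k) && k > length(n)) return(NULL)
      n[[k]]
    })
    extractColsSketch(children, paste(name, k, sep = "."))
  }), recursive = FALSE)
}

flattenListSketch <- function(lst) do.call(cbind, extractColsSketch(lst))

# e.g.
recs <- list(list(npi = "1", name = list(first = "A", last = "B")),
             list(npi = "2", name = list(first = "C")))
flattenListSketch(recs)   # columns x.npi, x.name.first, x.name.last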
flattenList() is the primary interface function; it simply calls extractLevelColumns() and then do.call(cbind,...) to combine the columns into a single data.frame.

An advantage of this solution is that it's entirely generic; it can handle an unlimited number of depth levels, by virtue of being fully recursive. Additionally, it has no package dependencies, parameterizes the column name building logic, and forwards variadic arguments to data.frame(), so for example you can pass stringsAsFactors=F to inhibit the automatic factorization of character columns normally done by data.frame(), and/or row.names={namevector} to set the row names of the resulting data.frame, or row.names=NULL to prevent the use of the top-level list component names as row names, if such existed in the input list.

I've also added a sep parameter which defaults to NULL. If NULL, multi-element leaf nodes will be separated into multiple columns, one per element, with an index suffix on the column name for differentiation. Otherwise, it's taken as a string separator on which to join all elements to a single string, and only one column is generated for the node.

In terms of performance, it's very fast. Here's a demo:
Result:
The resulting data.frame is quite wide, but we can use rowToFrame() and npiToFrame() to get a good vertical layout of one row at a time. For example, here's the first row:

I've tested the result pretty thoroughly by doing many spot-checks on individual records, and it all looks correct. Let me know if you have any questions.
This answer is rather a data organization suggestion (and is much shorter than the bounty-attracting answers around;)
If you want to keep the semantics of the fields, like keeping all plan_ids in a single column, you can normalize your data design a bit and do joins afterwards if you need the information together:

Then you can first filter on the data and join in other information afterwards:
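For example, something like this (table names and the filter value are only illustrative), starting from the jsonlite-parsed providers data.frame from the question:

library(jsonlite)
library(dplyr)
library(tidyr)

providers <- fromJSON("http://fm.formularynavigator.com/jsonFiles/publish/11/47/providers.json")

# normalized tables, each keyed by npi
plans_tbl     <- providers %>% select(npi, plans)     %>% unnest(plans)
addresses_tbl <- providers %>% select(npi, addresses) %>% unnest(addresses)

# filter first, then join in the other information only where needed
plans_tbl %>%
  filter(network_tier == "PREFERRED") %>%        # illustrative filter value
  inner_join(addresses_tbl, by = "npi")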