I have the following data structure in R:
df <- structure(
list(
ID = c(1L, 2L, 3L, 4L, 5L),
var1 = c('a', 'b', 'c', 'd', 'e'),
var2 = structure(
list(
var2a = c('v', 'w', 'x', 'y', 'z'),
var2b = c('vv', 'ww', 'xx', 'yy', 'zz')),
.Names = c('var2a', 'var2b'),
row.names = c(NA, 5L),
class = 'data.frame'),
var3 = c('aa', 'bb', 'cc', 'dd', 'ee')),
.Names = c('ID', 'var1', 'var2', 'var3'),
row.names = c(NA, 5L),
class = 'data.frame')
# Looks like this:
# ID var1 var2.var2a var2.var2b var3
# 1 1 a v vv aa
# 2 2 b w ww bb
# 3 3 c x xx cc
# 4 4 d y yy dd
# 5 5 e z zz ee
This looks like a normal data frame, and it behaves like one for the most part, but see the length and class properties of the columns below:
class(df)
# [1] "data.frame"
df[1,]
# ID var1 var2.var2a var2.var2b var3
# 1 1 a v vv aa
dim(df)
# [1] 5 4
# One less than expected due to embedded data frame
lapply(df, class)
# $ID
# [1] "integer"
#
# $var1
# [1] "character"
#
# $var2
# [1] "data.frame"
#
# $var3
# [1] "character"
lapply(df, length)
# $ID
# [1] 5
#
# $var1
# [1] 5
#
# $var2
# [1] 2
#
# $var3
# [1] 5
str(df)
# 'data.frame': 5 obs. of 4 variables:
# $ ID : int 1 2 3 4 5
# $ var1: chr "a" "b" "c" "d" ...
# $ var2:'data.frame': 5 obs. of 2 variables:
# ..$ var2a: chr "v" "w" "x" "y" ...
# ..$ var2b: chr "vv" "ww" "xx" "yy" ...
# $ var3: chr "aa" "bb" "cc" "dd" ...
My questions:
1) What is this?
I've never come across this before. Is it a common format for some of you out there? What are potential use cases?
2) What is this called?
I called this "embedded" for lack of a better word. Somebody suggested "nested", but I don't think that's right; see the separate section on tidyverse tibbles below.
3) Why is it allowed?
I would have expected the structure command above to fail, because I thought that data.frames are essentially lists in which each element (column) has the same number of elements (rows). This rule seems to be violated in this example, as var2 has length = 2 (the number of columns!). Yet subsetting df surprisingly succeeds in the usual way:
df[3,]
# ID var1 var2.var2a var2.var2b var3
# 3 3 c x xx cc
What's going on?
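The mismatch described above can be reproduced in a few lines. This is only a minimal sketch in base R: it builds a smaller df by assigning a data frame with `$<-` (an assumption about one way to get such a column, not necessarily how the original object was created) and compares length() with the row count of the var2 column.

```r
# Build a small data frame, then assign a whole data frame as ONE column.
df <- data.frame(ID = 1:5, var1 = letters[1:5], stringsAsFactors = FALSE)
df$var2 <- data.frame(var2a = c("v", "w", "x", "y", "z"),
                      var2b = c("vv", "ww", "xx", "yy", "zz"),
                      stringsAsFactors = FALSE)

# length() of a data frame counts its columns, so var2 reports 2 ...
length(df$var2)
# [1] 2

# ... while its number of rows matches the outer data frame.
NROW(df$var2)
# [1] 5
nrow(df)
# [1] 5

# The outer data frame counts var2 as a single column.
dim(df)
# [1] 5 3
```

So the "length" of var2 and the number of rows it contributes are two different quantities here.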
I don't think I can call it a "nested" structure, because that terminology is used for nested data.frames, which would look and behave like this:
library(tidyverse)
df <- data_frame(
x = c(1L, 2L, 3L),
nested = list(data_frame(x = c('a', 'b', 'c')),
data_frame(x = c('a', 'b', 'c')),
data_frame(x = c('d', 'e', 'f'))))
unnest(df)
# # A tibble: 9 × 2
# x x
# <int> <chr>
# 1 1 a
# 2 1 b
# 3 1 c
# 4 2 a
# 5 2 b
# 6 2 c
# 7 3 d
# 8 3 e
# 9 3 f
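To make the contrast with the tidyverse example above concrete, here is a rough sketch in base R (no tidyverse required). The object names embedded and nested are hypothetical and chosen only for illustration: a data-frame column is a single data frame that spans all rows, whereas a list column holds one object per row.

```r
# Data-frame column: ONE data frame whose rows line up with the outer rows.
embedded <- data.frame(ID = 1:3)
embedded$inner <- data.frame(a = c("x", "y", "z"), stringsAsFactors = FALSE)

# List column: a plain list holding one data frame PER outer row.
nested <- data.frame(ID = 1:3)
nested$inner <- list(data.frame(a = c("a", "b", "c")),
                     data.frame(a = c("a", "b", "c")),
                     data.frame(a = c("d", "e", "f")))

class(embedded$inner)
# [1] "data.frame"
class(nested$inner)
# [1] "list"

# The list column has one element per outer row; each element is a data frame.
length(nested$inner)
# [1] 3
class(nested$inner[[1]])
# [1] "data.frame"
```

In the first case the inner rows are the outer rows; in the second, each outer row carries its own independent table, which is what unnest() expands.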
I think the structure makes it pretty clear: it's a data.frame with a column (var2) that contains a data.frame. This isn't super easy to create, so I'm not quite sure how you did it, but it isn't technically "illegal" in R. data.frames can contain matrices and other data.frames, so R doesn't just look at the length() of the elements; it looks at the dim() of the elements to see whether each has the right number of "rows".
I often "fix" or expand these data.frames using