I have the following data structure in R:

df <- structure(
  list(
    ID = c(1L, 2L, 3L, 4L, 5L),
    var1 = c('a', 'b', 'c', 'd', 'e'),
    var2 = structure(
      list(
        var2a = c('v', 'w', 'x', 'y', 'z'),
        var2b = c('vv', 'ww', 'xx', 'yy', 'zz')),
      .Names = c('var2a', 'var2b'),
      row.names = c(NA, 5L),
      class = 'data.frame'),
    var3 = c('aa', 'bb', 'cc', 'dd', 'ee')),
  .Names = c('ID', 'var1', 'var2', 'var3'),
  row.names = c(NA, 5L),
  class = 'data.frame')

# Looks like this:
#   ID var1 var2.var2a var2.var2b var3
# 1  1    a          v         vv   aa
# 2  2    b          w         ww   bb
# 3  3    c          x         xx   cc
# 4  4    d          y         yy   dd
# 5  5    e          z         zz   ee

This looks like a normal data frame and mostly behaves like one, but look at the class and length of its columns below:

class(df)
# [1] "data.frame"

df[1,]
#   ID var1 var2.var2a var2.var2b var3
# 1  1    a          v         vv   aa

dim(df)
# [1] 5 4
# One column fewer than the printout suggests, because the embedded data frame counts as a single column

lapply(df, class)
# $ID
# [1] "integer"
# 
# $var1
# [1] "character"
# 
# $var2
# [1] "data.frame"
# 
# $var3
# [1] "character"

lapply(df, length)
# $ID
# [1] 5
#
# $var1
# [1] 5
#
# $var2
# [1] 2
#
# $var3
# [1] 5

str(df)
# 'data.frame':   5 obs. of  4 variables:
#  $ ID  : int  1 2 3 4 5
#  $ var1: chr  "a" "b" "c" "d" ...
#  $ var2:'data.frame':   5 obs. of  2 variables:
#   ..$ var2a: chr  "v" "w" "x" "y" ...
#   ..$ var2b: chr  "vv" "ww" "xx" "yy" ...
#  $ var3: chr  "aa" "bb" "cc" "dd" ...

My questions:

1) What is this?

I've never come across this before. Is it a common format for some of you out there? What are potential use cases?

2) What is this called?

I called this "embedded" for lack of a better word. Somebody suggested "nested", but I don't think that's right; see the separate section on tidyverse tibbles below.

3) Why is it allowed?

I would have expected the structure command above to fail, because I thought that data.frames are essentially lists in which each element (column) has the same number of elements (rows). This rule seems to be violated in this example, as var2 has length = 2 (its number of columns!). Yet, surprisingly, subsetting df succeeds in the usual way:

df[3,]
#   ID var1 var2.var2a var2.var2b var3
# 3  3    c          x         xx   cc

What's going on?

I don't think I could call it a "nested" structure. That terminology is used for nested data.frames, which would look and behave like this:

library(tidyverse)
df <- data_frame(
  x = c(1L, 2L, 3L),
  nested = list(data_frame(x = c('a', 'b', 'c')), 
                data_frame(x = c('a', 'b', 'c')), 
                data_frame(x = c('d', 'e', 'f'))))
unnest(df)
# # A tibble: 9 × 2
#       x     x
#   <int> <chr>
# 1     1     a
# 2     1     b
# 3     1     c
# 4     2     a
# 5     2     b
# 6     2     c
# 7     3     d
# 8     3     e
# 9     3     f

Tags: r dataframe nested

1 Answer

劳资没心，怎么记你

Post #2 · 2019-03-25 01:46

I think the structure makes it pretty clear:

str(df)
# 'data.frame':   5 obs. of  4 variables:
#  $ ID  : int  1 2 3 4 5
#  $ var1: chr  "a" "b" "c" "d" ...
#  $ var2:'data.frame':   5 obs. of  2 variables:
#   ..$ var2a: chr  "v" "w" "x" "y" ...
#   ..$ var2b: chr  "vv" "ww" "xx" "yy" ...
#  $ var3: chr  "aa" "bb" "cc" "dd" ...

It's a data.frame with a column (var2) that contains a data.frame. This isn't super easy to create, so I'm not quite sure how you did it, but it isn't technically "illegal" in R.

data.frames can contain matrices and other data.frames as columns. So R doesn't just look at the length() of each element; it looks at the dim() of the element to see whether it has the right number of "rows".
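
To illustrate the point about rows versus length, here is a minimal base R sketch. The object and column names (parent, m, inner, bad) are made up for this example and do not come from the question; it uses `$<-` assignment as one way to get such a column, which may not be how the original object was built.

```r
# Hypothetical example objects; none of these names come from the question.
parent <- data.frame(ID = 1:3)

# A matrix with 3 rows can be stored as a single column.
parent$m <- matrix(1:6, nrow = 3)

# So can a data.frame with 3 rows.
parent$inner <- data.frame(a = c("x", "y", "z"), b = c("xx", "yy", "zz"))

dims <- dim(parent)                        # rows and columns of the outer frame
inner_len <- length(parent$inner)          # length() of a data.frame counts its columns
rows_per_col <- vapply(parent, NROW, integer(1))  # NROW() counts rows for each column

# A data.frame with the wrong number of rows (2 instead of 3) is rejected.
res <- tryCatch({
  parent$bad <- data.frame(a = 1:2)
  "accepted"
}, error = function(e) "rejected")
```

Here dims is 3 by 3, inner_len is 2, and every entry of rows_per_col is 3, which is why the embedded data.frame fits even though its length() differs from the other columns.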

I often "fix" or expand these data.frames using

fixed <- do.call("data.frame", df)
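
As a rough check of what that call does, the sketch below rebuilds a data.frame with the same shape as the one in the question (here via `$<-` rather than structure(), which is an assumption for brevity) and compares it before and after flattening.

```r
# Rebuild a data.frame shaped like the one in the question.
df <- data.frame(ID = 1:5, var1 = c("a", "b", "c", "d", "e"))
df$var2 <- data.frame(var2a = c("v", "w", "x", "y", "z"),
                      var2b = c("vv", "ww", "xx", "yy", "zz"))
df$var3 <- c("aa", "bb", "cc", "dd", "ee")

# Passing the columns as arguments to data.frame() expands the embedded
# data.frame into ordinary top-level columns.
fixed <- do.call("data.frame", df)

ncol(df)      # 4: var2 counts as one column
ncol(fixed)   # 5: var2 has been split into two columns
names(fixed)  # "ID" "var1" "var2.var2a" "var2.var2b" "var3"

# No column of the flattened result is itself a data.frame.
any_embedded <- any(vapply(fixed, is.data.frame, logical(1)))
```

After flattening, the column names match the header that print() showed for the original object, and dim() agrees with what is displayed.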


