Understanding the syntax for Column vs Row indexing

Posted 2019-03-03 21:07

I'm a bit confused about the filtering scheme for an R data frame.

For example, let's say we have the following data frame titled dframe:

> str(dframe)
'data.frame':   143 obs. of  3 variables:
 $ Year     : int  1999 2005 2007 2008 2009 2010 2005 2006 2007 2008 ...
 $ Name     : Factor w/ 18 levels "AADAM","AADEN",..: 1 1 2 2 2 2 3 3 3 3 ...
 $ Frequency: int  5 6 10 34 38 12 10 6 10 5 ...

Now if I want to filter dframe to the rows where Name is "AADAM", the proper filter is: dframe[dframe$Name=="AADAM",]

The part where I'm confused is why the comma doesn't come first. Why isn't it this: dframe[,dframe$Name=="AARUSH"]

4 answers
Ridiculous、
#2 · 2019-03-03 21:31

As others have said, the structure within brackets is row, then column.

One way I think about the syntax for selecting data from a data.frame with:

dframe[dframe$Name=="AADAM",]

is to think of a noun, then a verb where:

dframe[] is the noun. It is the object on which you want to perform an action

and

[dframe$Name=="AADAM",] is the verb. It is the action you want to perform.

I have a silly way of expressing this to myself, but it keeps things straight in my mind:

Hey, you! dframe! I am going to... ...in this case, select all of your rows in which Name is equal to AADAM!

By keeping the column portion of [dframe$Name=="AADAM",] blank you are saying you want to keep all columns.

Sometimes it can be a little difficult to remember that you have to write dframe both inside and outside the brackets.

As for exactly why row comes first and column comes second, I do not know, but row had to be either first or second.

dframe <- read.table(text = '
     Year Name Frequency
       1  ADAM     4
       3  BOB     10
       7  SALLY    5
       2  ADAM    12
       4  JIM      3
      12  ADAM     7
', header = TRUE, stringsAsFactors = TRUE)  # R >= 4.0 defaults to FALSE; TRUE gives the factor output shown below

dframe[,dframe$Name=="ADAM"]

# Error in `[.data.frame`(dframe, , dframe$Name == "ADAM") : 
#   undefined columns selected

dframe[dframe$Name=="ADAM",]

#   Year Name Frequency
# 1    1 ADAM         4
# 4    2 ADAM        12
# 6   12 ADAM         7

dframe[,'Name']

# [1] ADAM  BOB   SALLY ADAM  JIM   ADAM 
# Levels: ADAM BOB JIM SALLY


dframe[dframe$Name=="ADAM",'Name']

# [1] ADAM ADAM ADAM
# Levels: ADAM BOB JIM SALLY
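
The last two results print as factors rather than data frames because `[` simplifies a single-column selection to a plain vector by default. A minimal sketch of how drop = FALSE keeps the data-frame shape, assuming the six-row example is rebuilt with data.frame() (the names as_vector and as_frame are illustrative):

```r
# Rebuild the six-row example; stringsAsFactors = TRUE makes Name a factor.
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7),
  stringsAsFactors = TRUE
)

as_vector <- dframe[, "Name"]                # simplified to a factor
as_frame  <- dframe[, "Name", drop = FALSE]  # still a data frame with one column

is.factor(as_vector)      # TRUE
is.data.frame(as_frame)   # TRUE
dim(as_frame)             # 6 1
```

The same drop = FALSE argument works when rows are filtered at the same time, for example dframe[dframe$Name == "ADAM", "Name", drop = FALSE].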
不美不萌又怎样
#3 · 2019-03-03 21:36

As others have indicated, requesting a certain subset of a data frame requires the syntax [rows, columns]. Since dframe[has 143 rows, has 3 columns], any request for some part of dframe should be of the form

dframe[which of the 143 rows do I want?, which of the 3 columns do I want?].

Because dframe$Name is a vector of length 143, the comparison dframe$Name=='AADAM' is a vector of T/F values that also has length 143. So,

dframe[dframe$Name=='AADAM',]

is like saying

dframe[of the 143 rows I want these ones, I want all columns]

whereas

dframe[,dframe$Name=='AADAM']

generates an error because it's like saying

dframe[I want all rows, of the 143 columns I want these ones]

but dframe has only 3 columns, so a 143-element selector asks for columns that don't exist.
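
The length mismatch can be checked directly. A minimal sketch, assuming a six-row stand-in for the 143-row data (the values and the names keep, rows_ok and col_try are illustrative):

```r
# Small stand-in for dframe: 6 rows, 3 columns.
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7)
)

keep <- dframe$Name == "ADAM"   # one TRUE/FALSE per row

length(keep) == nrow(dframe)    # TRUE: fits the row position
length(keep) == ncol(dframe)    # FALSE: 6 values for 3 columns

rows_ok <- dframe[keep, ]       # row position: works

# Column position: wrapped in tryCatch so the script keeps running.
col_try <- tryCatch(dframe[, keep], error = function(e) e)
inherits(col_try, "error")      # TRUE
```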

On a side note, you may want to look into the subset() function if you're not already familiar with it. You could get the same result by writing subset(dframe, Name=='AADAM')
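
A minimal sketch comparing the two forms, assuming a six-row stand-in for dframe (the data and the names by_bracket, by_subset and small are illustrative):

```r
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7)
)

by_bracket <- dframe[dframe$Name == "ADAM", ]
by_subset  <- subset(dframe, Name == "ADAM")   # Name is looked up inside dframe

all.equal(by_bracket, by_subset)               # TRUE

# subset() can also pick columns through its select argument.
small <- subset(dframe, Name == "ADAM", select = c(Year, Frequency))
names(small)                                   # "Year" "Frequency"
```

One difference worth knowing: when the condition evaluates to NA for some row, subset() drops that row, whereas bracket indexing returns a row of NAs for it.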

家丑人穷心不美
#4 · 2019-03-03 21:39

UPDATE: You clarified your question is really "Please give examples of what sort of logical expressions are valid for filtering columns?"

I agree with you the syntax appears weird initially, but it has the following logic.

The bottom line is that column-filter expressions are typically less rich and expressive than row-filtering expressions, and in particular you can't chain logical indexing the way you do with rows.

The best way is to think of indexing expressions as having the general form:

dframe[<row-index-expression>,<col-index-expression>]

where either index-expression is optional, so you can supply just one. The comma is (crucially!) what disambiguates whether it's row- or column-indexing:

dframe[<row-index-expression>,] # such as dframe[dframe$Name=="ADAM",]

dframe[,<col-index-expression>]

Before we look at examples of col-index-expression and what's valid (and invalid) to include in one, let's review and discuss how R does indexing - I had the same confusion when I started with it.

In this example, you have three columns. You can refer to them by their string names 'Year','Name','Frequency'. You can also refer to them by column indices 1,2,3, where the numbers 1,2,3 correspond to the entries of colnames(dframe). R does indexing using the '[' operator, and also the '[[' operator. Here are some valid examples of column-indexing:

dframe[,2]       # column 2 / Name
dframe[,'Name']  # column 2 / Name
dframe[,c('Name','Frequency')]  # string vector - very common
dframe[,c(2,3)]                 # integer vector - also very common
dframe[,c(F,T,T)]               # logical vector - very rarely seen, and a pain in the butt to compute
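
A minimal sketch checking that these forms select the same thing, assuming a six-row stand-in for dframe (the data and the names by_number, by_name and by_logical are illustrative):

```r
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7)
)

by_number  <- dframe[, c(2, 3)]
by_name    <- dframe[, c("Name", "Frequency")]
by_logical <- dframe[, c(FALSE, TRUE, TRUE)]

all.equal(by_number, by_name)      # TRUE
all.equal(by_name, by_logical)     # TRUE

# For a single column, '[' with a comma, '[[' and '$' return the same vector.
identical(dframe[, 2], dframe[["Name"]])   # TRUE
identical(dframe[["Name"]], dframe$Name)   # TRUE
```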

Now, if you choose to use a logical expression for the column-index, it must produce one TRUE/FALSE per column, and it can't use bare column names: the index expression is evaluated outside the data frame, so it doesn't know the column names. Suppose you wanted to dynamically filter "give me only the factor columns from dframe". Something like:

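sapply(dframe, is.factor)                          # named logical vector, one entry per column (wrap in unname() to drop the names)
dframe[, sapply(dframe, is.factor), drop = FALSE]  # keep only the factor columns
# note: apply() would first coerce the data frame to a matrix, so is.factor would be FALSE for every column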
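
A related sketch of two other ways to compute a logical column index, one from column types and one from column names. It assumes a six-row stand-in for dframe; the data and the names is_num, numeric_cols, not_year and without_year are illustrative:

```r
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7)
)

# One TRUE/FALSE per column, based on each column's type.
is_num <- vapply(dframe, is.numeric, logical(1))
length(is_num) == ncol(dframe)                # TRUE: fits the column position
numeric_cols <- dframe[, is_num, drop = FALSE]
names(numeric_cols)                           # "Year" "Frequency"

# One TRUE/FALSE per column, based on the column names.
not_year <- names(dframe) != "Year"
without_year <- dframe[, not_year, drop = FALSE]
names(without_year)                           # "Name" "Frequency"
```

In both cases the logical vector has one entry per column, which is what makes it valid after the comma.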

For more help and examples on indexing look at the '[' operator help-page: Type ?'['

dframe[,dframe$Name=="ADAM"] is an invalid attempt at column-indexing because the columns know nothing about Name=="ADAM": that comparison yields one TRUE/FALSE per row, not one per column.

Addendum: code to generate an example data frame (because you didn't give us dput output)

set.seed(123)
N <- 10
# paste(), not cat(): cat() only prints and returns NULL, so it can't build a column
randomName <- function() paste(sample(letters, size = sample(2:7, 1), replace = TRUE), collapse = '')
dframe <- data.frame(Year = round(runif(N, 1980, 2014)),
                     Name = as.factor(replicate(N, randomName())),
                     Frequency = round(runif(N, 2, 40)))
叛逆
#5 · 2019-03-03 21:41

You have to remember that when you're subsetting, the part before the comma specifies which rows you want, and the part after the comma specifies which columns you want, i.e.:

dframe[rowsyouwant, columnsyouwant]

You're filtering based on the values in a column, but you want all of the columns in your result, so the space after the comma is blank. You want some subset of rows, so your filtering specification goes before the comma, where the rows you want are specified.
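
A minimal sketch of filling in both positions, assuming a six-row stand-in for dframe (the data and the names adam_rows, all_cols and some_cols are illustrative):

```r
dframe <- data.frame(
  Year      = c(1, 3, 7, 2, 4, 12),
  Name      = c("ADAM", "BOB", "SALLY", "ADAM", "JIM", "ADAM"),
  Frequency = c(4, 10, 5, 12, 3, 7)
)

adam_rows <- dframe$Name == "ADAM"                      # rows you want

all_cols  <- dframe[adam_rows, ]                        # blank after the comma: every column
some_cols <- dframe[adam_rows, c("Year", "Frequency")]  # rows and columns both restricted

dim(all_cols)    # 3 3
dim(some_cols)   # 3 2
```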
