How do you remove columns from a data.frame?

Posted 2020-01-26 04:34

Not so much 'How do you...?' but more 'How do YOU...?'

If someone gives you a file with 200 columns and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame with columns col1, col2 through col200. If you only wanted 1-100, 125-135 and 150-200, you could:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print out the names that way you have spaces everywhere and have to put in the commas manually... Is there a command that will give you "col1","col2","col3",... as its output so you can easily grab what you want?

Tags: r dataframe
11 answers
放我归山
#2 · 2020-01-26 05:03

From http://www.statmethods.net/management/subset.html

# exclude variables v1, v2, v3
myvars <- names(mydata) %in% c("v1", "v2", "v3") 
newdata <- mydata[!myvars]

# exclude 3rd and 5th variable 
newdata <- mydata[c(-3,-5)]

# delete variables v3 and v5
mydata$v3 <- mydata$v5 <- NULL

I thought it was really clever to make a list of what "not to include".
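One difference between these forms only shows up when a single column is left. The sketch below uses a hypothetical three-column data frame (the names `drop_me`, `as_list`, `as_matrix` and `kept` are made up for illustration) to compare list-style indexing, as in `mydata[!myvars]`, with the matrix-style `dat[, ...]` form from the question:

```r
# Hypothetical three-column data frame, for illustration only
mydata <- data.frame(v1 = 1:3, v2 = 4:6, v3 = 7:9)
drop_me <- names(mydata) %in% c("v1", "v2")    # leaves only v3

as_list   <- mydata[!drop_me]                  # list-style: still a data frame
as_matrix <- mydata[, !drop_me]                # matrix-style: collapses to a vector
kept      <- mydata[, !drop_me, drop = FALSE]  # matrix-style, but stays a data frame

is.data.frame(as_list)    # TRUE
is.data.frame(as_matrix)  # FALSE
is.data.frame(kept)       # TRUE
```

So if the number of remaining columns is not known in advance, the list-style form or an explicit `drop = FALSE` is probably the safer habit.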

#3 · 2020-01-26 05:05

To delete single columns, I'll just use dat$x <- NULL.

To delete multiple columns, but fewer than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL.

For more than that, I'll use subset, with negative names (!):

subset(mtcars, , -c(mpg, cyl, disp, hp))
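Applied to the question's 200 columns, the same idea might look like the sketch below. The data frame is a made-up stand-in, and `trimmed` is a hypothetical name. Inside `subset()`, column names stand for their positions, so name ranges such as `col101:col124` can be negated as well:

```r
# Hypothetical stand-in for the 200-column file in the question
dat <- as.data.frame(matrix(0, nrow = 2, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Drop col101-col124 and col136-col149, keeping everything else
trimmed <- subset(dat, select = -c(col101:col124, col136:col149))

ncol(trimmed)  # 162
```

Note that `?subset` describes it as a convenience function for interactive use, so inside functions or packages plain `[` indexing is probably the better choice.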
女痞
#4 · 2020-01-26 05:09

For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done about regular expressions.

Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.
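If the filtering does have to happen inside R, a rough analogue of `cut -f1,3` is to skip columns at read time via `colClasses`. This is only a sketch on a small made-up CSV (the file contents and the name `slim` are assumptions for illustration), not a replacement for the shell workflow described above:

```r
# Write a small hypothetical CSV so the example is self-contained
path <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[1:3], c = c(0.5, 1.5, 2.5)),
          path, row.names = FALSE)

# "NULL" skips a column at read time; NA lets R guess the type as usual
slim <- read.csv(path, colClasses = c(NA, "NULL", NA))

names(slim)  # "a" "c"
unlink(path)
```

The skipped column is never stored in the resulting data frame, which is the point when the file is wide.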

(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than read it at all into RAM. I have done this with C, and it can be blisteringly fast.

霸刀☆藐视天下
#5 · 2020-01-26 05:11

The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.

In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:

library(dplyr)

df1 <- data.frame(first = 0, col1 = 1, col2 = 2, col3 = 3, col4 = 4)
df1 %>%
  select(num_range("col", c(1, 4)))
#>   col1 col4
#> 1    1    4

More generally you can use the minus sign in select() to drop columns, like:

mtcars %>%
   select(-mpg, -wt)

Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:

dput(names(mtcars))
#> c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
#> "gear", "carb")
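When the names follow a pattern, as in the question, the vector can also be generated instead of edited by hand. A minimal base R sketch, assuming a made-up 200-column data frame named `dat` (the names `keep`, `drop`, `trimmed` and `same` are hypothetical):

```r
# Hypothetical stand-in for the 200-column data frame from the question
dat <- as.data.frame(matrix(0, nrow = 2, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Build the names to keep from the numeric ranges
keep <- paste0("col", c(1:100, 125:135, 150:200))
trimmed <- dat[keep]

# Or name what to drop and take the complement
drop <- paste0("col", c(101:124, 136:149))
same <- dat[setdiff(names(dat), drop)]

identical(trimmed, same)  # TRUE
```

Selecting by generated names rather than by position keeps the code valid if the column order in the file changes.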
Explosion°爆炸
#6 · 2020-01-26 05:13

For clarity, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data that meets given criteria.

Something like:

> subset(mtcars, select = c("mpg", "cyl", "vs", "am"))
                     mpg cyl vs am
Mazda RX4           21.0   6  0  1
Mazda RX4 Wag       21.0   6  0  1
Datsun 710          22.8   4  1  1
....

I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.
