How do you remove columns from a data.frame?

Posted 2020-01-26 04:34

Not so much 'How do you...?' but more 'How do YOU...?'

If someone gives you a file with 200 columns and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer benefits over another?

Assume we have a data frame with columns col1, col2 through col200. If you only wanted 1-100, then 125-135 and 150-200, you could:

dat$col101 <- NULL
dat$col102 <- NULL # etc

or

dat <- dat[,c("col1","col2",...)]

or

dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this

or

dat <- dat[,!names(dat) %in% c("col101","col102",...)]

Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way and fall into a habit when there are far more efficient ways out there. Much like this question about which.

EDIT:

Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print out the names that way you have spaces everywhere and have to put in the commas manually... Is there a command that will give you "col1","col2","col3",... as its output so you can easily grab what you want?

Tags: r dataframe
11 answers
三岁会撩人
#2 · 2020-01-26 04:47

If you already have a vector of names, which there are several ways to create, you can easily use the subset function to keep or drop columns.

dat2 <- subset(dat, select = names(dat) %in% c(KEEP))

In this case KEEP is a vector of column names which is pre-created. For example:

#sample data via Brandon Bertelsen
df <- data.frame(a=rnorm(100),
                 b=rnorm(100),
                 c=rnorm(100),
                 d=rnorm(100),
                 e=rnorm(100),
                 f=rnorm(100),
                 g=rnorm(100))

#creating the initial vector of names (names() already returns a character vector)
df1 <- names(df)

#retaining only the name values you want to keep
KEEP <- df1[c(1:3,5,6)]

#subsetting the initial dataset with the object KEEP
df3 <- subset(df, select = names(df) %in% c(KEEP))

Which results in:

> head(df)
            a          b           c          d
1  1.05526388  0.6316023 -0.04230455 -0.1486299
2 -0.52584236  0.5596705  2.26831758  0.3871873
3  1.88565261  0.9727644  0.99708383  1.8495017
4 -0.58942525 -0.3874654  0.48173439  1.4137227
5 -0.03898588 -1.5297600  0.85594964  0.7353428
6  1.58860643 -1.6878690  0.79997390  1.1935813
            e           f           g
1 -1.42751190  0.09842343 -0.01543444
2 -0.62431091 -0.33265572 -0.15539472
3  1.15130591  0.37556903 -1.46640276
4 -1.28886526 -0.50547059 -2.20156926
5 -0.03915009 -1.38281923  0.60811360
6 -1.68024349 -1.18317733  0.42014397

> head(df3)
        a          b           c           e
1  1.05526388  0.6316023 -0.04230455 -1.42751190
2 -0.52584236  0.5596705  2.26831758 -0.62431091
3  1.88565261  0.9727644  0.99708383  1.15130591
4 -0.58942525 -0.3874654  0.48173439 -1.28886526
5 -0.03898588 -1.5297600  0.85594964 -0.03915009
6  1.58860643 -1.6878690  0.79997390 -1.68024349
            f
1  0.09842343
2 -0.33265572
3  0.37556903
4 -0.50547059
5 -1.38281923
6 -1.18317733
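
As a side note, subset() also accepts unquoted column names in select, including negated ones, so a separate KEEP vector is not strictly needed. A minimal sketch on a small made-up data frame (the names small, kept and dropped are only for illustration):

```r
# Small illustrative data frame (not the one from the answer above)
small <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Keep columns by naming them, unquoted
kept <- subset(small, select = c(a, c))

# Or drop columns by negating the ones you do not want
dropped <- subset(small, select = -c(b, d))

names(kept)
names(dropped)
```

Both calls should leave the same two columns. Note that subset() relies on non-standard evaluation, so inside functions the plain `[` forms are usually the safer choice.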
beautiful°
#3 · 2020-01-26 04:49

Use read.table and set the colClasses entries for the unwanted columns to "NULL", so they are never created in the first place:

## example data and temp file
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10], stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)


(y <- read.table(tmp, colClasses = c("numeric", rep("NULL", 2), "character"), header = TRUE))

    x a
1   1 a
2   2 b
3   3 c
4   4 d
5   5 e
6   6 f
7   7 g
8   8 h
9   9 i
10 10 j

unlink(tmp)
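
With 200 columns you probably would not want to type colClasses by hand. A possible sketch, assuming the file has a header row: read a single row to get the column names, then build colClasses from a vector of the names you want. The names wanted, hdr and cc are made up for this example.

```r
## example data and temp file, as above
x <- data.frame(x = 1:10, y = rnorm(10), z = runif(10), a = letters[1:10],
                stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(x, tmp, row.names = FALSE)

## columns we want to keep (hypothetical choice)
wanted <- c("x", "a")

## read only one data row to learn the column names
hdr <- names(read.table(tmp, header = TRUE, nrows = 1))

## NA lets read.table guess the type; "NULL" skips the column entirely
cc <- ifelse(hdr %in% wanted, NA, "NULL")

y <- read.table(tmp, header = TRUE, colClasses = cc)
names(y)

unlink(tmp)
```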
Juvenile、少年°
#4 · 2020-01-26 04:50

You can use the setdiff function:

If there are more columns to keep than to delete: Suppose you want to delete 2 columns say col1, col2 from a data.frame DT; you can do the following:

DT<-DT[,setdiff(names(DT),c("col1","col2"))]

If there are more columns to delete than to keep: Suppose you want to keep only col1 and col2:

DT<-DT[,c("col1","col2")]
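
Applied to the question's col1 ... col200 case, the names to drop can be generated with paste0() rather than typed out. This is only a sketch on made-up data; the name drop_names is for illustration, and drop = FALSE is added so that the result stays a data.frame even if a single column is left.

```r
# Made-up data frame with columns col1 ... col200
dat <- as.data.frame(matrix(0, nrow = 2, ncol = 200))
names(dat) <- paste0("col", 1:200)

# Columns to remove: everything outside 1-100, 125-135 and 150-200
drop_names <- paste0("col", c(101:124, 136:149))

# setdiff() keeps the original column order of names(dat)
kept <- dat[, setdiff(names(dat), drop_names), drop = FALSE]

ncol(kept)  # 100 + 11 + 51 = 162 columns remain
```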
虎瘦雄心在
#5 · 2020-01-26 04:51

Sometimes I like to do this using column ids instead.

df <- data.frame(a=rnorm(100),
b=rnorm(100),
c=rnorm(100),
d=rnorm(100),
e=rnorm(100),
f=rnorm(100),
g=rnorm(100)) 

as.data.frame(names(df))

  names(df)
1         a
2         b
3         c
4         d
5         e
6         f
7         g 

Removing columns "c" and "g":

df[,-c(3,7)]

This is especially useful if you have data.frames that are large or have long column names that you don't want to type. It also helps when the columns to remove follow a regular pattern, because then you can use seq() to generate the ids.
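
A rough sketch of both ideas on a small made-up data frame: seq() for a regular pattern of positions, and match() to look up ids from names so you do not have to count columns by eye. The object names here are only for illustration.

```r
ex <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15, f = 16:18)

# Drop every second column (positions 2, 4, 6)
odd_only <- ex[, -seq(2, ncol(ex), by = 2)]
names(odd_only)

# Look up the positions of named columns, then drop by id
drop_ids <- match(c("b", "e"), names(ex))
without_be <- ex[, -drop_ids]
names(without_be)
```

One caveat: if drop_ids ends up empty, ex[, -drop_ids] selects no columns at all instead of all of them, so it is worth checking its length first.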

RE: Your edit

You don't necessarily have to put "" around a string, nor "," to create a character vector. I find this little trick handy:

x <- unlist(strsplit(
'A
B
C
D
E',"\n"))
▲ chillily
#6 · 2020-01-26 04:53

Just addressing the edit.

@nzcoops, you do not need the column names in a comma delimited character vector. You are thinking about this the wrong way round. When you do

vec <- c("col1", "col2", "col3")

you are creating a character vector. The , just separates arguments taken by the c() function when you define that vector. names() and similar functions return a character vector of names.

> dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)
> dat
  col1 col2 col3
1    1    1    1
2    2    2    2
3    3    3    3
> names(dat)
[1] "col1" "col2" "col3"

It is far easier and less error prone to select from the elements of names(dat) than to process its output to a comma separated string you can cut and paste from.

Say we want columns col1 and col3: subset names(dat), retaining only the ones we want:

> names(dat)[c(1,3)]
[1] "col1" "col3"
> dat[, names(dat)[c(1,3)]]
  col1 col3
1    1    1
2    2    2
3    3    3

You can kind of do what you want, but R will always print the vector to the screen in quotes ("):

> paste('"', names(dat), '"', sep = "", collapse = ", ")
[1] "\"col1\", \"col2\", \"col3\""
> paste("'", names(dat), "'", sep = "", collapse = ", ")
[1] "'col1', 'col2', 'col3'"

so the latter may be more useful. However, now you have to cut and paste from that string. It is far better to work with objects that return what you want and use standard subsetting routines to keep what you need.
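
If you really do want something you can paste straight back into a script, base R's dput() writes an object as the code that would recreate it. A small sketch using the same dat as above:

```r
dat <- data.frame(col1 = 1:3, col2 = 1:3, col3 = 1:3)

# Prints the names as a ready-to-paste c(...) call
dput(names(dat))
# c("col1", "col2", "col3")
```

This is still a copy-and-paste workflow, so the point above stands: subsetting names(dat) directly is usually less error prone.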

一夜七次
#7 · 2020-01-26 05:02

I use data.table's := operator to delete columns instantly regardless of the size of the table.

DT[, coltodelete := NULL]

or

DT[, c("col1","col20") := NULL]

or

DT[, (125:135) := NULL]

or

DT[, (variableHoldingNamesOrNumbers) := NULL]

Any solution using <- or subset will copy the whole table. data.table's := operator merely modifies the internal vector of pointers to the columns, in place. That operation is therefore (almost) instant.
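
A minimal runnable sketch of the variable form, assuming the data.table package is installed. The table and the name drop_cols are made up for illustration. Note that the result is not assigned back to DT, because := changes the table by reference.

```r
library(data.table)

# Small made-up table
DT <- data.table(col1 = 1:3, col2 = 4:6, col3 = 7:9, col4 = 10:12)

# Names of the columns to delete, held in a variable (hypothetical choice)
drop_cols <- c("col2", "col4")

# The parentheses tell data.table to use the contents of drop_cols
# rather than looking for a column literally called "drop_cols"
DT[, (drop_cols) := NULL]

names(DT)
```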
