Not so much 'How do you...?' but more 'How do YOU...?'
If you have a file someone gives you with 200 columns, and you want to reduce it to the few you need for analysis, how do you go about it? Does one solution offer benefits over another?
Assume we have a data frame with columns col1, col2, ..., col200. If you only wanted columns 1-100, 125-135, and 150-200, you could:
dat$col101 <- NULL
dat$col102 <- NULL # etc
or
dat <- dat[,c("col1","col2",...)]
or
dat <- dat[,c(1:100,125:135,...)] # shortest probably but I don't like this
or
dat <- dat[,!names(dat) %in% c("col101","col102",...)]
Anything else I'm missing? I know this is slightly subjective, but it's one of those nitty-gritty things where you might dive in, start doing it one way, and fall into a habit when there are far more efficient ways out there. Much like this question about which.
EDIT:
Or, is there an easy way to create a workable vector of column names? names(dat) doesn't print them with commas in between, which you need in the code examples above, so if you print the names that way you get spaces everywhere and have to add the commas by hand... Is there a command that will give you "col1","col2","col3",... as your output so you can easily grab what you want?
From http://www.statmethods.net/management/subset.html
Thought it was really clever to make a list of "not to include":
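(a sketch of that idea, using the column names from the question; paste0() just builds the "not to include" names)
drop_these <- names(dat) %in% c(paste0("col", 101:124), paste0("col", 136:149))
dat <- dat[!drop_these]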
To delete single columns, I'll just use dat$x <- NULL. To delete multiple columns, but less than about 3-4, I'll use dat$x <- dat$y <- dat$z <- NULL. For more than that, I'll use subset, with negative names (!):
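(a sketch using the column names from the question; list as many columns as you need to drop)
dat <- subset(dat, select = -c(col101, col102, col103))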
For the kinds of large files I tend to get, I generally wouldn't even do this in R. I would use the cut command in Linux to process data before it gets to R. This isn't a critique of R, just a preference for using some very basic Linux tools like grep, tr, cut, sort, uniq, and occasionally sed & awk (or Perl) when there's something to be done with regular expressions.
Another reason to use standard GNU commands is that I can pass them back to the source of the data and ask that they prefilter the data so that I don't get extraneous data. Most of my colleagues are competent with Linux, fewer know R.
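To illustrate the cut approach (a sketch; "big.csv" is a made-up file name, and the field numbers assume a comma-delimited file laid out like the question's):
# let cut keep only the wanted fields before the data ever reaches R
dat <- read.csv(pipe("cut -d',' -f1-100,125-135,150-200 big.csv"))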
(Updated) A method that I would like to use before long is to pair mmap with a text file and examine the data in situ, rather than reading it into RAM at all. I have done this with C, and it can be blisteringly fast.
The select() function from dplyr is powerful for subsetting columns. See ?select_helpers for a list of approaches.
In this case, where you have a common prefix and sequential numbers for column names, you could use num_range:
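(a minimal sketch, assuming the columns really are named col1 through col200)
library(dplyr)
# keep col1-col100, col125-col135 and col150-col200 by their numeric suffixes
dat <- select(dat, num_range("col", c(1:100, 125:135, 150:200)))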
More generally you can use the minus sign in select() to drop columns, like:
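(again a sketch, with dplyr loaded and the column names from the question)
dat <- select(dat, -col101, -col102)      # drop individual columns
dat <- select(dat, -(col101:col124))      # or drop a whole contiguous block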
Finally, to your question "is there an easy way to create a workable vector of column names?" - yes, if you need to edit a list of names manually, use dput to get a comma-separated, quoted list you can easily manipulate:
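(a sketch; the printed vector is abbreviated here)
dput(names(dat))
# prints c("col1", "col2", "col3", ...) as valid R code you can copy, edit, and paste back:
dat <- dat[, c("col1", "col2", "col3")]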
For clarity purposes, I often use the select argument in subset. With newer folks, I've learned that keeping the number of commands they need to pick up to a minimum helps adoption. As their skills increase, so too will their coding ability. And subset is one of the first commands I show people when they need to select data within given criteria. Something like:
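(a sketch; the row condition col1 > 0 is only there to show that one subset() call can filter rows and pick columns together)
dat_small <- subset(dat, col1 > 0,
                    select = c(col1:col100, col125:col135, col150:col200))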
I'm sure this will test slower than most other solutions, but I'm rarely at the point where microseconds make a difference.