Extract part of a file name in R

Posted 2019-03-01 19:05

Question:

I'm trying to write some code that opens all the data files in a folder and applies a function (or set of functions) to extract my data of interest. So far, so good. The problem is that I'd like to rename one of the columns I'm extracting from each file using one element of the file name, and I'm having a hard time figuring out how to extract it.

I have a bunch of files named "YYYY-MM-DD geneName data copy.txt" and would like to extract the "geneName" part of the file name. (For example, I have "2012-05-31 PMA1 data copy.txt".)

The date format is always the same (YYYY-MM-DD), and all the file names end in "data copy.txt".

Some of the file names also have an experiment annotation (either "E(number)" or "Expt(number)") between the date and geneName (for example, "2012-05-21 E7 PMA1 data copy.txt"); others have "SDM" between the geneName and "data copy.txt".

Here's a list of some file names and my desired output:

  • 2012-05-31 CTN1 data copy.txt (want "CTN1")
  • 2012-05-21 E7 PMA1 data copy.txt (want "PMA1")
  • 2011-11-29 TDH3 SDM data copy.txt (want "TDH3")
  • 2012-01-04 POX1 data copy.txt (want "POX1")

Any thoughts about how I can do that without having to remove the experiment number or "SDM" from some of the files by hand?

Thanks!

Answer 1:

The pattern here is a date, an optional E<digits> or Expt<digits> that you don't want, a word that you do want, then an optional SDM that you don't want, followed by 'data copy.txt'.

Here's my test data:

> names
[1] "2012-05-31 CTN1 data copy.txt"          
[2] "2012-05-21 E7 PMA1 data copy.txt"       
[3] "2011-11-29 TDH3 SDM data copy.txt"      
[4] "2012-01-04 POX1 data copy.txt"          
[5] "2011-11-29 ECHO data copy.txt"          
[6] "2011-11-29 E8 ECHO data copy.txt"       
[7] "2011-11-29 ECHO SDM data copy.txt"      
[8] "2011-11-29 Expt2 ECHO SDM data copy.txt"

and here's my sub:

> sub(pattern="^....-..-.. (E\\d+ |Expt\\d+ )*(\\w+) (SDM )*data copy.txt","\\2",names)
[1] "CTN1" "PMA1" "TDH3" "POX1" "ECHO" "ECHO" "ECHO" "ECHO"

This also works if your E-prefixes have more than one digit, because \\d+ matches one or more digits. I've added some gene names starting with E to my test set to make sure they are handled properly, as well as a case with both an Expt prefix and an SDM.
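
For reuse across a whole folder, the same idea can be wrapped in a small helper. This is only a sketch, not the answerer's code: the names extract_gene and read_gene_file are made up for illustration, and it assumes tab-delimited files and that the column to rename is the second one. It is a stricter variant of the sub() call above: the date must be digits, the dot in ".txt" is escaped, the pattern is anchored at the end, and the optional parts may appear at most once. File names that don't match are returned unchanged by sub(), so it may be worth checking for those.

```r
# Hypothetical helper: pull the gene name out of one or more file names.
extract_gene <- function(file_names) {
  pattern <- paste0(
    "^\\d{4}-\\d{2}-\\d{2} ",   # date, YYYY-MM-DD
    "(?:E(?:xpt)?\\d+ )?",      # optional E<digits> or Expt<digits>
    "(\\w+) ",                  # the gene name (the only captured group)
    "(?:SDM )?",                # optional SDM marker
    "data copy\\.txt$"          # fixed suffix, with a literal dot
  )
  # basename() drops any folder part, so full paths are accepted too
  sub(pattern, "\\1", basename(file_names), perl = TRUE)
}

# Hypothetical reader: assumes a tab-delimited file and that column 2
# is the one to be renamed after the gene.
read_gene_file <- function(path) {
  dat <- read.delim(path)
  names(dat)[2] <- extract_gene(path)
  dat
}
```

It could then be applied to every file in a folder (the folder name "data" is a placeholder) with something like files <- list.files("data", pattern = "data copy\\.txt$", full.names = TRUE) followed by lapply(files, read_gene_file).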