-->

How to set a for -loop in R

2020-08-01 05:07发布

问题:

I am a biologist and have less knowledge of programming. I have series of files(fasta format files) for which I need to apply an R package.

each file contents as follows:

FILE_1.FASTA

>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

FILE_2.FASTA

>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT

and the package (protr in R) works like this:

x = readFASTA(system.file(’protseq/P00750.fasta’, package = ’protr’))[[1]]

extractAAC(x)

Is there any possibility to set a forloop for the above lines to read multiple files and give the output in one file??

If possible please give me some idea or any example which could help me set a for-loop in R.

回答1:

It is very possible to do this. A good strategy to use would be to write a function that encapsulates what you want to do with each FASTA file:

# fasta is a string that represents the fasta file to be read.
read_and_extract <- function(fasta){
    seq <- readFASTA(fasta)[[1]]
    return(extractAAC(seq))
}

This wrapper function will allow you to read the FASTA file and extract the amino acid composition all in one fell swoop. In order to loop over the files, we will need to be in the same directory as your FASTA files.

setwd("path/to/files")

Using the dir command, you can get all of the names of the files that exist in that directory.

fasta_files <- dir(pattern = "[.]fasta$")

Note that the pattern argument tells the computer to only read files that end with ".fasta"

Now we perform the loop using the vapply function (see note below for details):

aa_comp <- vapply(fasta_files, read_and_extract, rep(pi, 20))

This will produce a matrix with the columns being each fasta file and the rows being each amino acid. Now we can save this as a simple csv file:

write.csv(aa_comp, file = "amino_acid_composition.csv")

Details of vapply

The vapply function is a fancy (and most times faster) way to do for loops in R. It looks a bit confusing at first, but it works very well if you know what your output will be. Let's look at the arguments:

> vapply(Argument1, Argument2, Argument3)

  • Argument1: The vector to be looped over (fasta_files)
  • Argument2: The function to apply to each element of the vector (read_and_extract)
  • Argument3: The expected output (rep(pi, 20))

The last argument is the hardest to grasp initially, but it's a representative vector of our expected output. In this case, the documentation for extractAAC says that it returns a numeric vector of length 20. The command rep(pi, 20) is telling R to replicate the number pi 20 times, thus giving a numeric vector of length 20.

There are more generalized versions of vapply that can return output of any type. See help("vapply") for details on those.



回答2:

There's two slightly complicated things here; one is the looping, and the other is writing all the results out to a file.

First off, if all you're trying to do is combine all your fasta files into one, its going to be much easier from your bash terminal than in R:

cat *.fasta > combined.fasta

But to answer your question for R, your loop could look something like this:

write("", file="combined.fasta")  # make sure the file exists before appending
for (fileName in dir(pattern='.fasta')) {
    x = readFASTA(system.file(fileName, package = ’protr’))[[1]]
    # do stuff to x
    write(x, file="combined.fasta", append=TRUE)
}


回答3:

You can use the straight forward for loop like this:

x <- list() # an empty list

for(f in yourFileList) {
  x[[which(yourFileList==f)]] <- readFASTA(system.file(f,package='protr'))[[1]]
}

You'll find more info under ?Control