I am a biologist and have less knowledge of programming. I have series of files(fasta format files) for which I need to apply an R package.
each file contents as follows:
FILE_1.FASTA
>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT
FILE_2.FASTA
>>TTBK2_Hsap ,(CK1/TTBK)
MSGGGEQLDILSVGILVKERWKVLRKIGGGGFGEIYDALDMLTRENVALKVESAQQPKQVLKMEVAVLKKLQGKDHVCRFIGCGRNDRFNYVVMQLQGRNLADLRRSQSRGTFT
and the package (protr in R) works like this:
x = readFASTA(system.file(’protseq/P00750.fasta’, package = ’protr’))[[1]]
extractAAC(x)
Is there any possibility to set a forloop for the above lines to read multiple files and give the output in one file??
If possible please give me some idea or any example which could help me set a for-loop in R.
It is very possible to do this. A good strategy to use would be to write a function that encapsulates what you want to do with each FASTA file:
# fasta is a string that represents the fasta file to be read.
read_and_extract <- function(fasta){
seq <- readFASTA(fasta)[[1]]
return(extractAAC(seq))
}
This wrapper function will allow you to read the FASTA file and extract the amino acid composition all in one fell swoop. In order to loop over the files, we will need to be in the same directory as your FASTA files.
setwd("path/to/files")
Using the dir
command, you can get all of the names of the files that exist in that directory.
fasta_files <- dir(pattern = "[.]fasta$")
Note that the pattern
argument tells the computer to only read files that end with ".fasta
"
Now we perform the loop using the vapply
function (see note below for details):
aa_comp <- vapply(fasta_files, read_and_extract, rep(pi, 20))
This will produce a matrix with the columns being each fasta file and the rows being each amino acid. Now we can save this as a simple csv file:
write.csv(aa_comp, file = "amino_acid_composition.csv")
Details of vapply
The vapply
function is a fancy (and most times faster) way to do for
loops in R. It looks a bit confusing at first, but it works very well if you know what your output will be. Let's look at the arguments:
> vapply(Argument1, Argument2, Argument3)
- Argument1: The vector to be looped over (
fasta_files
)
- Argument2: The function to apply to each element of the vector (
read_and_extract
)
- Argument3: The expected output (
rep(pi, 20)
)
The last argument is the hardest to grasp initially, but it's a representative vector of our expected output. In this case, the documentation for extractAAC
says that it returns a numeric vector of length 20. The command rep(pi, 20)
is telling R to replicate the number pi
20 times, thus giving a numeric vector of length 20.
There are more generalized versions of vapply
that can return output of any type. See help("vapply")
for details on those.
There's two slightly complicated things here; one is the looping, and the other is writing all the results out to a file.
First off, if all you're trying to do is combine all your fasta
files into one, its going to be much easier from your bash
terminal than in R
:
cat *.fasta > combined.fasta
But to answer your question for R
, your loop could look something like this:
write("", file="combined.fasta") # make sure the file exists before appending
for (fileName in dir(pattern='.fasta')) {
x = readFASTA(system.file(fileName, package = ’protr’))[[1]]
# do stuff to x
write(x, file="combined.fasta", append=TRUE)
}
You can use the straight forward for loop like this:
x <- list() # an empty list
for(f in yourFileList) {
x[[which(yourFileList==f)]] <- readFASTA(system.file(f,package='protr'))[[1]]
}
You'll find more info under ?Control