Question:
Here is a small example:
X1 <- c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC")
X2 <- c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC")
X3 <- c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA")
mydf1 <- data.frame(X1, X2, X3)
Input data frame
X1 X2 X3
1 AC AC AC
2 AC AC AC
3 AC AC AC
4 CA CA AC
5 TA AT AA
6 AT CA AT
7 CC AC CC
8 CC TC CA
The function
# Function
atgc <- function(x) {
xlate <- c( "AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14,
"CA"= 12, "CC" = 22, "CG"= 23,"CT"= 24,
"GA" = 13, "GC" = 23, "GG"= 33,"GT"= 34,
"TA"= 14, "TC" = 24, "TG"= 34,"TT"=44,
"ID"= 56, "DI"= 56, "DD"= 55, "II"= 66
)
x = xlate[x]
}
outdataframe <- sapply (mydf1, atgc)
outdataframe
X1 X2 X3
AA 11 11 12
AA 11 11 12
AA 11 11 12
AG 13 13 12
CA 12 12 11
AC 12 13 13
AT 14 11 12
AT 14 14 14
Problem: AC comes out as 11 in the output rather than 12, and similarly for the other values. It is just a mess!
(Extra: I also do not know how to get rid of the row names.)
Answer 1:
Just use apply and transpose (apply first converts the data frame to a character matrix, so the lookup is done by name):
t(apply (mydf1, 1, atgc))
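To use sapply instead, either:
set stringsAsFactors=FALSE when creating your data frame, i.e.
mydf1 <- data.frame(X1, X2, X3, stringsAsFactors=FALSE)
(thanks @joran), or
change the last line of your function to: x = xlate[as.vector(x)]
Both options make the lookup use the character values rather than the integer codes of the factor. (From R 4.0.0 onward, stringsAsFactors=FALSE is the default.)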
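To see why the original lookup went wrong, here is a minimal sketch using a shortened, illustrative version of xlate (the names f, by_code and by_name are made up for this example). It shows that indexing a named vector with a factor uses the factor's integer codes, while indexing with the character values uses the names:

```r
# Named lookup vector (a shortened version of xlate from the question).
xlate <- c("AA" = 11, "AC" = 12, "AG" = 13, "AT" = 14, "CA" = 12)

# A factor stores integer codes that point into its sorted levels.
f <- factor(c("AC", "CA", "AC"))
levels(f)       # "AC" "CA"
as.integer(f)   # 1 2 1

# Indexing by a factor uses the integer codes, i.e. positions 1, 2, 1.
by_code <- xlate[f]

# Indexing by character uses the names, which is what was intended.
by_name <- xlate[as.character(f)]

unname(by_code)  # 11 12 11
unname(by_name)  # 12 12 12
```

Wrapping the result in unname() is also one way to drop the names that the lookup vector attaches to its output.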
Answer 2:
The match function can take factor arguments with a target matching vector that is of class "character":
atgc <- function(fac){ c(11, 12, 13, 14,
12, 22, 23, 24,
13, 23, 33, 34,
14, 24, 34,44,
56, 56, 55, 66 )[
match(fac,
c("AA", "AC", "AG", "AT",
"CA", "CC", "CG","CT",
"GA", "GC", "GG","GT" ,
"TA", "TC", "TG","TT",
"ID", "DI", "DD", "II") )
]}
#The match function returns an index that is designed to pull from a vector.
sapply(mydf1, atgc)
X1 X2 X3
[1,] 12 12 12
[2,] 12 12 12
[3,] 12 12 12
[4,] 12 12 12
[5,] 14 14 11
[6,] 14 12 14
[7,] 22 12 22
[8,] 22 24 12
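As a small sketch of the same idea on a shortened table (keys, values and pos are illustrative names, not from the answer), match() compares a factor by its labels and returns positions that can then index the vector of values:

```r
keys   <- c("AA", "AC", "AG", "AT", "CA")
values <- c(11, 12, 13, 14, 12)

f <- factor(c("AC", "CA", "AT"))

# match() compares a factor by its labels, not its integer codes,
# and returns the position of each label within keys.
pos <- match(f, keys)   # 2 5 4
values[pos]             # 12 12 14

# A label missing from keys gives NA instead of an error.
match(factor("ZZ"), keys)  # NA
```

Because the values vector carries no names, the result has no names either, which is likely why the sapply output above shows plain row indices.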
Answer 3:
This way, you only have to supply replacement values for each individual letter in the matrix, without having to double-check to make sure you considered all combinations and matched them correctly, although with your example the combinations are limited.
Define a list of values and their substitutes:
trans <- list(c("A","1"),c("C","2"),c("G","3"),c("T","4"),
c("I","6"),c("D","5"))
Define a replacement function using gsub():
atgc2 <- function(myData, x) gsub(x[1], x[2], myData)
Create a matrix with the replaced values. In this case, converting mydf1 to a matrix returned character values, as desired for gsub(), but you would want to check whether this works with any other data before proceeding:
mymat <- Reduce(atgc2, trans, init = as.matrix(mydf1))
The values in mymat are still in the order in which they originally appeared, so "AC" = "12" and "CA" = "21". Reorder the digits within each value and convert the results to numeric values:
ansVec <- sapply( strsplit( mymat, split = ""),
function(x) as.numeric( paste0( sort( as.numeric(x) ), collapse = "")))
The object ansVec is a vector, so convert it back into a data.frame:
( mydf2 <- data.frame( matrix( ansVec, nrow = nrow(mydf1) ) ) )
# X1 X2 X3
# 1 12 12 12
# 2 12 12 12
# 3 12 12 12
# 4 12 12 12
# 5 14 14 11
# 6 14 12 14
# 7 22 12 22
# 8 22 24 12
For this situation, the other answers are definitely faster. However, as the replacement operations get more complex, I think this solution might offer some benefits. One thing this method would not address, however, is checking the string "ATTGCG" for both "ATT" and "TTG".
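As a variation on the letter-by-letter idea (not the method used in this answer), base R's chartr() can do all the single-character substitutions in one call. The sketch below assumes the same letter-to-digit mapping as trans; the names x, digits and codes are illustrative:

```r
x <- c("AC", "CA", "TA", "ID", "DI")

# chartr() translates character by character:
# A->1, C->2, G->3, T->4, D->5, I->6.
digits <- chartr("ACGTDI", "123456", x)   # "12" "21" "41" "65" "56"

# Sort the two digits within each code so that "21" and "12" coincide.
codes <- vapply(strsplit(digits, ""), function(d) {
  as.numeric(paste(sort(d), collapse = ""))
}, numeric(1))

codes  # 12 12 14 56 56
```

Like the gsub() approach, this only handles one-to-one character substitutions, so it shares the limitation with overlapping substrings mentioned above.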
Answer 4:
Actually, I think you want to represent your original vectors as factors, because they represent a finite set of levels (DNA dinucleotides) rather than arbitrary character values.
lvls = c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT", "GA", "GC",
"GG", "GT", "TA", "TC", "TG", "TT", "ID", "DI", "DD", "II")
X1 <- factor(c("AC", "AC", "AC", "CA", "TA", "AT", "CC", "CC"), levels=lvls)
X2 <- factor(c("AC", "AC", "AC", "CA", "AT", "CA", "AC", "TC"), levels=lvls)
X3 <- factor(c("AC", "AC", "AC", "AC", "AA", "AT", "CC", "CA"), levels=lvls)
mydf1 <- data.frame(X1, X2, X3)
Likewise, "11" is a level of a factor, and not the number eleven. So a mapping between levels is
xlate <- c("AA" = "11", "AC" = "12", "AG" = "13", "AT" = "14",
"CA"= "12", "CC" = "22", "CG"= "23","CT"= "24",
"GA" = "13", "GC" = "23", "GG"= "33","GT"= "34",
"TA"= "14", "TC" = "24", "TG"= "34","TT"="44",
"ID"= "56", "DI"= "56", "DD"= "55", "II"= "66")
and to 're-level' a single variable
levels(X1) <- xlate
To re-level all columns of the data frame,
as.data.frame(lapply(mydf1, `levels<-`, xlate))
Using sapply isn't appropriate, because that creates a matrix (of character), even though you've named it outdataframe. The distinction might actually be important for the SNP data that this might represent, since millions of SNPs across thousands of samples stored as a matrix would be implemented as a single vector longer than the longest vector R can store (modulo large vector support being introduced in R-devel), whereas the data frame would be a list of vectors of only millions of elements each.
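The difference between the two return types can be checked on a tiny example. This is a sketch with a shortened, illustrative level set (three levels instead of twenty); df, out_df and out_mat are hypothetical names:

```r
lvls  <- c("AC", "CA", "CC")
xlate <- c("12", "12", "22")   # new labels, in the same order as lvls

df <- data.frame(
  X1 = factor(c("AC", "CA", "CC"), levels = lvls),
  X2 = factor(c("CC", "AC", "AC"), levels = lvls)
)

# lapply() returns a list of factors, which as.data.frame() keeps as columns.
out_df <- as.data.frame(lapply(df, `levels<-`, xlate))

# sapply() simplifies its result to a single matrix.
out_mat <- sapply(df, function(col) xlate[match(col, lvls)])

class(out_df)       # "data.frame"
is.matrix(out_mat)  # TRUE
levels(out_df$X1)   # "12" "22"
```

Note that assigning levels works by position, so the new labels have to be listed in the same order as the existing levels; the duplicated label "12" merges "AC" and "CA" into one level.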