Dictionary style replace multiple items

Posted 2019-01-02 20:40

I have a large data.frame of character data that I want to convert based on what is commonly called a dictionary in other languages.

Currently I am going about it like so:

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"), snp2 = c("AA", "AT", "AG", "AA"), snp3 = c(NA, "GG", "GG", "GC"), stringsAsFactors=FALSE)
foo <- replace(foo, foo == "AA", "0101")
foo <- replace(foo, foo == "AC", "0102")
foo <- replace(foo, foo == "AG", "0103")

This works fine, but it is obviously not pretty and seems silly to repeat the replace statement each time I want to replace one item in the data.frame.

Is there a better way to do this since I have a dictionary of approximately 25 key/value pairs?

8 answers
不再属于我。
#2 · 2019-01-02 21:14

Using dplyr::recode:

library(dplyr)

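# funs() is deprecated in current dplyr; a formula lambda does the same job.
# .default = NA_character_ turns every value without a key into NA.
mutate_all(foo, ~ recode(., "AA" = "0101", "AC" = "0102", "AG" = "0103",
                         .default = NA_character_))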

#   snp1 snp2 snp3
# 1 0101 0101 <NA>
# 2 0103 <NA> <NA>
# 3 0101 0103 <NA>
# 4 0101 0101 <NA>
孤独总比滥情好
#3 · 2019-01-02 21:16

If you're open to using packages, plyr is a very popular one, and its mapvalues() function does just what you're looking for. It operates on a vector or factor rather than a whole data.frame, so apply it to each column:

library(plyr)
foo[] <- lapply(foo, mapvalues, from=c("AA", "AC", "AG"), to=c("0101", "0102", "0103"), warn_missing=FALSE)

Note that mapvalues() works for vectors of any type, not just strings.

呛了眼睛熬了心
#4 · 2019-01-02 21:24
map = setNames(c("0101", "0102", "0103"), c("AA", "AC", "AG"))
foo[] <- map[unlist(foo)]

This assumes that map covers all the cases in foo; any value without an entry in map (including NA) comes back as NA. It would feel less like a 'hack' and be more efficient in both space and time if foo were a matrix (of character()), in which case:

matrix(map[foo], nrow=nrow(foo), dimnames=dimnames(foo))

Both matrix and data frame variants run afoul of R's 2^31-1 limit on vector size when there are millions of SNPs and thousands of samples.
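
When map does not cover every value, a lookup that leaves unmatched entries alone may be preferable. The following is a minimal base R sketch of that idea, not part of the original answer; the helper name translate is made up for this illustration, and foo is rebuilt from the question so the block runs on its own.

```r
# Lookup table: names are the keys, elements are the replacement values.
map <- c(AA = "0101", AC = "0102", AG = "0103")

foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace only the values that have a key in `map`
# and leave everything else (including NA) unchanged.
translate <- function(x, map) {
  hit <- x %in% names(map)   # NA %in% ... is FALSE, so NA is never touched
  x[hit] <- map[x[hit]]
  x
}

# foo[] keeps the data.frame structure while replacing every column.
foo[] <- lapply(foo, translate, map = map)
foo
```

Working column by column keeps each lookup to the length of one column instead of unlisting the whole data frame at once.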

低头抚发
#5 · 2019-01-02 21:24

Since it's been a few years since the last answer, and a new question came up tonight on this topic and a moderator closed it, I'll add it here. The poster has a large data frame containing 0, 1, and 2, and wants to change them to AA, AB, and BB respectively.

Use plyr:

> df <- data.frame(matrix(sample(c(NA, c("0","1","2")), 100, replace = TRUE), 10))
> df
     X1   X2   X3 X4   X5   X6   X7   X8   X9  X10
1     1    2 <NA>  2    1    2    0    2    0    2
2     0    2    1  1    2    1    1    0    0    1
3     1    0    2  2    1    0 <NA>    0    1 <NA>
4     1    2 <NA>  2    2    2    1    1    0    1
... to 10th row

> df[] <- lapply(df, as.character)

Apply revalue to each column to replace multiple terms (note that apply returns a character matrix, not a data frame):

> library(plyr)
> apply(df, 2, function(x) {x <- revalue(x, c("0"="AA","1"="AB","2"="BB")); x})
      X1   X2   X3   X4   X5   X6   X7   X8   X9   X10 
 [1,] "AB" "BB" NA   "BB" "AB" "BB" "AA" "BB" "AA" "BB"
 [2,] "AA" "BB" "AB" "AB" "BB" "AB" "AB" "AA" "AA" "AB"
 [3,] "AB" "AA" "BB" "BB" "AB" "AA" NA   "AA" "AB" NA  
 [4,] "AB" "BB" NA   "BB" "BB" "BB" "AB" "AB" "AA" "AB"
... and so on
萌妹纸的霸气范
#6 · 2019-01-02 21:26

Here's something simple that will do the job:

key <- c('AA','AC','AG')
val <- c('0101','0102','0103')

lapply(1:3,FUN = function(i){foo[foo == key[i]] <<- val[i]})
foo

  snp1 snp2 snp3
1 0101 0101 <NA>
2 0103   AT   GG
3 0101 0103   GG
4 0101 0101   GC

lapply returns a list here that we don't actually care about; you can assign the result to something and discard it, or wrap the call in invisible(). I'm iterating over the indices, but you could just as easily put the key/value pairs in a list and iterate over them directly. Note the use of global assignment with <<-: a plain <- inside the function would only change a local copy of foo.

I tinkered with a way to do this with mapply but my first attempt didn't work, so I switched. I suspect a solution with mapply is possible, though.
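
For comparison, here is a sketch of the same pairwise replacement written as an ordinary for loop, which assigns in the current environment and so needs no <<-. This is an alternative phrasing rather than the mapply solution mentioned above; foo, key and val are redefined so the block runs on its own.

```r
foo <- data.frame(snp1 = c("AA", "AG", "AA", "AA"),
                  snp2 = c("AA", "AT", "AG", "AA"),
                  snp3 = c(NA, "GG", "GG", "GC"),
                  stringsAsFactors = FALSE)

key <- c("AA", "AC", "AG")
val <- c("0101", "0102", "0103")

# seq_along(key) adapts to however many pairs there are.
for (i in seq_along(key)) {
  # !is.na(foo) keeps NA cells out of the logical index matrix.
  foo[!is.na(foo) & foo == key[i]] <- val[i]
}
foo
```

Because each key is handled in turn, a replacement value that happens to equal a later key would be replaced again; with these keys and values that cannot occur.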

像晚风撩人
#7 · 2019-01-02 21:32

I used @Ramnath's answer above, but made it read the pairs (what to replace and what to replace it with) from a file and use gsub rather than replace.

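# Read the tab-separated from/to pairs
hrw <- read.csv("hgWords.txt", header=TRUE, stringsAsFactors=FALSE, encoding="UTF-8", sep="\t")

# document is the character vector to modify; seq_len() visits every row
# (a bare nrow(hrw) would run the loop only once, for the last row)
for (i in seq_len(nrow(hrw)))
{
  document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case=TRUE)
}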
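
A self-contained sketch of this loop, iterating over every row with seq_len(nrow(hrw)). As assumptions for illustration only, the lookup table is built inline instead of being read from a file, and the document vector is made-up sample text.

```r
# Lookup table built inline (assumption: stands in for the file contents).
hrw <- data.frame(from = c("AA", "AC", "AG"),
                  to   = c("0101", "0102", "0103"),
                  stringsAsFactors = FALSE)

# Made-up sample text to run the replacements on.
document <- c("aa and AG", "AC only", "nothing here")

# Apply each from/to pair in turn, ignoring case.
for (i in seq_len(nrow(hrw))) {
  document <- gsub(hrw$from[i], hrw$to[i], document, ignore.case = TRUE)
}
document
```

Unlike the whole-value replacements in the other answers, gsub treats from as a regular expression and replaces matches anywhere inside a string, so a cell such as "AAG" would be partially rewritten.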

hgWords.txt contains the following tab-separated data:

"from"  "to"
"AA"    "0101"
"AC"    "0102"
"AG"    "0103" 