Unlist multiple values in dataframe column but kee

Posted 2019-09-15 00:59

Question:

I have a data frame that contains a column with multiple values consisting of gene name synonyms separated by semicolons:

score <- c("32.01","19.5","18.0")
symbol <- c("30 kDa adipocyte complemen related protein","AAT1","Cachectin")
synonym <- c("30 kDa adipocyte complemen related protein; 30 kDa adipocyte complement-related protein; ACDC; ACRP30; ADIPOQ; APM-1; APM1; Adipocyte C1Q and collagen domain containing","AAT1; AAT1; ALT-1; ALT1; Alanine aminotransferase; Alanine aminotransferase 1; GPT 1; GPT1; Glutamate pyruvate transaminase; Glutamic--alanine transaminase 1; Glutamic--pyruvic transaminase 1","Cachectin; TNF alpha; TNF-a; TNFA; TNFSF-2; TNFSF2; TNFalpha; Tumor necrosis factor; Tumor necrosis factor ligand superfamily member 2; Tumor necrosis factor precursor; tumor necrosis factor alpha")
df <- data.frame(score, symbol, synonym, stringsAsFactors=FALSE)

This is raw output from data mining. I'm mapping the official gene symbols in the data to Entrez IDs. The symbol column frequently doesn't contain a gene symbol, so I have to extract all the synonyms (typically, there's an official symbol in the list). I want to keep track of row numbers so that, once I've mapped all symbols to Entrez IDs, I can identify the rows that didn't map.

I'm currently using strsplit and unlist to parse out the synonyms, but I lose track of which row each synonym came from:

tmp <- data.frame(unlist(strsplit(as.character(df$synonym), "; ")))

What I want is something that looks like this:

originalRow <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3)
cbind(tmp, originalRow)

    synonym                                            originalRow
1   30 kDa adipocyte complemen related protein         1
2   30 kDa adipocyte complement-related protein        1
3   ACDC                                               1
4   ACRP30                                             1
5   ADIPOQ                                             1
6   APM-1                                              1
7   APM1                                               1
8   Adipocyte C1Q and collagen domain containing       1
9   AAT1                                               2
10  AAT1                                               2
11  ALT-1                                              2
12  ALT1                                               2
13  Alanine aminotransferase                           2
14  Alanine aminotransferase 1                         2
15  GPT 1                                              2
16  GPT1                                               2
17  Glutamate pyruvate transaminase                    2
18  Glutamic--alanine transaminase 1                   2
19  Glutamic--pyruvic transaminase 1                   2
20  Cachectin                                          3
21  TNF alpha                                          3
22  TNF-a                                              3
23  TNFA                                               3
24  TNFSF-2                                            3
25  TNFSF2                                             3
26  TNFalpha                                           3
27  Tumor necrosis factor                              3
28  Tumor necrosis factor ligand superfamily member 2  3
29  Tumor necrosis factor precursor                    3
30  tumor necrosis factor alpha                        3

Any advice would be appreciated!

Answer 1:

Here you can split each row, combine the values for the row with its row number into a data.frame, then bind all those data.frames together. You can do that with

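# Split on "; " (semicolon plus space) so the pieces carry no leading whitespace,
# pair each row's pieces with that row's index, then stack the per-row data frames.
do.call("rbind", Map(data.frame,
    synonym=strsplit(as.character(df$synonym), "; ", fixed=TRUE),
    origRow=seq_along(df$synonym))
)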
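
As a possible alternative to building one data.frame per row, the row index can be repeated once per synonym and attached in a single step. This is only a sketch, not part of the original answer: the `df` below is an abbreviated stand-in for the question's data, and the names `parts` and `long` are made up for illustration.

```r
# Abbreviated stand-in for the question's `df` (fewer synonyms per row).
df <- data.frame(
  symbol  = c("30 kDa adipocyte complemen related protein", "AAT1", "Cachectin"),
  synonym = c("ACDC; ACRP30; ADIPOQ", "AAT1; ALT1", "Cachectin; TNF alpha; TNFA"),
  stringsAsFactors = FALSE
)

# One character vector of synonyms per original row.
parts <- strsplit(df$synonym, "; ", fixed = TRUE)

# lengths(parts) is the number of synonyms in each row; rep() repeats each
# row index that many times, so it lines up with unlist(parts).
long <- data.frame(
  synonym     = unlist(parts),
  originalRow = rep(seq_along(parts), lengths(parts)),
  stringsAsFactors = FALSE
)
print(long)
```

Because everything is vectorised, this should scale better than calling rbind on many small data frames, though for a few thousand rows either approach is likely fine.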


Answer 2:

Another approach would be to store the synonyms in a list, which can then be iterated over and compared to the symbol vector. Because list element i holds the synonyms for row i, the position in the list already records the original row number, so there is no need to track it separately. This also trims whitespace before the comparison.

lst <- lapply(synonym, function(x) trimws(unlist(strsplit(x, ";"))))
lapply(lst, setdiff, symbol)  # return values not in symbol array

[[1]]
[1] "30 kDa adipocyte complement-related protein"  "ACDC"                                        
[3] "ACRP30"                                       "ADIPOQ"                                      
[5] "APM-1"                                        "APM1"                                        
[7] "Adipocyte C1Q and collagen domain containing"

[[2]]
[1] "ALT-1"                            "ALT1"                             "Alanine aminotransferase"        
[4] "Alanine aminotransferase 1"       "GPT 1"                            "GPT1"                            
[7] "Glutamate pyruvate transaminase"  "Glutamic--alanine transaminase 1" "Glutamic--pyruvic transaminase 1"

[[3]]
[1] "TNF alpha"                                         "TNF-a"                                            
[3] "TNFA"                                              "TNFSF-2"                                          
[5] "TNFSF2"                                            "TNFalpha"                                         
[7] "Tumor necrosis factor"                             "Tumor necrosis factor ligand superfamily member 2"
[9] "Tumor necrosis factor precursor"                   "tumor necrosis factor alpha"    
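
To get from this list to the question's stated goal (finding rows where nothing mapped), one option is to test each list element against the set of names that did map. The sketch below is hedged: `synonym` is an abbreviated stand-in for the question's vector, and `mapped` is a hypothetical lookup invented for illustration; in practice it would come from the actual Entrez mapping step.

```r
# Abbreviated stand-in for the question's `synonym` vector.
synonym <- c("ACDC; ACRP30; ADIPOQ", "AAT1; ALT1", "Cachectin; TNF alpha; TNFA")

# HYPOTHETICAL: names assumed to have mapped to an Entrez ID.
mapped <- c("ADIPOQ", "TNFA")

# Same list construction as above: one trimmed character vector per row.
lst <- lapply(synonym, function(x) trimws(unlist(strsplit(x, ";"))))

# TRUE for each row that has at least one mapped synonym.
has_hit <- vapply(lst, function(s) any(s %in% mapped), logical(1))

# Indices of the rows where no synonym mapped.
unmapped_rows <- which(!has_hit)
print(unmapped_rows)
```

With the made-up `mapped` vector, only the second row has no match, so it is the one reported.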


Tags: r strsplit