I have a data frame with a column in which each value is a list of gene name synonyms separated by semicolons:
score <- c("32.01","19.5","18.0")
symbol <- c("30 kDa adipocyte complemen related protein","AAT1","Cachectin")
synonym <- c("30 kDa adipocyte complemen related protein; 30 kDa adipocyte complement-related protein; ACDC; ACRP30; ADIPOQ; APM-1; APM1; Adipocyte C1Q and collagen domain containing","AAT1; AAT1; ALT-1; ALT1; Alanine aminotransferase; Alanine aminotransferase 1; GPT 1; GPT1; Glutamate pyruvate transaminase; Glutamic--alanine transaminase 1; Glutamic--pyruvic transaminase 1","Cachectin; TNF alpha; TNF-a; TNFA; TNFSF-2; TNFSF2; TNFalpha; Tumor necrosis factor; Tumor necrosis factor ligand superfamily member 2; Tumor necrosis factor precursor; tumor necrosis factor alpha")
df <- data.frame(score, symbol, synonym, stringsAsFactors=FALSE)
This is raw output from data mining. I'm mapping the official gene symbols in the data to Entrez IDs. The symbol column frequently doesn't contain an official gene symbol, so I have to extract all the synonyms (the list typically includes the official symbol). I want to keep track of row numbers so that, once I've mapped all symbols to Entrez IDs, I can identify the rows that didn't map.
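To make the bookkeeping concrete, here is a rough sketch of what I'd like to be able to do once each synonym carries its source row. Everything in it is a stand-in: `long` is a hand-built toy table, and `official` is a hypothetical vector of official symbols standing in for the real Entrez lookup.

```r
# Toy long-format table (hypothetical): one synonym per line, tagged with its source row
long <- data.frame(
  synonym     = c("ACDC", "ADIPOQ", "Alanine aminotransferase", "GPT 1", "TNFA", "Cachectin"),
  originalRow = c(1, 1, 2, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Placeholder for the real mapping step: symbols the mapping resource recognises (assumed)
official <- c("ADIPOQ", "GPT", "TNF")

# Flag each synonym that matched, then collect the rows with at least one match
long$mapped <- long$synonym %in% official
mappedRows  <- unique(long$originalRow[long$mapped])

# Rows for which no synonym mapped at all
unmappedRows <- setdiff(unique(long$originalRow), mappedRows)
```

In this toy case only row 1 has a synonym that matches, so rows 2 and 3 would be the ones I need to follow up on.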
I'm currently using strsplit and unlist to parse out the synonyms, but I lose track of which row each synonym came from:
tmp <- data.frame(synonym = unlist(strsplit(df$synonym, "; ")), stringsAsFactors=FALSE)
What I want is something that looks like this:
originalRow <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3)
cbind(tmp, originalRow)
    synonym                                            originalRow
 1  30 kDa adipocyte complemen related protein         1
 2  30 kDa adipocyte complement-related protein        1
 3  ACDC                                               1
 4  ACRP30                                             1
 5  ADIPOQ                                             1
 6  APM-1                                              1
 7  APM1                                               1
 8  Adipocyte C1Q and collagen domain containing       1
 9  AAT1                                               2
10  AAT1                                               2
11  ALT-1                                              2
12  ALT1                                               2
13  Alanine aminotransferase                           2
14  Alanine aminotransferase 1                         2
15  GPT 1                                              2
16  GPT1                                               2
17  Glutamate pyruvate transaminase                    2
18  Glutamic--alanine transaminase 1                   2
19  Glutamic--pyruvic transaminase 1                   2
20  Cachectin                                          3
21  TNF alpha                                          3
22  TNF-a                                              3
23  TNFA                                               3
24  TNFSF-2                                            3
25  TNFSF2                                             3
26  TNFalpha                                           3
27  Tumor necrosis factor                              3
28  Tumor necrosis factor ligand superfamily member 2  3
29  Tumor necrosis factor precursor                    3
30  tumor necrosis factor alpha                        3
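One direction that might work, sketched here on a shortened toy version of the data rather than the full synonym strings: keep the list that strsplit returns, count the pieces per row with lengths(), and repeat each row index that many times with rep(). The names `parts` and `long` are just illustrative, and I'm assuming the separator is always exactly "; ".

```r
# Toy stand-in for df, using a few of the synonyms from the real data (shortened for brevity)
df <- data.frame(
  symbol  = c("30 kDa adipocyte complemen related protein", "AAT1", "Cachectin"),
  synonym = c("ACDC; ACRP30; ADIPOQ", "AAT1; ALT1; GPT1", "Cachectin; TNFA; TNFSF2"),
  stringsAsFactors = FALSE
)

# Split each row's synonym string; the result is a list with one character vector per row
parts <- strsplit(df$synonym, "; ", fixed = TRUE)

# lengths(parts) is the number of synonyms in each row, so rep() repeats
# each row index once per synonym, in the same order that unlist() flattens them
long <- data.frame(
  synonym     = unlist(parts),
  originalRow = rep(seq_len(nrow(df)), times = lengths(parts)),
  stringsAsFactors = FALSE
)
```

Because unlist() flattens the list in row order, the repeated indices should line up with the synonyms one to one. Note that lengths() needs R 3.2.0 or later; on older versions sapply(parts, length) would be the equivalent.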
Any advice would be appreciated!