How to merge two data frame based on partial strin

2019-02-21 05:56发布

I have two data frames:

the first one contains a huge number of proteins for which I have made several calculations. here an example:

Accession Description # Peptides A2 # PSM A2 # Peptides B2 # PSM B2 # Peptides C2 # PSM C2 # Peptides D2 # PSM D2 # Peptides E2 # PSM E2 # AAs MW [kDa] calc. pI P01837 Ig kappa chain C region OS=Mus musculus PE=1 SV=1 - [IGKC_MOUSE] 10 319 8 128 8 116 7 114 106 11,8 5,41 P01868 Ig gamma-1 chain C region secreted form OS=Mus musculus GN=Ighg1 PE=1 SV=1 - [IGHG1_MOUSE] 13 251 15 122 16 116 16 108 324 35,7 7,40 P60710 Actin, cytoplasmic 1 OS=Mus musculus GN=Actb PE=1 SV=1 - [ACTB_MOUSE] 15 215 10 37 11 30 11 31 16 154 375 41,7 5,48

the second contains the proteins of interest. here an example:

complex Description Accession protein TFIID [TAF1_MOUSE] Q80UV9-3 Isoform 3 of Transcription initiation factor TFIID subunit 1 OS=Mus musculus GN=Taf1 - [TAF1_MOUSE] TFIID [TAF2_MOUSE] Q8C176 Transcription initiation factor TFIID subunit 2 OS=Mus musculus GN=Taf2 PE=2 SV=2 - [TAF2_MOUSE] TFIID [TAF3_MOUSE] Q5HZG4 Transcription initiation factor TFIID subunit 3 OS=Mus musculus GN=Taf3 PE=1 SV=2 - [TAF3_MOUSE]

What I want to do: get one data frame containing the values from my calculations for the proteins of interest only. In a first attempt I used:

fusion <- merge.data.frame(x=tableaucleanIPTAFXwoNA, y=sublist, by.x="Description", by.y="protein", all =FALSE)

However, the nomenclature of the protein names are different between the two data frames and using the merge function this does not work.

So, how could I perform a partial match for "TAF10" when it is part of "Transcription initiation factor TFIID subunit 10 OS=Mus musculus GN=Taf10 PE=1 SV=1 - [TAF10_MOUSE]" string text ? In other words I want R recognizes only a piece o f the whole string.

I tried to use grep function:

idx2 <- sapply("tableaucleanIPTAFX$Description", grep, "sublist$Description")  

However, I got that:

as.data.frame(idx2)
[1] tableaucleanIPTAFX.Description
<0 rows> (or 0-length row.names)

I guess that, the pattern is not correctly recognized... Then I visited the RegExr website to write a regular expression so that my id names can be recognized. I found that this works to recognize [TRRAP_MOUSE] into

Transformation/transcription domain-associated protein OS=Mus musculus GN=Trrap PE=1 SV=2 - [TRRAP_MOUSE] :

with

 /(TRRAP_[MOUSE])\w+/g

I wonder how I can implement it to my id list (the "Description" column in my example) ?

2条回答
相关推荐>>
2楼-- · 2019-02-21 06:35

(Your question is a bit vague - it would be better with some sample/foobar data - so this answer unfortunately is too)

Try this:

?grep                                       # Pattern Matching and Replacement
X <- data.frame(a = letters[1:10])
grep(pattern = "c", x = X$a)                # returns position of "c": 3
grepl(pattern = "c", x = X$a)               # returns a vector of bools: [ F F T F F ... ]
X[grepl(pattern = "c", x = X$a),"a") <- "C" # replaces "c" with "C"

PS:

  • depending on how big / dirty your element names lists are, I've often found it useful to (i) create a clean (short and unambiguous) dictionary of names, (ii) add a new column with this new name to each original list and (iii) perform the merge with these columns;
  • aside from base::merge, I like to use dplyr's join functions (mostly because I like their cheat sheet);
查看更多
够拽才男人
3楼-- · 2019-02-21 06:47

This might work for you and it handles duplicates:

First some dummy data:

df1 <- data.frame(name=c("George", "Abraham", "Barack"), stringsAsFactors = F)
df2 <- data.frame(president=c("Thanks, Obama (Barack)","Lincoln, Abraham, George""George Washington"), stringsAsFactors = F)

Find the code in the full description using grep:

idx2 <- sapply(df1$name, grep, df2$president)

This can result in multiple matches if multiple descriptions match the code so here I duplicate the original indices so the results align:

idx1 <- sapply(seq_along(idx2), function(i) rep(i, length(idx2[[i]])))

"merge" the datasets with cbind aligned on the new indices:

> cbind(df1[unlist(idx1),,drop=F], df2[unlist(idx2),,drop=F])
       name                president
1    George Lincoln, Abraham, George
1.1  George        George Washington
2   Abraham Lincoln, Abraham, George
3    Barack   Thanks, Obama (Barack)
查看更多
登录 后发表回答