'StringCut' to the left or right of a defi

Published 2019-04-21 20:48

Question:

On reading this question, I thought the following problem would be simple using StringSplit.

Given the following string, I want to 'cut' it to the left of every "D" such that:

  1. I get a List of fragments (with sequence unchanged)

  2. StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.

(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")

str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"

The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.

frags1 = StringSplit@StringReplace[str, "D" -> " D"]

giving as output:

{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}

or, alternatively, using StringReplacePart:

frags1alt = 
 StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]

Finally (and more realistically), if I want to split before "D" provided that the residue immediately preceding it is not "P" [i.e., P-D (Pro-Asp) bonds are not cleaved], I do it as follows:

StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" :> x ~~ " D"]

Is there a more elegant way?

Speed is not necessarily an issue. I am unlikely to be dealing with strings of greater than, say, 500 characters. I am using Mma 7.
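The two requirements listed above (order preserved, no characters lost) can be checked mechanically. The following is a minimal sketch, assuming `str` and `frags1` as defined in the question; `roundTripQ` is a hypothetical helper name, not a built-in.

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";
frags1 = StringSplit@StringReplace[str, "D" -> " D"];

(* hypothetical helper: True when joining the fragments in order
   restores the input, i.e. no character was lost or moved *)
roundTripQ[s_String, frags_List] := StringJoin[frags] === s

roundTripQ[str, frags1]

(* every fragment after the first should begin with "D" *)
And @@ (StringMatchQ[#, "D" ~~ ___] & /@ Rest[frags1])
```

Both expressions should evaluate to True for this `str`. Note that the insert-a-space approach relies on the input containing no whitespace, because StringSplit with a single argument splits at whitespace.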

Update

I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.

The following imports a protein sequence (Bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". Suggestions for improvement are welcome.

Using a modification of sakra's method (a carriage return after '?db=' may need to be removed):

StringJoin /@ 
   Split[Characters[#], 
    And @@ Function[x, #1 != x] /@ {"R", "K"} || 
      Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
 StringJoin@
  Rest@Import[
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]

My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:

StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
 StringJoin@Rest@Import[...]

Output

{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}
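The cleavage rule assumed in the update can be tried without a network call. The sketch below applies the same two lookaround assertions to a short sequence; `toy` is invented purely for illustration and is not a real protein, and the assertions are written in the opposite order, which should not matter because both are zero-width tests at the same position.

```mathematica
(* toy sequence, invented for illustration; not a real protein *)
toy = "GAKTRRPGKPLRW";

(* cut after K or R, unless the next residue is P, K or R:
   (?<=[KR])  the previous character is K or R
   (?![PKR])  the next character is not P, K or R *)
tryptic = StringSplit[toy, RegularExpression["(?<=[KR])(?![PKR])"]]

(* no residues are lost by the digest *)
StringJoin[tryptic] === toy
```

For this toy input the only allowed cuts are after the first K and after the final R, so the result should be {"GAK", "TRRPGKPLR", "W"}: the K-T and R-W bonds are cut, while the R-R, R-P and K-P bonds are left intact.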

Answer 1:

Here are some alternate solutions:

Splitting by any occurrence of "D":

In[18]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" &]
Out[18]= {"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}

Splitting by any occurrence of "D" provided it is not preceded by "P":

In[19]:= StringJoin /@ Split[Characters["MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"], #2!="D" || #1=="P" &]
Out[19]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
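Split[list, test] keeps two adjacent elements in the same run whenever test gives True for that pair, so the predicate describes where not to cut. A minimal sketch that names the predicate to make this explicit; `sameRunQ` and `splitFrags` are hypothetical names introduced here:

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";

(* hypothetical helper: True when NO cut belongs between prev and next,
   i.e. next is not "D", or the bond is P-D *)
sameRunQ[prev_String, next_String] := next =!= "D" || prev === "P"

splitFrags = StringJoin /@ Split[Characters[str], sameRunQ]

(* the fragments rejoin to the original string *)
StringJoin[splitFrags] === str
```

With the `str` from the question, `splitFrags` should match Out[19] above.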


Answer 2:

I cannot come up with anything much simpler than your code. Here is a regex version, which you might like:

In[281]:= StringSplit@
 StringReplace[str, RegularExpression["(?<!P)D"] -> " D"]

Out[281]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}

It uses a negative lookbehind pattern, borrowed from this site.


EDIT: Adding WReach's cool solution:

In[2]:= StringSplit[str, RegularExpression["(?<!P)(?=D)"]]

Out[2]= {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", \
"DYFRYLSEVASG", "DN"}
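The pattern above works because both assertions are zero-width: `(?=D)` matches the empty position just before a "D", and `(?<!P)` rejects that position when the preceding character is "P". Since the delimiter consumes no characters, nothing is lost in the split. A sketch of how this might be wrapped for other residues; `cutBefore` is a hypothetical helper, and it assumes its arguments are plain letters with no regex metacharacters:

```mathematica
(* hypothetical helper: cut s before every occurrence of the single
   character res, except where the preceding character is one of the
   characters in blockers (assumes plain letters, no regex escaping) *)
cutBefore[s_String, res_String, blockers_String : ""] :=
 StringSplit[s,
  RegularExpression[
   If[blockers === "", "", "(?<![" <> blockers <> "])"] <>
    "(?=" <> res <> ")"]]

str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";

cutBefore[str, "D"]       (* cut before every D *)
cutBefore[str, "D", "P"]  (* leave P-D bonds intact *)

(* either way, the fragments rejoin to the original string *)
StringJoin[cutBefore[str, "D", "P"]] === str
```

The second call builds a regex equivalent to the one above, so it should reproduce Out[2].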


Answer 3:

Your first solution isn't that bad, is it? Everything that I can think of is longer or uglier than that. Is the problem that there might be spaces in the original string?

StringCases[str, "D" | StartOfString ~~ Longest[Except["D"] ..]]

or

Prepend["D" <> # & /@ Rest[StringSplit[str, "D"]], First[StringSplit[str, "D"]]]