On reading this question, I thought the following problem would be simple using StringSplit.
Given the following string, I want to 'cut' it to the left of every "D" such that:
I get a List of fragments (with sequence unchanged)
StringJoin@fragments gives back the original string (but it does not matter if I have to reorder the fragments to obtain this). That is, the sequence within each fragment is important, and I do not want to lose any characters.
(The example I am interested in is a protein sequence (string) where each character represents an amino acid in one-letter code. I want to obtain the theoretical list of ALL fragments obtained by treating with an enzyme known to split before "D")
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN"
The best I can come up with is to insert a space before each "D" using StringReplace and then use StringSplit. This seems quite awkward, to say the least.
frags1 = StringSplit@StringReplace[str, "D" -> " D"]
giving as output:
{"MTP", "DKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"}
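The two requirements stated at the top (nothing lost, and every cut falling immediately before a "D") can be checked directly on this result. The following is only a minimal sketch; it repeats the definitions of str and frags1 so that it runs on its own, and both checks should evaluate to True for this input.

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";
frags1 = StringSplit@StringReplace[str, "D" -> " D"];

(* Requirement: joining the fragments in order restores the original string *)
StringJoin[frags1] === str

(* Every fragment after the first should begin with "D" *)
And @@ (StringMatchQ[#, "D" ~~ ___] & /@ Rest[frags1])
```

Note that the space trick relies on the sequence itself containing no whitespace, which holds for one-letter amino acid codes.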
or, alternatively, using StringReplacePart:
frags1alt =
StringSplit@StringReplacePart[str, " D", StringPosition[str, "D"]]
Finally (and more realistically), if I want to split before "D" only when the residue immediately preceding it is not "P" [i.e., P-D (Pro-Asp) bonds are not cleaved], I do it as follows:
StringSplit@StringReplace[str, (x_ /; x != "P") ~~ "D" -> x ~~ " D"]
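For comparison, the same rule can be stated as a pairwise test on adjacent characters, in the style of the Split approach used in the update further down. This is a sketch rather than a recommended solution; keepTogether and fragsPD are names made up for illustration, and keepTogether returns True when the bond between two neighbouring residues should not be cut.

```mathematica
str = "MTPDKPSQYDKIEAELQDICNDVLELLDSKGDYFRYLSEVASGDN";

(* keepTogether is a hypothetical helper: True means "do not cut between a and b".
   A cut happens only when b is "D" and a is not "P". *)
keepTogether[a_String, b_String] := b =!= "D" || a === "P"

(* Split groups adjacent characters for which the test is True *)
fragsPD = StringJoin /@ Split[Characters[str], keepTogether]
(* {"MTPDKPSQY", "DKIEAELQ", "DICN", "DVLELL", "DSKG", "DYFRYLSEVASG", "DN"} *)

(* No characters are lost *)
StringJoin[fragsPD] === str
```

The first fragment is now "MTPDKPSQY" because the first "D" follows a "P" and is therefore left uncleaved.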
Is there a more elegant way?
Speed is not necessarily an issue. I am unlikely to be dealing with strings longer than, say, 500 characters. I am using Mathematica 7.
Update
I have added the bioinformatics tag, and I thought it might be of interest to add an example from that field.
The following imports a protein sequence (bovine serum albumin, accession number 3336842) from the NCBI database using eutils and then generates a (theoretical) trypsin digest. I have assumed that the enzyme trypsin cleaves between residues A1-A2 when A1 is either "R" or "K", provided that A2 is not "R", "K" or "P". If anyone has suggestions for improvements, please feel free to propose modifications.
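As a small offline illustration of the cleavage rule just stated, the sketch below applies it to a short made-up peptide, so no network access is needed. The string toy and the helper uncleaved are assumptions introduced only for this example; they are not part of the database sequence or of any library.

```mathematica
(* toy is a made-up peptide for illustration, not a database sequence *)
toy = "MKWVTFRRDKPAKG";

(* uncleaved is a hypothetical helper: True means the pair a-b stays in one fragment.
   A cut happens only when a is "R" or "K" and b is none of "R", "K", "P". *)
uncleaved[a_String, b_String] :=
  ! MemberQ[{"R", "K"}, a] || MemberQ[{"R", "K", "P"}, b]

toyFrags = StringJoin /@ Split[Characters[toy], uncleaved]
(* {"MK", "WVTFRR", "DKPAK", "G"} *)
```

Here K-W, R-D and K-G are cut, while R-R and K-P are left intact because the second residue is one of the excluded ones.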
Using a modification of sakra's method (a carriage return after '?db=' may need to be removed):
StringJoin /@
Split[Characters[#],
And @@ Function[x, #1 != x] /@ {"R", "K"} ||
Or @@ Function[xx, #2 == xx] /@ {"R", "K", "P"} &] & @
StringJoin@
Rest@Import[
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=\
protein&id=3336842&rettype=fasta&retmode=text", "Data"]
My possibly ham-fisted attempt at using the regex method (Sasha/WReach) to do the same thing:
StringSplit[#, RegularExpression["(?![PKR])(?<=[KR])"]] &@
StringJoin@Rest@Import[...]
Output
{MK,WVTFISLLLLFSSAYSR,GVFRR,<<69>>,CCAADDK,EACFAVEGPK,LVVSTQTALA}