I have this vector `myvec`. I want to remove everything after the second ':' and get the result shown below. How do I remove the string after the nth ':'?
myvec <- c("chr2:213403244:213403244:G:T:snp", "chr7:55240586:55240586:T:G:snp", "chr7:55241607:55241607:C:G:snp")
result
chr2:213403244
chr7:55240586
chr7:55241607
Here are a few alternatives. We delete the kth colon and everything after it. The example in the question would correspond to k = 2. In the examples below we use k = 3.
1) read.table Read the data into a data.frame, pick out the desired columns and paste them back together again.
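A minimal sketch of approach (1), assuming the `myvec` from the question and k = 3. The names `k`, `DF` and `res1` are illustrative, and `colClasses = "character"` is an added precaution so that fields such as `T` are not converted to logical values:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 3  # number of leading fields to keep

# Split each string into columns on ":" and keep every field as text
DF <- read.table(text = myvec, sep = ":", colClasses = "character")

# Paste the first k columns back together, row by row
res1 <- do.call(paste, c(DF[1:k], sep = ":"))
res1
# [1] "chr2:213403244:213403244" "chr7:55240586:55240586"
# [3] "chr7:55241607:55241607"
```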
2) sprintf/sub Construct the appropriate regular expression (for k equal to 3 it would be `^((.*?:){2}.*?):.*`) and use it with `sub`.
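A sketch of approach (2), assuming the `myvec` from the question and k = 3. The names `pat` and `res2` are illustrative, and `perl = TRUE` is an added choice here so that the lazy quantifier `.*?` is handled by PCRE:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 3

# Build "^((.*?:){2}.*?):.*" for k = 3: k - 1 fields with their colons,
# then one more field, all captured; the kth colon and the rest are dropped
pat <- sprintf("^((.*?:){%d}.*?):.*", k - 1)

res2 <- sub(pat, "\\1", myvec, perl = TRUE)
res2
# [1] "chr2:213403244:213403244" "chr7:55240586:55240586"
# [3] "chr7:55241607:55241607"
```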
Note 1: For k=1 this can be further simplified to `sub(":.*", "", myvec)` and for k=n-1 it can be further simplified to `sub(":[^:]*$", "", myvec)`.
Note 2: Here is a visualization of the regular expression for k equal to 3: Debuggex Demo
3) iteratively delete last field We could remove the last field n-k times using the last regular expression in Note 1 above. Here n is hard coded; if we wanted to set n automatically, we could compute it from the data instead.
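A sketch of approach (3), assuming the `myvec` from the question and k = 3. Computing `n` from the first element with `strsplit` is one possible way to set it automatically and assumes every element has the same number of fields; `res3` is an illustrative name:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 3

# Number of fields, taken from the first element (assumes all elements agree)
n <- length(strsplit(myvec[1], ":", fixed = TRUE)[[1]])

# Strip the last ":field" n - k times
res3 <- myvec
for (i in seq_len(n - k)) {
  res3 <- sub(":[^:]*$", "", res3)
}
res3
# [1] "chr2:213403244:213403244" "chr7:55240586:55240586"
# [3] "chr7:55241607:55241607"
```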
4) locate position of kth colon Locate the positions of the colons using `gregexpr` and then extract the location of the kth, subtracting one from it since we don't want the trailing colon. Use `substr` to extract that many characters from the respective strings.
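A sketch of approach (4), assuming the `myvec` from the question and k = 3. The names `pos` and `res4` are illustrative, and `fixed = TRUE` is an added choice since the delimiter is a literal character:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 3

# For each string, take the position of the kth colon and step back one
pos <- sapply(gregexpr(":", myvec, fixed = TRUE), function(p) p[k] - 1L)

# Keep the characters up to (not including) the kth colon
res4 <- substr(myvec, 1, pos)
res4
# [1] "chr2:213403244:213403244" "chr7:55240586:55240586"
# [3] "chr7:55241607:55241607"
```

If a string has fewer than k colons, `p[k]` is `NA` and so is the result for that element.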
Note 3: Suppose there are n fields. The question asked to delete everything after the kth delimiter, so the solution should work for k = 1, 2, ..., n-1. It need not work for k = n since there are not n delimiters; however, if instead we define k as the number of fields to return then k = n makes sense and, in fact, (1) and (3) work in that case too. (2) and (4) do not work for this extension but we can easily get them to work by using `paste0(myvec, ":")` as the input instead of `myvec`.
Note 4: Comparing the performance of these solutions, the one using sprintf and sub is the fastest, although it does use a complex regular expression whereas the others use simpler or no regular expressions and might be preferred on grounds of simplicity.
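A rough way to rerun such a comparison with base R only is sketched below. The wrapper names (`f_read`, `f_sub`, `f_pos`), the replicated input `big` and the use of `system.time` are assumptions made for illustration, not the original benchmark. Timings depend on the machine and R version, so none are shown:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
k <- 3

# Hypothetical wrappers around approaches (1), (2) and (4)
f_read <- function(x) {
  DF <- read.table(text = x, sep = ":", colClasses = "character")
  do.call(paste, c(DF[1:k], sep = ":"))
}
f_sub <- function(x) {
  sub(sprintf("^((.*?:){%d}.*?):.*", k - 1), "\\1", x, perl = TRUE)
}
f_pos <- function(x) {
  substr(x, 1, sapply(gregexpr(":", x, fixed = TRUE), function(p) p[k] - 1L))
}

# Larger input so that elapsed times are measurable
big <- rep(myvec, 1000)

# Elapsed seconds for one run of each approach
timings <- sapply(list(read.table = f_read, sub = f_sub, gregexpr = f_pos),
                  function(f) system.time(f(big))[["elapsed"]])
timings
```

A single `system.time` call is noisy; repeating each run several times gives a steadier picture.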
ADDED Added additional solutions and additional notes.
We can use `sub`. We match one or more characters that are not a `:` from the start of the string (`^([^:]+`), followed by a `:`, followed by one or more characters that are not a `:` (`[^:]+`), and place all of this in a capture group, i.e. within the parentheses. We then use the capture group (`\\1`) as the replacement.
The above works for the example posted. The same idea generalizes to removing everything after the nth delimiter.
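A sketch of both steps, assuming the `myvec` from the question. The pattern for the general case is one possible construction (repeat the `:field` part n - 1 times), and the names `res_two`, `n`, `pat` and `res_n` are illustrative:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")

# Specific case: keep the first two fields
res_two <- sub("^([^:]+:[^:]+).*", "\\1", myvec)
res_two
# [1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"

# General case: keep the first n fields by repeating ":field" n - 1 times
n <- 2
pat <- sprintf("^([^:]+(:[^:]+){%d}).*", n - 1)
res_n <- sub(pat, "\\1", myvec)
res_n
# [1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"
```

Because the pattern uses `[^:]+`, it assumes that the leading fields are non-empty.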
The same steps can be repeated to check with a different 'n'.
Or another option would be to split by `:` and then `paste` the first n components together.
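A sketch of the split-and-paste option, assuming the `myvec` from the question and n = 2. The name `res_split` and the guard against strings with fewer than n fields are illustrative additions:

```r
myvec <- c("chr2:213403244:213403244:G:T:snp",
           "chr7:55240586:55240586:T:G:snp",
           "chr7:55241607:55241607:C:G:snp")
n <- 2

# Split on the literal ":" and glue the first n pieces back together
res_split <- vapply(
  strsplit(myvec, ":", fixed = TRUE),
  function(x) paste(x[seq_len(min(n, length(x)))], collapse = ":"),
  character(1)
)
res_split
# [1] "chr2:213403244" "chr7:55240586"  "chr7:55241607"
```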