Remove all text before colon

Posted 2019-01-08 09:54


I have a file containing a certain number of lines. Each line looks like this:

TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1

I would like to remove everything before the ":" character so that only PKMYT1, which is a gene name, remains. Since I'm not an expert in regex scripting, can anyone help me do this with Unix tools (sed or awk) or in R?


Here are two ways of doing it in R:

foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"

# Remove all before and up to ":":
gsub(".*:", "", foo)

# Extract everything behind ":":
regmatches(foo, gregexpr("(?<=:).*", foo, perl = TRUE))


A simple regular expression used with gsub():

x <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"
gsub(".*:", "", x)

See ?regex or ?gsub for more help.


There are certainly more than 2 ways in R. Here's another.

unlist(lapply(strsplit(foo, ':', fixed = TRUE), '[', 2))

If the string has a constant length I imagine substr would be faster than this or regex methods.


Using sed:

sed 's/.*://' < your_input_file > output_file

This will replace anything followed by a colon with nothing, so it'll remove everything up to and including the last colon on each line (because * is greedy by default).

As per Josh O'Brien's comment, if you wanted to only replace up to and including the first colon, do this:

sed "s/[^:]*://"

That will match anything that isn't a colon, followed by one colon, and replace with nothing.
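The difference between the two patterns only shows up on lines with more than one colon. A minimal sketch, assuming a made-up sample line (it is not from the original file):

```shell
# Hypothetical sample line containing several colons
line='FG:Content:with:colons'

# Greedy .* : removes everything through the LAST colon
last=$(printf '%s\n' "$line" | sed 's/.*://')

# Negated class [^:]* : removes everything through the FIRST colon only
first=$(printf '%s\n' "$line" | sed 's/[^:]*://')

printf 'greedy:      %s\n' "$last"    # colons
printf 'first colon: %s\n' "$first"   # Content:with:colons
```

On the line from the question, which has a single colon, both commands give PKMYT1.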

Note that both of these commands stop after the first match on each line. If you want the replacement to happen for every match on a line, add the 'g' (global) flag to the end of the substitution, e.g. s/[^:]*://g.
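To see what the global flag changes, here is a small sketch on a made-up line (the sample text is an assumption, not from the question). With the first-colon pattern, g keeps removing "field plus colon" chunks until no colon is left:

```shell
# Hypothetical sample line
line='a:b:c'

# Without g: only the first "a:" is removed
once=$(printf '%s\n' "$line" | sed 's/[^:]*://')

# With g: "a:" and then "b:" are removed
every=$(printf '%s\n' "$line" | sed 's/[^:]*://g')

printf 'once:  %s\n' "$once"    # b:c
printf 'every: %s\n' "$every"   # c
```

For the greedy pattern s/.*://, the flag makes no practical difference, because the first match already runs through the last colon on the line.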

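Also note that GNU sed (the default on Linux) can edit a file in place with -i. The BSD sed shipped with OS X also has -i, but it requires an explicit backup-suffix argument (use -i '' for no backup). With GNU sed, e.g.: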

sed -i 's/.*://' your_file
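In-place editing is not part of POSIX sed, and the -i syntax differs between implementations. One form accepted by both GNU and BSD sed is to attach a backup suffix directly to -i. A sketch on a throwaway temporary file (the file is created here purely for illustration):

```shell
# Create a scratch file holding the sample line from the question
tmp=$(mktemp)
printf '%s\n' 'TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1' > "$tmp"

# Edit in place; the original content is kept in "$tmp.bak"
sed -i.bak 's/.*://' "$tmp"

result=$(cat "$tmp")       # edited content
backup=$(cat "$tmp.bak")   # untouched original

printf 'edited: %s\n' "$result"
printf 'backup: %s\n' "$backup"

# Clean up the scratch files
rm -f "$tmp" "$tmp.bak"
```

Delete the .bak file once you have checked the result, or keep it as a safety net.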


You can use awk like this:

awk -F: '{print $2}' /your/file
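Note that $2 is only the second colon-separated field, so on a line with several colons it is not the same as "everything after the first colon". A sketch of the alternatives, assuming a made-up sample line:

```shell
# Hypothetical sample line containing several colons
line='FG:Content:with:colons'

# Second field only
second=$(printf '%s\n' "$line" | awk -F: '{ print $2 }')

# Everything after the first colon: delete the leading "field:" and print the rest
rest=$(printf '%s\n' "$line" | awk '{ sub(/^[^:]*:/, ""); print }')

# Last field only (same result as the greedy sed pattern)
lastf=$(printf '%s\n' "$line" | awk -F: '{ print $NF }')

printf '%s\n' "$second"   # Content
printf '%s\n' "$rest"     # Content:with:colons
printf '%s\n' "$lastf"    # colons
```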


cut is another option (it is specified by POSIX; on Linux it comes with GNU coreutils):

cut -d: -f2 infile
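As with awk, -f2 selects only the second field. A short sketch of two details that may matter on real data, using made-up sample lines: -f2- keeps everything from the second field onward, and lines without the delimiter are passed through unchanged unless -s is given.

```shell
# Hypothetical sample line containing several colons
line='FG:Content:with:colons'

second=$(printf '%s\n' "$line" | cut -d: -f2)    # second field only
rest=$(printf '%s\n' "$line" | cut -d: -f2-)     # second field to end of line

# A line with no colon at all is printed as-is...
nodelim=$(printf '%s\n' 'no_colon_here' | cut -d: -f2-)
# ...unless -s is used to suppress lines that lack the delimiter
suppressed=$(printf '%s\n' 'no_colon_here' | cut -s -d: -f2-)

printf '%s\n' "$second"    # Content
printf '%s\n' "$rest"      # Content:with:colons
printf '%s\n' "$nodelim"   # no_colon_here
```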


I was working on a similar issue. John's and Josh O'Brien's advice did the trick. I started with this tibble:

library(dplyr)  # provides tibble() and mutate()
my_tibble <- tibble(Col1 = c("ABC:Content", "BCDE:MoreContent", "FG:Content:with:colons"))

It looks like:

  | Col1 
1 | ABC:Content 
2 | BCDE:MoreContent 
3 | FG:Content:with:colons

I needed to create this tibble:

  | Col1                  | Col2 | Col3 
1 | ABC:Content           | ABC  | Content 
2 | BCDE:MoreContent      | BCDE | MoreContent 
3 | FG:Content:with:colons| FG   | Content:with:colons

And did so with this code (R version 3.4.2).

my_tibble2 <- mutate(
  my_tibble,
  Col2 = unlist(lapply(strsplit(Col1, ':', fixed = TRUE), '[', 1)),  # text before the first colon
  Col3 = gsub("^[^:]*:", "", Col1)                                   # text after the first colon
)


Below are two Perl solutions. They give the same result when each line contains exactly one colon:

The first uses perl's -a autosplit feature to split each line into fields on :, populate the @F array, and print the second field, $F[1] (fields are counted from 0):

perl -F: -lane 'print $F[1]' file

The second uses a substitution, s///, that matches from ^ (the beginning of the line) through .*: (any characters up to and including the last colon) and replaces the match with nothing:

perl -pe 's/^.*://' file


A very simple variation on the best answer, by @Sacha Epskamp, that I had missed is to use the sub function, in this case to keep everything before the ":" instead of removing it:

foo <- "TF_list_to_test10004/Nus_k0.345_t0.1_e0.1.adj:PKMYT1"

# 1st, as in that answer, to remove everything before and up to ":":
sub(".*:", "", foo)

# 2nd, to keep everything before ":":
sub(":.*", "", foo)

It is basically the same thing; just change the position of the ":" inside the pattern passed to sub. Hope it helps.