Replace single instances of a character that is so

Posted 2019-06-14 22:17

Question:

I have a string with each character being separated by a pipe character (including the "|"s themselves), for example:

"f|u|n|n|y||b|o|y||a||c|a|t"

I would like to replace all "|"s which are not next to another "|" with nothing, to get the result:

"funny|boy|a|cat"

I tried using mytext.replace("|", ""), but that removes everything and makes one long word.

Answer 1:

This can be achieved with a relatively simple regex without having to chain str.replace:

>>> import re
>>> s = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> re.sub(r'\|(?!\|)', '', s)
'funny|boy|a|cat'

Explanation: \|(?!\|) will look for a | character which is not followed by another | character. (?!foo) means negative lookahead, ensuring that whatever you are matching is not followed by foo.
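One edge case worth checking is a pipe at the very end of the string: nothing follows it, so the negative lookahead succeeds and that pipe is removed too. A minimal sketch, assuming the pattern from this answer; the helper name `remove_single_pipes` is made up for illustration:

```python
import re

# Hypothetical helper name; the pattern is the one used in this answer.
SINGLE_PIPE = re.compile(r'\|(?!\|)')

def remove_single_pipes(text):
    """Drop every '|' that is not immediately followed by another '|'."""
    return SINGLE_PIPE.sub('', text)

print(remove_single_pipes("f|u|n|n|y||b|o|y||a||c|a|t"))   # funny|boy|a|cat
print(remove_single_pipes("c|a|t|"))                       # cat
```

Whether a trailing pipe should be dropped depends on your data, so treat this as something to test rather than a guarantee.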



Answer 2:

Use sentinel values

Replace each || with a sentinel such as ~ so that the double pipes are remembered. Then remove the remaining |s, and finally turn each ~ back into |.

>>> s = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> s.replace('||','~').replace('|','').replace('~','|')
'funny|boy|a|cat'

A better way is to exploit the fact that the wanted characters and the separators almost alternate. Turning each || into ||| makes the alternation exact, so taking every second character gives the result:

>>> s.replace('||','|||')[::2]
'funny|boy|a|cat'
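To see why the slice works, it can help to print the intermediate string. The sketch below uses the question's input; the variable name `padded` is only for illustration:

```python
s = "f|u|n|n|y||b|o|y||a||c|a|t"

# Padding each '||' to '|||' makes the string strictly alternate:
# wanted characters sit at even indices, throwaway pipes at odd indices.
padded = s.replace('||', '|||')
print(padded)         # f|u|n|n|y|||b|o|y|||a|||c|a|t
print(padded[::2])    # funny|boy|a|cat
print(padded[1::2])   # the discarded characters, all of them pipes
```

This relies on the input following the alternating layout exactly; a string that breaks the pattern would be sliced in the wrong places.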


Answer 3:

You could replace the double pipe by something else first to make sure that you can still recognize them after removing the single pipes. And then you replace those back to a pipe:

>>> t = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> t.replace('||', '|-|').replace('|', '').replace('-', '|')
'funny|boy|a|cat'

Choose a temporary replacement value that does not naturally appear in your text. Otherwise you will run into conflicts where that character is replaced even though it was not part of a double pipe originally. So don’t use a dash as above if your text may contain a dash. You can also use multiple characters at once, for example: '<THIS IS A TEMPORARY PIPE>'.
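The conflict is easy to reproduce. The sketch below is an illustration only: the input string, the helper name `remove_single_pipes`, and the choice of '\x00' as a sentinel are assumptions, not part of the answer.

```python
def remove_single_pipes(text, sentinel):
    """Sentinel approach; refuses to run if the sentinel could collide with the input."""
    if sentinel in text or '|' in sentinel:
        raise ValueError("sentinel collides with the input")
    return text.replace('||', sentinel).replace('|', '').replace(sentinel, '|')

# Hypothetical input that contains a real dash.
t = "w|e|l|l|-|k|n|o|w|n||c|a|t"

# Dash as the temporary value: the real dash is turned into a pipe as well.
naive = t.replace('||', '|-|').replace('|', '').replace('-', '|')
print(naive)                             # well|known|cat

# A character that does not occur in the input leaves the dash intact.
print(remove_single_pipes(t, '\x00'))    # well-known|cat
```

Checking for the sentinel up front turns a silent corruption into an explicit error, which is usually easier to debug.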

If you want to avoid this conflict completely, you could also solve this in an entirely different way. For example, you could split the string at the double pipes first, remove the single pipes from each substring, and finally join the substrings back together:

>>> '|'.join([s.replace('|', '') for s in t.split('||')])
'funny|boy|a|cat'

And of course, you could also use regular expressions to replace those pipes that are not followed by another pipe:

>>> import re
>>> re.sub(r'\|(?!\|)', '', t)
'funny|boy|a|cat'


Answer 4:

You can use a positive lookahead regex to replace the pipes that are followed by a lowercase letter or by the end of the string:

>>> import re
>>> st = "f|u|n|n|y||b|o|y||a||c|a|t" 
>>> re.sub(r'\|(?=[a-z]|$)',r'',st)
'funny|boy|a|cat'
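Note that the character class [a-z] ties this pattern to lowercase letters. A small sketch of the difference, using a made-up input with uppercase letters and digits and comparing against the negative-lookahead pattern from the other answers:

```python
import re

lookahead_az = re.compile(r'\|(?=[a-z]|$)')   # pattern from this answer
not_followed = re.compile(r'\|(?!\|)')        # negative-lookahead variant

mixed = "R|2||D|2"   # hypothetical input, not from the question
print(lookahead_az.sub('', mixed))   # R|2||D|2  (no pipe is followed by a-z)
print(not_followed.sub('', mixed))   # R2|D2
```

If the text can contain anything other than lowercase letters, the class would need widening, or the negative lookahead may be the simpler choice.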


Answer 5:

Use regular expressions.

import re

line = "f|u|n|n|y||b|o|y||a||c|a|t" 
line = re.sub(r"(?!\|\|)(\|)", "", line)

print(line)

Output:

funny|boy|a|cat


Answer 6:

Another regex option, using a capturing group:

>>> import re
>>> re.sub(r'\|(\|?)', r'\1', "f|u|n|n|y||b|o|y||a||c|a|t")
'funny|boy|a|cat'

Explanation:

\| matches a pipe character. (\|?) captures the pipe that follows it, if there is one. Replacing the match with \1 substitutes the contents of the first capturing group: for a single pipe that is an empty string, and for || it is the second pipe character.
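The effect of the capturing group can be inspected match by match with re.finditer. This is only a sketch on a shortened input; the variable names are illustrative:

```python
import re

s = "y||b|o"
matches = [(m.span(), m.group(0), m.group(1)) for m in re.finditer(r'\|(\|?)', s)]
for span, whole, captured in matches:
    print(span, repr(whole), '->', repr(captured))
# (1, 3) '||' -> '|'
# (4, 5) '|' -> ''

print(re.sub(r'\|(\|?)', r'\1', s))   # y|bo
```

Each match is replaced by whatever group 1 captured, which is why a double pipe shrinks to one pipe and a single pipe disappears.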

Another trick through word and non-word boundaries...

>>> re.sub(r'\b\|\b|\b\|\B', '', "f|u|n|n|y||b|o|y||a||c|a|t|")
'funny|boy|a|cat'

Yet another one using negative lookbehind..

>>> re.sub(r'(?<!\|)\|', '', "f|u|n|n|y||b|o|y||a||c|a|t|")
'funny|boy|a|cat'

Bonus...

>>> re.sub(r'\|(\|)|\|', lambda m: m.group(1) if m.group(1) else '', "f|u|n|n|y||b|o|y||a||c|a|t")
'funny|boy|a|cat'


Answer 7:

If you are going to use a regex, the fastest method is to split and join:

In [18]: r = re.compile(r"\|(?!\|)")

In [19]: timeit "".join(r.split(s))
100000 loops, best of 3: 2.65 µs per loop
In [20]:  "".join(r.split(s))
Out[20]: 'funny|boy|a|cat'
In [30]: r1 = re.compile(r'\|(?!\|)')

In [31]: timeit r1.sub("", s)
100000 loops, best of 3: 3.20 µs per loop

In [33]: r2 = re.compile(r"(?!\|\|)(\|)")
In [34]: timeit r2.sub("",s)
100000 loops, best of 3: 3.96 µs per loop

The str.split and str.replace methods are still faster:

In [38]: timeit '|'.join([ch.replace('|', '') for ch in s.split('||')])
The slowest run took 11.18 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 1.71 µs per loop

In [39]: timeit s.replace('||','|||')[::2]
1000000 loops, best of 3: 536 ns per loop

In [40]: timeit s.replace('||','~').replace('|','').replace('~','|')
1000000 loops, best of 3: 881 ns per loop

Which str.replace approach is safe depends on what characters can appear in the string, but the str.split method will work no matter what characters are in the string.
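The timings above come from an IPython session and will differ between machines and Python versions. A rough way to rerun the comparison with the standard timeit module is sketched below; the dictionary labels and the iteration count are arbitrary choices, and the lambda wrappers add a small constant overhead to every measurement.

```python
import re
import timeit

s = "f|u|n|n|y||b|o|y||a||c|a|t"
r = re.compile(r'\|(?!\|)')

# Labels are illustrative; each callable is one approach from this thread.
candidates = {
    'regex split/join': lambda: ''.join(r.split(s)),
    'regex sub': lambda: r.sub('', s),
    'str split/join': lambda: '|'.join([ch.replace('|', '') for ch in s.split('||')]),
    'pad and slice': lambda: s.replace('||', '|||')[::2],
    'sentinel': lambda: s.replace('||', '~').replace('|', '').replace('~', '|'),
}

# Check that every approach agrees before comparing speed.
results = {name: fn() for name, fn in candidates.items()}
assert set(results.values()) == {'funny|boy|a|cat'}

number = 20_000
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=number)
    print(f'{name:18s} {seconds / number * 1e9:8.0f} ns per call')
```

Comparing the approaches on your own typical inputs is more informative than the absolute numbers, since relative speed can shift with string length.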