I would like to split a string on multiple delimiters, but keep the delimiters in the resulting list. I think this is a useful thing to do as an initial step of parsing any kind of formula, and I suspect there is a nice Python solution.
Someone asked a similar question in Java here.
For example, a typical split looks like this:
>>> s='(twoplusthree)plusfour'
>>> s.split('plus')
['(two', 'three)', 'four']
But I'm looking for a nice way to add the plus back in (or retain it):
['(two', 'plus', 'three)', 'plus', 'four']
Ultimately I'd like to do this for each operator and bracket, so if there's a way to get
['(', 'two', 'plus', 'three', ')', 'plus', 'four']
all in one go, then all the better.
You can do that with Python's re module.
import re
s='(twoplusthree)plusfour'
list(filter(None, re.split(r"(plus|[()])", s)))
In Python 3 you can leave out the list() call if you only need an iterator.
import re
s = '(twoplusthree)plusfour'
l = re.split(r"(plus|\(|\))", s)
a = [x for x in l if x != '']
print(a)
output:
['(', 'two', 'plus', 'three', ')', 'plus', 'four']
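If the set of delimiters is not fixed, the same pattern can be built from a list. This is only a sketch: split_keep is a made-up helper name, it assumes the delimiter list is non-empty, and it tries longer delimiters first so that something like '**' wins over '*'.

```python
import re

def split_keep(text, delimiters):
    # Hypothetical helper: try longer delimiters first ("**" before "*").
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape lets characters such as "(" or "*" be used literally.
    pattern = "(" + "|".join(re.escape(d) for d in ordered) + ")"
    # The capturing group makes re.split return the delimiters as well;
    # the filter drops empty strings produced by adjacent delimiters.
    return [tok for tok in re.split(pattern, text) if tok]

print(split_keep('(twoplusthree)plusfour', ['plus', '(', ')']))
```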
Here is an easy way using re.split:
import re
s = '(twoplusthree)plusfour'
re.split('(plus)', s)
Output:
['(two', 'plus', 'three)', 'plus', 'four']
re.split is very similar to str.split, except that instead of a literal delimiter you pass a regex pattern. The trick here is to put () around the pattern so that it is captured as a group; re.split then includes the captured text in the result.
Bear in mind that you'll get empty strings in the result if the delimiter pattern occurs twice in a row, or at the very start or end of the string.
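A small example of the empty-string case, with one way to filter them out. The input string is made up for illustration.

```python
import re

s = 'plusplusfour'

# Two delimiters in a row, the first at the very start of the string.
parts = re.split('(plus)', s)
print(parts)  # ['', 'plus', '', 'plus', 'four']

# Drop the empty strings if they are not wanted.
cleaned = [p for p in parts if p]
print(cleaned)  # ['plus', 'plus', 'four']
```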
This thread is old, but since it's the top Google result I thought I'd add this:
If you don't want to use regex, there is a simpler way to do it: call split, then put the separator back on every token except the last.
def split_keep_deli(string_to_split, deli):
    # Split normally, then re-attach the delimiter to every token but the last.
    result_list = []
    tokens = string_to_split.split(deli)
    for i in range(len(tokens) - 1):
        result_list.append(tokens[i] + deli)
    result_list.append(tokens[-1])
    return result_list
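Note that this approach glues the delimiter onto the end of the preceding token rather than returning it as its own item. If separate items are wanted (as in the question) and regex is still to be avoided, something along these lines should work with str.partition. split_keep_separate is a hypothetical name, and it assumes a non-empty delimiter, since partition raises ValueError on an empty one.

```python
def split_keep_separate(text, delimiter):
    """Split text on delimiter, keeping each delimiter as its own item."""
    result = []
    while True:
        # partition returns (before, delimiter, after); the middle part
        # is '' when the delimiter is not found.
        head, sep, text = text.partition(delimiter)
        if head:
            result.append(head)
        if not sep:
            break
        result.append(sep)
    return result

print(split_keep_separate('(twoplusthree)plusfour', 'plus'))
```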
Here I'm splitting a string on the first occurrence of an alphabetic character:
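import re

def split_on_first_alpha(i):
    # Example input: i = "3.5 This is one of the way"
    # Assumes i contains at least one letter; otherwise the indexing below fails.
    # Split once at the first letter (the letter itself is consumed by the split).
    split_1 = re.split(r'[a-z]', i, maxsplit=1, flags=re.IGNORECASE)
    # Find that first letter again and put it back on the second part.
    find_starting = re.findall(r'[a-z]', i, flags=re.IGNORECASE)
    split_1[1] = find_starting[0] + split_1[1]
    return split_1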
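A possible alternative that scans the string only once and does not fail when there is no letter at all: find the position with re.search and slice. split_before_first_alpha is a made-up name, and returning the whole string as a single item in the no-letter case is my own choice, not part of the answer above.

```python
import re

def split_before_first_alpha(text):
    # Locate the first ASCII letter, ignoring case.
    match = re.search(r'[a-z]', text, flags=re.IGNORECASE)
    if match is None:
        # No letter found: return the input unchanged as a single item.
        return [text]
    # Slice so that the letter stays at the start of the second part.
    return [text[:match.start()], text[match.start():]]

print(split_before_first_alpha("3.5 This is one of the way"))
```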