I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should be
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only works with one argument, so after I split on whitespace every word still has its punctuation attached. Any ideas?
Use replace two times:
results in:
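As a rough illustration of chaining replace() twice, here is a minimal sketch. The two separators (',' and '-') and the sample sentence from the question are assumptions made for this sketch; each further separator would need another replace() call.

```python
text = "Hey, you - what are you doing here!?"

# Assumption: only ',' and '-' are treated as separators in this sketch.
# Each one is turned into a space so that a plain split() can do the rest.
cleaned = text.replace(",", " ").replace("-", " ")
words = cleaned.split()

print(words)
```

Note that '!' and '?' were not replaced here, so they stay attached to the last word.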
So many answers, yet I can't find any solution that efficiently does what the title of the question literally asks for: splitting on multiple possible separators. (Many answers instead remove anything that is not a word, which is different.) So here is an answer to the question in the title that relies on Python's standard and efficient re module:
where:
[…] matches one of the separators listed inside,
\- in the regular expression is here to prevent the special interpretation of - as a character range indicator (as in A-Z),
+ skips one or more delimiters (it could be omitted thanks to the filter(), but this would unnecessarily produce empty strings between matched separators), and
filter(None, …) removes the empty strings possibly created by leading and trailing separators (since empty strings have a false boolean value).
This re.split() precisely "splits with multiple separators", as asked for in the question title.
This solution also does not suffer from problems with non-ASCII characters in words (see the first comment to ghostdog74's answer).
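The pieces described above ([…], \-, + and filter(None, …)) can be put together roughly as follows. This is a sketch only: the exact set of separators inside the brackets is an assumption, chosen to fit the sentence from the question.

```python
import re

text = "Hey, you - what are you doing here!?"

# Assumed separator set: comma, space, hyphen (escaped), '!', '?' and ':'.
# The trailing + makes a run of separators count as a single split point.
parts = re.split(r"[, \-!?:]+", text)

# The trailing "!?" leaves an empty string at the end of parts;
# filter(None, ...) drops it because empty strings are falsy.
words = list(filter(None, parts))

print(words)
```

In Python 3, filter() returns an iterator, which is why the sketch wraps it in list().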
The re module is much more efficient than doing Python loops and tests "by hand".
Try this:
this will print
['Hey', 'you', 'what', 'are', 'you', 'doing', 'here']
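One way to get the output shown above is re.findall() with a word-character pattern. The pattern \w+ is an assumption of this sketch; it whitelists word characters instead of listing separators.

```python
import re

text = "Hey, you - what are you doing here!?"

# \w+ matches runs of letters, digits and underscores, so anything else
# (spaces, commas, hyphens, '!' and '?') acts as a separator.
words = re.findall(r"\w+", text)

print(words)
```

Because this keeps only word characters, a token such as "don't" would be split at the apostrophe, which may or may not be what you want.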
Pro-Tip: Use string.translate for the fastest string operations Python has.
Some proof...
First, the slow way (sorry pprzemek):
Next, we use re.findall() (as given by the suggested answer). MUCH faster:
Finally, we use translate:
Explanation:
string.translate is implemented in C and, unlike a chain of replace() calls, it substitutes every character in a single pass and builds only one result string instead of an intermediate string per substitution. So it's about as fast as you can get for string substitution.
It's a bit awkward, though, as it needs a translation table in order to do this magic. You can make a translation table with the maketrans() convenience function. The objective here is to translate all unwanted characters to spaces, a one-for-one substitution that costs a single table lookup per character. So this is fast!
Next, we use good old split(). By default, split() operates on all whitespace characters, grouping them together for the split. The result will be the list of words that you want. And this approach is almost 4x faster than re.findall()!
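The translate-then-split idea can be sketched as below. This sketch assumes Python 3, where the equivalents of the string.translate and maketrans() functions mentioned above are the str.translate() method and str.maketrans(). Using string.punctuation as the set of unwanted characters is also an assumption of this sketch.

```python
import string

text = "Hey, you - what are you doing here!?"

# Build a table that maps every ASCII punctuation character to a space,
# one for one (both arguments must have the same length).
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# translate() applies the whole table in one pass; split() with no
# arguments then collapses the runs of whitespace that are left behind.
words = text.translate(table).split()

print(words)
```

No timings are shown here; how much faster this is than a regex will depend on the input and the Python version.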
First, I want to agree with others that the regex or str.translate(...) based solutions are the most performant. For my use case the performance of this function wasn't significant, so I wanted to add ideas that I considered with that criterion in mind.
My main goal was to generalize ideas from some of the other answers into one solution that could work for strings containing more than just regex words (i.e., blacklisting the explicit subset of punctuation characters vs whitelisting word characters).
Note that, in any approach, one might also consider using string.punctuation in place of a manually defined list.
Option 1 - re.sub
I was surprised to see no answer so far uses re.sub(...). I find it a simple and natural approach to this problem.
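A minimal sketch of the re.sub(...) approach, assuming the punctuation to blacklist is the set , - ! ? (that set, and the variable names, are assumptions of this sketch):

```python
import re

my_str = "Hey, you - what are you doing here!?"

# Inner call: replace each blacklisted punctuation character with a space.
# strip() removes the leading/trailing spaces that would otherwise yield
# empty strings; the outer call then splits on runs of whitespace.
words = re.split(r"\s+", re.sub(r"[,\-!?]", " ", my_str).strip())

print(words)
```

The hyphen is escaped inside the character class so that it is not read as a range.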
In this solution, I nested the call to re.sub(...) inside re.split(...). If performance is critical, compiling the regex outside could be beneficial, but for my use case the difference wasn't significant, so I prefer simplicity and readability.
Option 2 - str.replace
This is a few more lines, but it has the benefit of being expandable without having to check whether you need to escape a certain character in regex.
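As a sketch of the str.replace variant, assuming the same four punctuation characters as separators (the tuple name replacements is hypothetical):

```python
my_str = "Hey, you - what are you doing here!?"

# Assumed blacklist; string.punctuation could be used here instead.
replacements = (",", "-", "!", "?")

# Strings are immutable, so each replace() returns a new string that is
# rebound to the same name before the next separator is handled.
for sep in replacements:
    my_str = my_str.replace(sep, " ")

words = my_str.split()
print(words)
```

Since str.replace takes literal strings, nothing in the tuple needs regex escaping.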
It would have been nice to be able to map the str.replace to the string instead, but I don't think it can be done with immutable strings, and while mapping against a list of characters would work, running every replacement against every character sounds excessive. (Edit: See next option for a functional example.)
Option 3 - functools.reduce
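A possible functional version using functools.reduce, sketched under the same assumptions about the separator set (names are hypothetical):

```python
from functools import reduce

my_str = "Hey, you - what are you doing here!?"
replacements = (",", "-", "!", "?")

# reduce() threads the string through one replace() per separator:
# the accumulator s starts as my_str and each step returns a new string.
cleaned = reduce(lambda s, sep: s.replace(sep, " "), replacements, my_str)

words = cleaned.split()
print(words)
```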
(In Python 2, reduce is available in the global namespace without importing it from functools.)
I had the same problem as @ooboo and found this topic. @ghostdog74 inspired me; maybe someone will find my solution useful.
If you don't want to split at spaces, put some other character in place of the space and split on that same character.
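The idea of substituting a placeholder and then splitting on it might look like the sketch below. The function name split_on and its parameters are hypothetical and are not taken from the answer.

```python
def split_on(text, separators, placeholder=" "):
    """Replace every separator with placeholder, then split on it."""
    for sep in separators:
        text = text.replace(sep, placeholder)
    # Drop the empty strings produced by adjacent or trailing separators.
    return [part for part in text.split(placeholder) if part]


# With "|" as the placeholder, spaces are preserved inside the pieces.
pieces = split_on("a,b;c d", ",;", placeholder="|")
print(pieces)
```

The placeholder should be a character that is not meaningful in the text itself, since any existing occurrence of it will also act as a split point.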