I think what I want to do is a fairly common task, but I've found no reference on the web. I have text with punctuation, and I want a list of the words.
"Hey, you - what are you doing here!?"
should be
['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
But Python's str.split() only works with one argument... So I still have the punctuation attached to the words after I split on whitespace. Any ideas?
A case where regular expressions are justified:
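A minimal sketch of the regex route, assuming that "word" means a run of letters, digits, or underscores (which is what \w matches) and that lowercasing up front is acceptable:

```python
import re

text = "Hey, you - what are you doing here!?"

# \w+ grabs each run of word characters; everything else acts as a separator.
words = re.findall(r"\w+", text.lower())
print(words)  # ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```

If apostrophes should stay inside words (as in "don't"), a character class such as [\w']+ is one possible variation.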
I think the following is the best answer to suit your needs:
\W+ may be suitable for this case, but may not be suitable for other cases. Here is the answer with some explanation.
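As a rough illustration of splitting on \W+, here is a sketch with each step spelled out. The variable names are mine, and dropping empty strings is an assumption about what is wanted:

```python
import re

text = "Hey, you - what are you doing here!?"

# \W+ matches one or more non-word characters, so ", " and " - " each
# count as a single separator.
pieces = re.split(r"\W+", text)

# Punctuation at the very end (or start) leaves an empty string behind,
# so filter those out; lowercase to match the desired output.
words_split = [piece.lower() for piece in pieces if piece]
print(words_split)
```

Note that \W also treats apostrophes and hyphens as separators, which is one reason it may not fit other inputs.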
Or, in one line, we can do it like this:
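One possible one-line form, assuming the same \W+ split and that empty strings should be discarded (filter(None, ...) drops falsy items):

```python
import re

text = "Hey, you - what are you doing here!?"

# Split on runs of non-word characters and drop the empty leftovers.
words_one_line = list(filter(None, re.split(r"\W+", text.lower())))
print(words_one_line)
```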
updated answer
Here is the usage:
I like re, but here is my solution without it:
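A minimal sketch of that idea with itertools.groupby, assuming the separators are collected in a string named sep (the exact set of characters here is my assumption, pick whatever your text needs):

```python
from itertools import groupby

sep = " ,-!?"  # assumed separator characters
text = "Hey, you - what are you doing here!?"

# k is True for runs of separators and False for runs of word characters;
# g is the run itself, so join it back into a string.
words_grouped = ["".join(g) for k, g in groupby(text.lower(), sep.__contains__) if not k]
print(words_grouped)  # ['hey', 'you', 'what', 'are', 'you', 'doing', 'here']
```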
sep.__contains__ is the method used by the 'in' operator. Basically it is the same as lambda ch: ch in sep but is more convenient here.
groupby takes our string and a function. It splits the string into groups using that function: whenever the function's value changes, a new group starts. So sep.__contains__ is exactly what we need.
groupby returns a sequence of pairs, where pair[0] is the result of our function and pair[1] is the group. Using 'if not k' we filter out the groups of separators (because sep.__contains__ returns True on separators). Well, that's all: now we have a sequence of groups where each one is a word (a group is actually an iterable, so we use join to convert it to a string).
This solution is quite general, because it uses a function to decide where to split (you can split on any condition you need). Also, it doesn't create intermediate strings/lists (you can remove join and the expression becomes lazy, since each group is an iterator).
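To illustrate the "any condition" and laziness points, here is a sketch of a generator that takes an arbitrary predicate. The name split_by and the isalnum-based predicate are my own choices for illustration, not part of the original answer:

```python
from itertools import groupby

def split_by(text, is_separator):
    """Lazily yield the runs of characters for which is_separator is false."""
    for is_sep, group in groupby(text, is_separator):
        if not is_sep:
            # Each group must be consumed before groupby moves on,
            # so join it into a string right away.
            yield "".join(group)

lazy_words = split_by("Hey, you - what are you doing here!?",
                      lambda ch: not ch.isalnum())
first_word = next(lazy_words)   # only the first word has been computed so far
remaining = list(lazy_words)    # the rest is produced on demand
print(first_word, remaining)
```

Because it is a generator, you can stop after the first few words without scanning the whole string.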
Then this becomes a three-liner:
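A sketch of what such a three-liner might look like, wrapped in a function. The name split_on_tokens and the final removal of empty fragments are my assumptions; the three lines in the middle are the actual idea:

```python
import string

def split_on_tokens(text, tokens=string.punctuation):
    fragments = [text]
    for token in tokens:
        # Split every fragment on this token and flatten the results
        # back into a single list.
        fragments = [piece for frag in fragments for piece in frag.split(token)]
    # Adjacent tokens leave empty strings behind; drop them.
    return [frag for frag in fragments if frag]

text = "Hey, you - what are you doing here!?"
# Add the space character so whitespace is treated as a token too.
words_tokens = split_on_tokens(text.lower(), string.punctuation + " ")
print(words_tokens)
```

With the default tokens alone, whitespace is left untouched, so fragments such as ' you ' keep their surrounding spaces.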
Explanation
This is what in Haskell is known as the List monad. The idea behind the monad is that once "in the monad" you "stay in the monad" until something takes you out. For example in Haskell, say you map the Python range(n) -> [0,1,...,n-1] function over a List. If the result is a List, it will be appended to the List in place, so you'd get something like map(range, [3,4,1]) -> [0,1,2,0,1,2,3,0]. This is known as map-append (concatMap in Haskell). The idea here is that you've got this operation you're applying (splitting on a token), and whenever you do that, you join the result into the list.
You can abstract this into a function and have tokens=string.punctuation by default.
Advantages of this approach: