When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input?
For example, to count the lengths of words in a character vector:
> words <- c("a","quick","brown","fox")
> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)
> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only
> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown fox
1 5 5 3
# Success, but potentially very slow
Ideally, something like length(strsplit(words,"")[[.]]) where . is interpreted as being the relevant part of the input vector.
In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so try to avoid it if possible. In your example, you should use nchar instead: nchar(words) returns the length of every word in a single call.

More generally, take advantage of the fact that strsplit returns a list and use lapply on its output, along the lines of as.numeric(lapply(strsplit(words,""), length)).

Or else use an l*ply family function from plyr. For instance, laply(strsplit(words,""), length) returns the counts as a vector.

Edit:
In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses.
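A minimal sketch of that preparation step, assuming a short stand-in string in place of the full novel. The names ulysses_text and all_words are illustrative, and the splitting pattern is just one reasonable choice:

```r
# Stand-in for the full text of the novel: only its opening words (assumption).
ulysses_text <- "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather"

# Split on runs of non-letter characters, then flatten the list into a vector.
all_words <- unlist(strsplit(ulysses_text, "[^A-Za-z]+"))

# Drop empty strings, which appear if the text starts with a separator.
all_words <- all_words[nzchar(all_words)]

length(all_words)
head(all_words)
```

Here strsplit is called once on the whole text, so the list it returns has a single element and unlist is enough to get a plain character vector of words.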
Now that I have all the words, we can do our counts:
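The comparison can be sketched roughly as follows. This is not the original benchmark: it assumes a synthetic word vector (the question's four words repeated) rather than the novel, and the variable names are illustrative. Timings will vary by machine and R version, so none are shown.

```r
# Stand-in corpus: a small vocabulary repeated many times (assumption).
all_words <- rep(c("a", "quick", "brown", "fox"), times = 25000)

# 1. Fully vectorized: nchar handles the whole vector in one call.
t_nchar <- system.time(n1 <- nchar(all_words))

# 2. Split the whole vector once, then take the length of each list element.
t_lapply <- system.time(
  n2 <- as.numeric(lapply(strsplit(all_words, ""), length))
)

# 3. The original approach: a separate strsplit call for every word.
t_sapply <- system.time(
  n3 <- sapply(all_words, function(x) length(strsplit(x, "")[[1]]),
               USE.NAMES = FALSE)
)

# All three approaches should agree on the counts.
stopifnot(all(n1 == n2), all(n1 == n3))
summary(n1)

# Elapsed seconds for each approach.
c(nchar  = t_nchar[["elapsed"]],
  lapply = t_lapply[["elapsed"]],
  sapply = t_sapply[["elapsed"]])
```

The only difference between the second and third versions is where the iteration happens: strsplit is itself vectorized over its input, so calling it once and iterating over the resulting list avoids the per-word function call overhead of the sapply version.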
The vectorized function and lapply are considerably faster than the original sapply version. All solutions return the same answer (as seen by the summary output).

Apparently the latest version of plyr is faster (this is using a slightly older version).