I need a fast and concise way to split string literals in a data frame into a set of columns. Let's say I have this data frame:
data <- data.frame(id=c(1,2,3), tok1=c("a, b, c", "a, a, d", "b, d, e"), tok2=c("alpha|bravo", "alpha|charlie", "tango|tango|delta") )
(Please note that the delimiters differ between columns.)
The number of string columns is usually not known in advance (although I can try to discover the whole set of cases if I have no alternative).
I need two data frames like these:
tok1.occurrences:
+----+---+---+---+---+---+
| id | a | b | c | d | e |
+----+---+---+---+---+---+
| 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 | 1 |
+----+---+---+---+---+---+
tok2.occurrences:
+----+-------+-------+---------+-------+-------+
| id | alpha | bravo | charlie | delta | tango |
+----+-------+-------+---------+-------+-------+
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 2 |
+----+-------+-------+---------+-------+-------+
I tried using this syntax:
tok1.f = factor(data$tok1)
dummies <- model.matrix(~tok1.f)
This gave me an incomplete solution. It creates my dummy variables correctly, but (obviously) it does not split on the delimiter.
I know I can use the 'tm' package to build a document-term matrix, but that seems like way too much for such simple tokenization. Is there a more straightforward way?
If you don't mind using data.table (temporarily), this might work for you:

You end up with a list of data frames that you can then process as you see fit.
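A minimal sketch of how a data.table-based approach along these lines could look; it is not necessarily the code this answer originally contained. The helper name `count_tokens` and the `delims` lookup of per-column regular expressions are assumptions made for illustration. Each string column is split into one row per token, then cast back to wide form with `length` as the aggregate so that repeated tokens are counted.

```r
library(data.table)

data <- data.frame(
  id   = c(1, 2, 3),
  tok1 = c("a, b, c", "a, a, d", "b, d, e"),
  tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the original answer): count token
# occurrences per id for one column, splitting on a regex pattern.
count_tokens <- function(df, col, pattern) {
  dt <- as.data.table(df)
  # Long form: one row per (id, token) pair.
  long <- dt[, .(token = unlist(strsplit(as.character(get(col)), pattern))),
             by = id]
  # Wide form: one column per distinct token, cells hold the counts.
  wide <- dcast(long, id ~ token, value.var = "token", fun.aggregate = length)
  as.data.frame(wide)
}

# Assumed delimiters, one regex per string column.
delims <- c(tok1 = ",\\s*", tok2 = "\\|")

occurrences <- lapply(names(delims), function(col) {
  count_tokens(data, col, delims[[col]])
})
names(occurrences) <- paste0(names(delims), ".occurrences")
```

`occurrences$tok1.occurrences` and `occurrences$tok2.occurrences` are then plain data frames, so data.table is only needed inside the helper. If the set of string columns is not known in advance, `delims` would have to be built some other way, for example by inspecting which columns are character.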
The easiest thing that I can think of is to use my cSplit function in conjunction with dcast.data.table, like this:

Edit: Updated with library(splitstackshape), since cSplit is now part of that package.

You could use the stringr package as follows:

Column one:

Column two:
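A minimal sketch of what a stringr-based version might look like, handling one column at a time; this is an illustration, not necessarily the code originally posted. The helper name `token_counts` is an assumption. It splits each string with `str_split`, collects the sorted set of distinct tokens, and tabulates how often each token appears in each row.

```r
library(stringr)

data <- data.frame(
  id   = c(1, 2, 3),
  tok1 = c("a, b, c", "a, a, d", "b, d, e"),
  tok2 = c("alpha|bravo", "alpha|charlie", "tango|tango|delta"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-row token counts for one character vector.
token_counts <- function(x, pattern) {
  pieces <- str_split(x, pattern)          # list: one character vector per row
  tokens <- sort(unique(unlist(pieces)))   # distinct tokens become the columns
  # For each row, count how many times each token occurs.
  rows <- lapply(pieces, function(p) {
    tabulate(match(p, tokens), nbins = length(tokens))
  })
  counts <- do.call(rbind, rows)
  colnames(counts) <- tokens
  as.data.frame(counts)
}

# Column one: comma-plus-space delimiter, matched literally.
tok1.occurrences <- cbind(id = data$id, token_counts(data$tok1, fixed(", ")))

# Column two: pipe delimiter, matched literally rather than as a regex.
tok2.occurrences <- cbind(id = data$id, token_counts(data$tok2, fixed("|")))
```

Wrapping the delimiter in `fixed()` avoids having to escape `|`, which would otherwise be read as regex alternation.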