Creating dummy variables in R based on multiple ch

I'm trying to create multiple dummy variables, based on one column called 'Tags' within my df (14 rows, 2 columns, Score and Tags. My problem is that in each cell there can be any number of chr values (up to about 30 values).

When I ask for:

 str(df$Tags)

R returns:

chr [1:14] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolera"| __truncated__ ...

And when I ask for:

df$Tags[1]

R returns:

[1] "\"biologische gerechten\", \"certificaat van uitmuntendheid tripadvisor 2016\", \"gebruik streekproducten\", \"lactose intolerantie\", \"met familie\", \"met vrienden\", \"noten allergie\", \"pinda allergie\", \"vegetarische gerechten\", chinees, gastronomisch, glutenvrij, kindvriendelijk, romantisch, traditioneel, trendy, verjaardag, zakelijk"

It seems that the values within the first cell are not formatted the same (the values between comma's)

So what I wish for, is to create a dummy variable for each possible value that occurs within each cells. So the first new dummy should be called "biologische gerechten" (or any alike) and should show for each case whether the corresponding value is present (1) in the column 'Tags' or not (0).

i tried several things with 'dplyr' like:

df = mutate(df, biologisch = ifelse(Tags == "biologische gerechten", 1, 0))

R does create a new column 'biologisch', but it only contains zero's. Is there another way to separate all values and then create dummy variables for all possible values? Hope someone can help me, thank you!

标签： r string character dplyr dummy-variable

1条回答

Fickle 薄情

2楼-- · 2019-09-08 12:23

Here's one solution:

# make some toy data to test
set.seed(1)
df <- data.frame(Score = rnorm(10),
                 Tags = replicate(10, paste(sample(LETTERS, 5), collapse = ", ")),
                 stringsAsFactors = FALSE)

# load stringr, which we'll use to trim whitespace from the split-up tags
library(stringr)

# use strsplit to break your jumbles of tags into separate elements, with a 
# list for each position in the original vector. i've split on commas here,
# but you'll probably want to split on slashes or slashes and quotation marks.
t <- strsplit(df$Tags, split = ",")

# get a vector of the unique elements of those lists. you may need to use str_trim
# or something like it to cut leading and trailing whitespace. you might also
# need to use stringr's `str_subset` and a regular expression to cut the result
# down to, say, only alphanumeric strings. without a reproducible example, though,
# i can't do that for your specific case here.
tags <- unique(str_trim(unlist(t)))

# now, use `sapply` and `grepl` to look for each element of `tags` in each list;
# use `any` to summarize the results; 
# use `+` to convert those summaries to binary;
# use `lapply` to iterate that process over all elements of `tags`;
# use `Reduce(cbind, ...)` to collapse the results into a table; and
# use `as.data.frame` to turn that table into a df.
df2 <- as.data.frame(Reduce(cbind, lapply(tags, function(i) sapply(t, function(j) +(any(grepl(i, j), na.rm = TRUE))))))

# assign the tags as column names
names(df2) <- tags

Voila:

> df2
   Y F P C Z K A J U H M O L E S R T Q V B I X G
1  1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2  0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0
4  0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
5  0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0
6  0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
7  0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0
8  0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
9  1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0
10 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 1

0人赞添加讨论(0) 举报

Creating dummy variables in R based on multiple ch

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间