I am attempting to create new variables using a function and lapply
rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign
), I have left the naming process until after the lapply
. I am then using a for
loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
df <- df[df$v1 == cat,]
df <- df[order(-df$v2),]
#Divide the customers into top percents
nr <- nrow(df)
p10 <- round(nr * .10, 0)
cat("Number of people in the Top 10% :", p10, "\n")
p20 <- round(nr * .20, 0)
p11_20 <- p20-p10
cat("Number of people in the 11-20% :", p11_20, "\n")
#Keep only those customers in the top groups
df <- df[1:p20,]
#Create a variable to identify the percent group the customer is in
top_pct <- integer(length = p10 + p11_20)
#Identify those in each group
top_pct[1:p10] <- 10
top_pct[(p10+1):p20] <- 20
#Add this variable to the data frame
df$top_pct <- top_pct
#Keep only custID and the new variable
df <- subset(df, select = c(custID, top_pct))
return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of
split
andlapply
. You're halfway there with your use oflapply
; you just need to usesplit
as well.Finding quantiles is done with
quantile
.You don't need the function
pf
to achieve what you want. Trydplyr/tidyr
comboStick to your Stata instincts and use a single data set:
You can see the result by typing
DT
.From here, you can group the within-
v1
rank,r
, if you want to. Following Stata idioms...This is like
gen
and then tworeplace ... if
statements. Again, you can see the result withDT
.Finally, you can subset with
which gives
These steps can also be chained together:
(Thanks to @ExperimenteR:)
To rearrange for the desired output in the OP, with values of
v1
in columns, usedcast
:Currently,
dcast
requires the latest version ofdata.table
, available (I think) from Github.