For a course I am currently taking, I am trying to build a dummy transaction, customer & product dataset to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am experiencing severe difficulties in realizing it.
The idea is that I specify some parameters/rules (arbitrary/fictitious, but applicable for a demonstration of a certain clustering algorithm). I'm basically trying to hide a pattern to then re-find this pattern utilizing machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, attempting to show how identifying different customer types could be used for targeted marketing purposes.
I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.
This is how far I have come. First, build a table of customers:
# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseeker") # Names must match Parameters$CustomerType
PropCustTypes <- c(.10, .45, .30, .15) # Probability of being in each group.
set.seed(1) # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000),
CustomerType = sample(CustomerTypes, size=10000,
replace=TRUE, prob=PropCustTypes),
NumBought = round(rnorm(10000,3,2)) # Number of transactions to generate (rounded to whole numbers), open to alternative solutions?
)
Customers$NumBought[Customers$NumBought<0] <- 0 # Floor NumBought at 0
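One alternative for NumBought, as a rough sketch: a Poisson distribution produces whole, non-negative counts directly, so neither rounding nor flooring is needed. The value lambda = 3 simply mirrors the mean used above, and the name NumBoughtPois is made up for this illustration.

```r
# Sketch: draw transaction counts from a Poisson distribution instead of a
# rounded, floored normal. lambda = 3 mirrors the mean used above (assumption).
set.seed(1)
n_customers <- 10000
NumBoughtPois <- rpois(n_customers, lambda = 3)

# Counts are whole numbers and never negative, so no clean-up step is needed.
summary(NumBoughtPois)
```

Unlike the normal draw, this has no separate spread parameter: the variance equals the mean.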
Next, generate a table of products to choose from:
Products <- data.frame(
ID=(1:50),
DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice<10] <- 10 # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased>as.Date("2013-04-10")] <- as.Date("2013-04-10") # Cap release date at 1 year ago (no later releases)
Now I would like to generate n transactions per customer (n is NumBought in the customer table above), based on the following parameters for each variable that is currently relevant.
Parameters <- data.frame(
CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
BySearchEngine = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
ByPartnerBlog = c(0.30, .30, 0.35, 0.35),
Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
stringsAsFactors=FALSE)
Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4    Dealseeker            0.6             0.05          0.35         12     0.10
The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of transactions with a label 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'; these channels are mutually exclusive: a transaction cannot be obtained via both a PartnerBlog and a Search Engine in the final dataset. The options are:
ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
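A minimal sketch of how mutually exclusive channels could be drawn: sample() returns exactly one label per transaction, so a transaction can never carry two channels. The probabilities are copied from the Parameters table; ChannelProbs and draw_channel are hypothetical names introduced only for this example.

```r
set.seed(1)
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Channel probabilities per customer type, copied from the Parameters table.
ChannelProbs <- rbind(
  EarlyAdopter  = c(0.10, 0.60, 0.30),
  Pragmatists   = c(0.40, 0.30, 0.30),
  Conservatives = c(0.50, 0.15, 0.35),
  Dealseeker    = c(0.60, 0.05, 0.35)
)
colnames(ChannelProbs) <- ObtainedBy

# Hypothetical helper: draw one channel per transaction for a given type.
draw_channel <- function(type, n) {
  sample(ObtainedBy, size = n, replace = TRUE, prob = ChannelProbs[type, ])
}

channels <- draw_channel("EarlyAdopter", 1000)
round(prop.table(table(channels)), 2)  # observed shares, close to the targets
```

With many transactions the observed shares approach the target probabilities; for a single customer with only a few purchases they will vary considerably.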
Furthermore, I'd like to generate a discount variable that is normally distributed utilizing the above means. For simplicity, standard deviations may be mean/5.
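As a sketch of that rule: rnorm() accepts the mean and a standard deviation of mean/5 directly. Flooring the result at zero is an extra assumption (so that no negative discounts occur), and MeanDiscount and draw_discount are hypothetical names.

```r
set.seed(1)
# Mean discount per customer type, from the Parameters table.
MeanDiscount <- c(EarlyAdopter = 0, Pragmatists = 0,
                  Conservatives = 0.05, Dealseeker = 0.10)

# Hypothetical helper: normal draws with sd = mean / 5, floored at zero so a
# discount can never be negative (the floor is an added assumption).
draw_discount <- function(type, n) {
  m <- MeanDiscount[[type]]
  pmax(rnorm(n, mean = m, sd = m / 5), 0)
}

d <- draw_discount("Dealseeker", 1000)
summary(d)
```

For types with a mean of 0 the standard deviation is also 0, so every draw is exactly 0.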
Next, the trickiest part: I'd like to generate these transactions according to a few rules:
- Somewhat evenly distributed over days, maybe slightly more during the weekend;
- Spread out between 2006 and 2014;
- Each customer's transactions spread out over the years;
- Customers cannot buy products that haven't been released yet.
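Two of these rules can be sketched with weighted sampling over a calendar of days. The weekend weight of 1.2 is an arbitrary assumption, and AllDays, DayWeight and ToyProducts are names and data made up for this illustration.

```r
set.seed(1)
AllDays <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# Weekend days get a slightly higher weight (1.2 is an arbitrary assumption).
# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday.
IsWeekend <- as.POSIXlt(AllDays)$wday %in% c(0, 6)
DayWeight <- ifelse(IsWeekend, 1.2, 1)

# Draw purchase dates for 5 transactions; indexing AllDays keeps the Date class.
PurchaseDates <- AllDays[sample(length(AllDays), size = 5,
                                replace = TRUE, prob = DayWeight)]

# Toy product table (release dates made up for illustration).
ToyProducts <- data.frame(
  ID = 1:3,
  DateReleased = as.Date(c("2006-01-01", "2009-06-01", "2013-04-10")))

# For one purchase date, only products already released are eligible.
eligible <- ToyProducts$ID[ToyProducts$DateReleased <= PurchaseDates[1]]
eligible
```

The same weight vector could also carry a growth trend, for example by multiplying it by a factor that increases with the date.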
Other Parameters:
YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <- 1 # Same question? Likely dependent on YearlyMax
The result for CustomerID 2 would be:
Transactions <- data.frame(
ID = c(1,2),
CustomerID = c(2,2), # The customer that bought the item.
ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.
Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00
I'm getting more and more confident in writing R code, but I'm having difficulties writing the code to keep the global parameters (daily distributions of transactions, yearly maximum of # transactions per customer) as well as the various linkages in line:
- Timeliness: how quickly people purchase after release
- ReferredBy: how did this customer arrive at my website?
- Discount: how much discount the customer received (to illustrate how sensitive one is to discounts)
As a result, I don't know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem by means of R. I'll keep this post updated as I progress.
My current pseudocode:
- Assign customer to customer type with sample()
- Generate Customers$NumBought transactions
- ... Still thinking?
EDIT: Generating the transactions table, now I 'just' need to fill it with the right data:
Tr <- data.frame(
ID = 1:sum(Customers$NumBought),
CustomerID = NA,
DateOfPurchase = NA,
ReferredBy = NA,
GrossPrice=NA,
Discount=NA)
Following Gavin, I solved the issue with the following code:
First instantiate the CustomerTypes:
Set the parameters for my customer types
Describe the number of visitors
Now, as suggested, build a dataset of days. I added DaysSinceStart to use it in growing the business over time.
Now build transactions from these days.
Initiate some customers we can choose from when not new.
Make up a bunch of products to choose from, with evenly divided release dates
Now I loop over the newly created Transaction data.frame, choosing from available products (measured by purchase date - average timeliness (in months) * 30 days, +/- 15 days). I also assign new customers a new CustomerID, and choose from existing customers when the customer is not new. Other fields are determined by the parameters above.
And it's done! Unfortunately it takes ~ 22 minutes on a dataset of 20,000 products sold in 100 days. Not necessarily a problem, but I'm very much interested in potential improvements.
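One possible improvement, sketched under assumptions rather than taken from the code described above: row-by-row loops over a data.frame are slow in R, and most of these steps can be done for all rows at once. Here match() looks up each row's parameters, findInterval() picks the latest product released on or before a target date, and rnorm() draws all discounts in one call. The tables Params, Prods and Tr are toy stand-ins with made-up values.

```r
set.seed(1)
# Toy inputs; all names and values here are illustrative assumptions.
Params <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  Timeliness   = c(1, 6, 12, 12),
  Discount     = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)
Prods <- data.frame(
  ID = 1:50,
  DateReleased = seq(as.Date("2005-01-01"), by = "2 months", length.out = 50))
n <- 20000
Tr <- data.frame(
  CustomerType   = sample(Params$CustomerType, n, replace = TRUE,
                          prob = c(.10, .45, .30, .15)),
  DateOfPurchase = as.Date("2014-01-01") + sample(0:99, n, replace = TRUE),
  stringsAsFactors = FALSE)

# Look up each row's parameters once, instead of inside a loop.
idx <- match(Tr$CustomerType, Params$CustomerType)

# Target release date: purchase date minus timeliness (months * 30), +/- 15 days.
target <- Tr$DateOfPurchase - Params$Timeliness[idx] * 30 +
  sample(-15:15, n, replace = TRUE)

# findInterval() returns, per row, the index of the latest product released on
# or before the target date (release dates must be sorted ascending).
pos <- findInterval(as.numeric(target), as.numeric(Prods$DateReleased))
Tr$ProductID <- Prods$ID[pmax(pos, 1)]  # fall back to the oldest product

# Discounts for all rows in one call; rnorm() is vectorised over mean and sd.
m <- Params$Discount[idx]
Tr$Discount <- pmax(rnorm(n, mean = m, sd = m / 5), 0)
```

This picks the nearest earlier release instead of sampling among several candidates, which is a simplification; the point is only that no per-row loop is required.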
Very roughly, set up a database of days, and the number of visits on each day:
Then catalogue the visits
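One way these two steps could look, as a rough sketch: daily visit counts are drawn from a Poisson distribution whose mean grows over time, and each day is then repeated once per visit. All X-prefixed names and their values are assumptions for illustration.

```r
set.seed(1)
# X-prefixed names are process parameters (values are illustrative assumptions).
XnDays        <- 100    # length of the simulation in days
XmeanVisits   <- 50     # expected visits on the first day
XgrowthPerDay <- 0.01   # relative growth in expected visits per day

# One row per day, with a Poisson number of visits that grows over time.
days <- data.frame(day = as.Date("2014-01-01") + 0:(XnDays - 1),
                   DaysSinceStart = 0:(XnDays - 1))
days$nVisits <- rpois(XnDays,
                      XmeanVisits * (1 + XgrowthPerDay * days$DaysSinceStart))

# One row per visit: repeat each day as many times as it has visits.
visits <- data.frame(day = rep(days$day, times = days$nVisits))
visits$visitID <- seq_len(nrow(visits))
```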
Any of the variables with `X` in front of them are parameters of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available on that day. You can then decide on a function that gives you, for each row, a probability of the customer purchasing that item (based on day, customer, product), and then fill in the purchase with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.
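A small sketch of that idea, on toy data: merge() with by = NULL builds every visit/product combination, the filter keeps only products already released, and a made-up probability function decides each purchase. XbaseProb, XhalfLifeDays and the decay formula are assumptions chosen purely for illustration.

```r
set.seed(1)
visits <- data.frame(
  visitID = 1:4,
  day = as.Date(c("2013-01-05", "2013-06-01", "2014-02-01", "2014-03-15")))
products <- data.frame(
  productID = 1:3,
  DateReleased = as.Date(c("2012-12-12", "2013-05-01", "2014-03-01")))

# Cross join (merge with by = NULL), then keep products released by the visit day.
offers <- merge(visits, products, by = NULL)
offers <- offers[offers$DateReleased <= offers$day, ]

# Hypothetical probability function: newer products are more likely to sell.
# XbaseProb and XhalfLifeDays are made-up process parameters.
XbaseProb     <- 0.3
XhalfLifeDays <- 180
age <- as.numeric(offers$day - offers$DateReleased)   # product age in days
offers$XmyProbability <- XbaseProb * 0.5^(age / XhalfLifeDays)

# A purchase happens when a uniform draw falls below the row's probability.
offers$didTheyPurchase <- runif(nrow(offers)) < offers$XmyProbability
```

The cross join grows as visits times products, so for large tables it may be worth filtering or sampling candidate products before joining.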
Sorry, there are probably typos littered throughout this, as I was typing it straight in, but hopefully this gives you an idea.