Generating dummy webshop data in R: Incorporating

Posted 2019-06-13 20:26

For a course I am currently taking, I am trying to build dummy transaction, customer and product datasets to showcase a machine learning use case in a webshop environment, as well as a financial dashboard; unfortunately, we have not been given dummy data. I figured this would be a nice way to improve my R knowledge, but I am having severe difficulty realizing it.

The idea is that I specify some parameters/rules (arbitrary/fictitious, but applicable for a demonstration of a certain clustering algorithm). I'm basically trying to hide a pattern to then re-find this pattern utilizing machine learning (not part of this question). The pattern I'm hiding is based on the product adoption life cycle, attempting to show how identifying different customer types could be used for targeted marketing purposes.

I'll demonstrate what I'm looking for. I'd like to keep it as realistic as possible. I attempted to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.

The following is how far I have come. First, build a table of customers:

# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseeker")   # Names must match Parameters$CustomerType
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000), 
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2)   # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0   # Floor NumBought at 0 (no negative counts)
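
On the question of alternatives to a truncated normal for the number of transactions: a count distribution may be a more natural fit, since it yields non-negative whole numbers without any clipping. The sketch below is only an illustration; the mean of 3 mirrors the `rnorm` call above, while the `size = 2` dispersion and the variable names are assumptions.

```r
set.seed(1)
n_customers <- 10000

# Poisson draws are non-negative integers by construction (mean 3 here).
num_bought_pois <- rpois(n_customers, lambda = 3)

# A negative binomial allows more spread than the Poisson:
# mean = mu, variance = mu + mu^2 / size.
num_bought_nb <- rnbinom(n_customers, mu = 3, size = 2)

# Compare the two candidate distributions.
summary(num_bought_pois)
summary(num_bought_nb)
```

Either vector could stand in for the `NumBought` column; the negative binomial gives a longer tail of heavy buyers.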

Next, generate a table of products to choose from:

Products <- data.frame(
  ID=(1:50),
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10")   # No release dates earlier than 2013-04-10 (about 1 year ago)

Now I would like to generate n transactions (the number is in the customer table above), based on the following parameters for each variable that is currently relevant.

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
    stringsAsFactors=FALSE)

Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4    Dealseeker            0.6             0.05          0.35         12     0.10

The idea is that 'EarlyAdopters' would have (on average, normally distributed) 10% of transactions with a label 'BySearchEngine', 60% 'ByDirectCustomer' and 30% 'ByPartnerBlog'; these values need to exclude each other: one cannot be obtained via both a PartnerBlog and via a Search Engine in the final dataset. The options are:

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")

Furthermore, I'd like to generate a discount variable that is normally distributed utilizing the above means. For simplicity, standard deviations may be mean/5.
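
The two rules above (one mutually exclusive channel per transaction, and a normally distributed discount with sd = mean/5) can be sketched as follows. `draw_for_type` is a hypothetical helper, and clipping negative discounts to 0 is an assumption rather than something stated in the question.

```r
set.seed(1)
Parameters <- data.frame(
  CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, 0.40, 0.50, 0.60),
  ByDirectCustomer = c(0.60, 0.30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, 0.30, 0.35, 0.35),
  Discount         = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Hypothetical helper: draw n transactions for one customer type.
draw_for_type <- function(type, n) {
  p <- Parameters[Parameters$CustomerType == type, ]
  probs <- unlist(p[c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")])
  data.frame(
    CustomerType = type,
    # One label per transaction, so the channels are mutually exclusive.
    ReferredBy = sample(ObtainedBy, n, replace = TRUE, prob = probs),
    # sd = mean / 5; negative draws are clipped to 0 (an assumption).
    Discount = pmax(0, rnorm(n, mean = p$Discount, sd = p$Discount / 5)),
    stringsAsFactors = FALSE)
}

ea <- draw_for_type("EarlyAdopter", 1000)
prop.table(table(ea$ReferredBy))   # observed channel shares for this type
```

Because each transaction gets exactly one sampled label, the observed shares only approximate the target probabilities, which matches the "on average" wording.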

Next, my most tricky part, I'd like to generate these transactions according to a few rules:

  • Somewhat evenly distributed over days, maybe slightly more during the weekend;
  • Spread out between 2006-2014.
  • Spreading out the # of transactions of customers over the years;
  • Customers cannot buy products that haven't been released yet.
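
A rough sketch of how the date rules in this list could be combined: purchase dates are sampled from 2006-2014 with extra weight on weekends, and each purchase may only pick a product released on or before that date. The 1.5 weekend weight, the `products` table and the `pick_product` helper are assumptions made up for illustration.

```r
set.seed(1)
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# Assumed weighting: weekend days are 1.5 times as likely as weekdays.
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)   # 0 = Sunday, 6 = Saturday
day_weight <- ifelse(is_weekend, 1.5, 1)

n_transactions <- 5000
purchase_date <- sample(all_days, n_transactions, replace = TRUE, prob = day_weight)

# Hypothetical product table, used only to illustrate the release-date rule.
products <- data.frame(
  ID = 1:50,
  DateReleased = sample(seq(as.Date("2005-01-01"), as.Date("2014-06-30"), by = "day"), 50))

# For each purchase, choose only among products already released on that date.
pick_product <- function(d) {
  ids <- products$ID[products$DateReleased <= d]
  if (length(ids) == 0) return(NA_integer_)   # nothing released yet
  ids[sample.int(length(ids), 1)]             # index-based pick, safe for length-1 vectors
}
product_id <- vapply(purchase_date, pick_product, integer(1))
```

Spreading each customer's purchases over the years is not handled here; that would need a per-customer step on top of this.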

Other Parameters:

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

The result for CustomerID 2 would be:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

I'm getting more and more confident in writing R code, but I'm having difficulty writing code that keeps the global parameters (daily distribution of transactions, yearly maximum number of transactions per customer) and the various linkages consistent:

  • Timeliness: how quickly people purchase after release
  • ReferredBy: how did this customer arrive at my website?
  • Discount: how much discount the customer received (to illustrate how sensitive one is to discounts)

This causes me to not know whether I should write a for loop over the customer table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome as well, even though I'm eager to solve this problem by means of R. I'll keep this post updated as I progress.

My current pseudocode:

  • Assign customer to customer type with sample()
  • Generate Customers$NumBought transactions
  • ... Still thinking?

EDIT: Generating the transactions table, now I 'just' need to fill it with the right data:

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA,
  GrossPrice=NA,
  Discount=NA)

2 Answers
ら.Afraid
#2 · 2019-06-13 21:05

Following Gavin, I solved the issue with the following code:

First instantiate the CustomerTypes:

require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseeker")   # Names must match Parameters$CustomerType, or merge() drops the rows
PropCustTypes <- c(.10, .45, .30, .15)   # Probability for being in each group.

Set the parameters for my customer types

set.seed(1)   # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)

Describe the number of visitors

TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # Products are selected based on this, so start a few
                                             # years earlier, as people will also buy products older than 2 years.
AnnualGrowth <- 0.15

Now, as suggested, build a dataset of days. I added DaysSinceStart to use it in growing the business over time.

days <- data.frame(
  day            = StartDate+1:NumDays, 
  DaysSinceStart = StartDate+1:NumDays - StartDate,
  CustomerRate = TotalVisits/NumDays)

days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)
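
`AnnualGrowth` and `DaysSinceStart` are defined above but not yet applied to the rate. One possible way to use them, sketched here under the assumption of compound growth, is to scale the daily rate and then renormalise so the expected total stays at `TotalVisits`:

```r
TotalVisits  <- 20000
NumDays      <- 100
StartDate    <- as.Date("2009-01-04")
AnnualGrowth <- 0.15

days <- data.frame(day = StartDate + 1:NumDays)
days$DaysSinceStart <- as.numeric(days$day - StartDate)

# Compound growth: the rate is multiplied by (1 + AnnualGrowth) every 365 days.
growth <- (1 + AnnualGrowth)^(days$DaysSinceStart / 365)
# Rescale so that the expected total still equals TotalVisits.
days$CustomerRate <- TotalVisits * growth / sum(growth)

set.seed(1)
days$nPurchases <- rpois(NumDays, days$CustomerRate)
```

Over only 100 days the effect is small; it becomes visible when `NumDays` spans several years.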

Now build transactions from these days.

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times=days$nPurchases),
  CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
  NewCustomer  = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
  CustomerID   = NA,
  ProductID = NA,
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?

Initiate some customers we can choose from when not new.

Customers <- data.frame(ID=(1:100), 
                        CustomerType = sample(CustomerTypes, size=100,
                                              replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.

Make up a bunch of products to choose from, with evenly spaced release dates

ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
                                         as.numeric(Upper-Lower),
                                         as.numeric(Upper-Lower)/NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor SuggestedPrice at $10

ReferredByOptions <- c("SearchEngine", "DirectCustomer", "PartnerBlog")   # Same order as the By* probability columns

Now I loop over the newly created Transactions data.frame, choosing from the available products (those released around the purchase date minus the average timeliness (in months) * 30 days, +/- 15 days). I also assign new customers a new CustomerID and choose from existing customers if the customer is not new. Other fields are determined by the parameters above.

Start.time <- Sys.time()
ChannelCols <- c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")
# sample(x, 1) draws from 1:x when x is a single number, so pick by index instead.
PickOne <- function(x) x[sample.int(length(x), 1)]

for (i in seq_len(nrow(Transactions))){

  if (Transactions[i,]$NewCustomer==1){
    # New customer: append a row to Customers and use its ID.
    NewCustomerID <- max(Customers$ID, na.rm=TRUE)+1
    Customers[NewCustomerID,]$ID <- NewCustomerID
    Transactions[i,]$CustomerID <- NewCustomerID
    Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
  }
  if (Transactions[i,]$NewCustomer==0){
    # Returning customer: pick an existing customer of the same type.
    Transactions[i,]$CustomerID <- PickOne(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID)
  }
  # Draw this transaction's discount and timeliness around the type averages.
  Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
  Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
  # Channel probabilities for row i come from the merged Parameters columns.
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                               prob=unlist(Transactions[i,ChannelCols]))

  # Products released within +/- 15 days of (purchase date - timeliness in months * 30 days).
  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Transactions[i,]$ProductID <- PickOne(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID)
}
Elapsed <- Sys.time()-Start.time
nrow(Transactions)

And it's done! Unfortunately it takes ~ 22 minutes on a dataset of 20,000 products sold in 100 days. Not necessarily a problem, but I'm very much interested in potential improvements.
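
On possible improvements: row-by-row assignment into a data frame is typically the slow part of a loop like this, and most of the steps can be done on whole columns at once. The sketch below is one way this might look, not a drop-in replacement. It uses small stand-in tables (`Tr`, `Products`) with assumed values, leaves out the customer ID assignment, and swaps the product rule for a simpler one: the most recent product released on or before a jittered target date.

```r
set.seed(1)
# Stand-in inputs (assumptions, shaped like the tables above).
Parameters <- data.frame(
  CustomerType     = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseeker"),
  BySearchEngine   = c(0.10, 0.40, 0.50, 0.60),
  ByDirectCustomer = c(0.60, 0.30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, 0.30, 0.35, 0.35),
  Timeliness       = c(1, 6, 12, 12),
  Discount         = c(0, 0, 0.05, 0.10),
  stringsAsFactors = FALSE)
Channels <- c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")
n <- 20000
Tr <- data.frame(
  ID   = seq_len(n),
  Date = as.Date("2009-01-04") + sample(1:100, n, replace = TRUE),
  CustomerType = sample(Parameters$CustomerType, n, replace = TRUE,
                        prob = c(.10, .45, .30, .15)),
  stringsAsFactors = FALSE)
Products <- data.frame(
  ID = 1:100,
  DateReleased = seq(as.Date("2007-01-04"), as.Date("2009-04-14"), length.out = 100))

# Look up each row's parameters with match() instead of merge(),
# which keeps the original row order.
p <- Parameters[match(Tr$CustomerType, Parameters$CustomerType), ]

# rnorm() is vectorised over mean and sd, so one call covers all rows.
Tr$Discount   <- rnorm(n, p$Discount,   p$Discount / 20)
Tr$Timeliness <- rnorm(n, p$Timeliness, p$Timeliness / 6)

# Sample the channel once per customer type rather than once per row.
Tr$ReferredBy <- NA_character_
for (type in Parameters$CustomerType) {
  rows  <- which(Tr$CustomerType == type)
  probs <- unlist(Parameters[Parameters$CustomerType == type, Channels])
  Tr$ReferredBy[rows] <- sample(Channels, length(rows), replace = TRUE, prob = probs)
}

# Target release date: purchase date minus timeliness (30-day months),
# with up to 15 days of jitter either way.
target <- as.numeric(Tr$Date) - Tr$Timeliness * 30 + runif(n, -15, 15)
rel    <- as.numeric(Products$DateReleased)   # sorted ascending
# findInterval() returns the index of the last release date <= target;
# pmax() falls back to the first product when the target precedes all releases.
Tr$ProductID <- Products$ID[pmax(findInterval(target, rel), 1)]
```

The only remaining loop runs four times (once per customer type) instead of once per transaction.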

闹够了就滚
#3 · 2019-06-13 21:13

Very roughly, set up a database of days, and the number of visits on each day:

days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000)   # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)

Then catalogue the visits

visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])

Any of the variables with X in front of them are parameters of your process. You'd similarly go on to generate a transactions database by parametrising the relative likelihood amongst objects available, according to the other columns you have. Or you can generate a visits database including a key to each product available at that day:

productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
# number of products already released on each day
days$productsAvailable <- findInterval(days$day, productRelease$releaseDay)
# repeat each visit once per product available on its day
visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
# number the products 1..k within each visit
visits$prodID <- ave(visits$id, visits$id, FUN=seq_along)

You can then define a function that gives you, for each row, the probability of the customer purchasing that item (based on day, customer, product), and then fill in the purchase with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.
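
As a purely illustrative sketch of that last step, the block below invents a probability model (a per-type base rate that halves after a type-specific number of days since release) and applies the `runif` comparison. All numbers, `baseRate`, `halfLife` and the `daysSinceRelease` column are assumptions, not part of the answer above.

```r
set.seed(1)
# Stand-in visits table with one row per (visit, product) pair.
visits <- data.frame(
  id = 1:1000,
  customerType = sample(4, 1000, replace = TRUE),
  daysSinceRelease = sample(0:365, 1000, replace = TRUE))

# Hypothetical model: each customer type has a base purchase rate that
# decays with product age at a type-specific speed (all values invented).
baseRate <- c(0.30, 0.20, 0.10, 0.15)
halfLife <- c(30, 180, 360, 360)   # days until the rate halves
purchaseProb <- baseRate[visits$customerType] *
  0.5^(visits$daysSinceRelease / halfLife[visits$customerType])

# A uniform draw falls below p with probability p (one Bernoulli trial per row).
visits$didTheyPurchase <- runif(nrow(visits)) < purchaseProb
```

Types with a short half-life buy mostly new products, which is one way a "timeliness" pattern could be hidden in the data.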

Sorry, there are probably typos littered throughout this as I was typing it straight in, but hopefully this gives you an idea.
