I have a dataset which contains weekly sale of various products by outlet. Here is how the data looks like:
Store ID Week ID Item Code Sales in $
253422 191 41130 2.95
272568 188 41130 2.95
272568 188 41160 2.95
272568 189 41130 2.95
272568 189 41160 2.95
272568 190 41160 2.95
217460 188 41110 2.95
217460 188 41130 5.9
217460 188 41160 5.9
217460 189 41110 11.8
217460 189 41130 8.85
217460 189 41160 11.8
217460 191 41130 5.95
217460 191 41160 8.93
This is a very large dataset and I would like to generate a summary output which gives me the ITEM wise total sales and the number of stores in which the item is present. I tried the following, but that doesn't work because I get a store count which is repeated due to the repetition of weeks in the dataset:
dataset %>% group_by(Store ID) %>% summarize(count(Item Code))
Any help is highly appreciated.
Thanks
You can do this with aggregate
## Recreate your data
df = read.table(text="'Store ID' 'Week ID' 'Item Code' 'Sales in Dollars'
253422 191 41130 2.95
272568 188 41130 2.95
272568 188 41160 2.95
272568 189 41130 2.95
272568 189 41160 2.95
272568 190 41160 2.95
217460 188 41110 2.95
217460 188 41130 5.9
217460 188 41160 5.9
217460 189 41110 11.8
217460 189 41130 8.85
217460 189 41160 11.8
217460 191 41130 5.95
217460 191 41160 8.93",
header=TRUE)
aggregate(df$Sales.in.Dollars, list(df$Item.Code), sum)
Group.1 x
1 41110 14.75
2 41130 29.55
3 41160 35.48
StoreItems = unique(df[,c(1,3)])
aggregate(StoreItems$Store.ID, list(StoreItems$Item.Code), length)
Group.1 x
1 41110 1
2 41130 3
3 41160 2
Here's a way to do it using dplyr
library(dplyr)
df <- tibble::tribble(
~store_id, ~week_id, ~item_code, ~sales,
253422L, 191L, 41130L, 2.95,
272568L, 188L, 41130L, 2.95,
272568L, 188L, 41160L, 2.95,
272568L, 189L, 41130L, 2.95,
272568L, 189L, 41160L, 2.95,
272568L, 190L, 41160L, 2.95,
217460L, 188L, 41110L, 2.95,
217460L, 188L, 41130L, 5.9,
217460L, 188L, 41160L, 5.9,
217460L, 189L, 41110L, 11.8,
217460L, 189L, 41130L, 8.85,
217460L, 189L, 41160L, 11.8,
217460L, 191L, 41130L, 5.95,
217460L, 191L, 41160L, 8.93
)
df %>%
group_by(item_code) %>%
summarise(total_sales = sum(sales),
stores = n_distinct(store_id))
#> # A tibble: 3 x 3
#> item_code total_sales stores
#> <int> <dbl> <int>
#> 1 41110 14.75 1
#> 2 41130 29.55 3
#> 3 41160 35.48 2