SQL query Frequency Distribution matrix for produc

2020-08-24 07:08发布

问题:

i want to create a frequency distribution matrix

1.Create a matrix.**Is it possible to get this in separate columns**

  customer1       p1         p2      p3
  customer 2      p2         p3
  customer 3      p2         p3      p1
  customer 4      p2         p1

2. Then I have to count the number of products that come together the most

   For eg  
    p2 and p3 comes together 3 times
    p1 p3   comes 2 times
    p1 p2  comes  2 times

I want to recommend products to customers ,frequency of products that comes together

 select customerId,product,count(*) from sales group by customerId,product

Can anyone please help me for a solution to this

回答1:

If you want pairs of products that customers purchase, then you can use a self join:

select s1.product, s2.product, count(*) as cnt
from sales s1 join
     sales s2
     on s1.customerId = s2.customerId
where s1.product < s2.product
group by s1.product, s2.product
order by cnt desc;

You can extend this to more than two products by using more joins.



回答2:

1.

Array

select      customerId
           ,sort_array(collect_set (product))   as products

from        sales 

group by    customerId
;

Multiple columns

select      customerId

           ,products[0] as p0
           ,products[1] as p1
           ,products[2] as p2
           ,products[3] as p3
           ,products[4] as p4
           ,products[5] as p5
           ,products[6] as p6
           ,products[7] as p7
           ,products[8] as p8
           ,products[9] as p9

from       (select      customerId
                       ,sort_array(collect_set (product))   as products

            from        sales 

            group by    customerId
            ) s
;

2.

This returns the frequency of the whole products' combinations.
In your example (p1,p2,p3) is the most frequent (appears twice).
(p1,p2) appears once and so is (p2,p3).
For frequency of tuples, see @GordonLinoff answer.

select      s.products
           ,count(*)    as frequency 

from       (select      customerId
                       ,sort_array(collect_set (product))   as products

            from        sales 

            group by    customerId
            ) s

group by    s.products           

order by    frequency desc


回答3:

I know cursors are not the flavor of the month anymore - but I still find them very useful from time to time

Below code creates a table variable and then loops through the sales data to find out which combination of products are sold together most frequently

--Results will be loaded into this table
DECLARE @OutputCombination TABLE (CombinationName VARCHAR(MAX), CombinationsFound INT) 

--Just for demo - create table and load with data
declare @demodata table (ClientID int, ProductBought varchar(50))
insert into @demodata (Clientid, ProductBought) values 
(1,'Coke'),
(1,'Chips'),
(1,'Chips'),
(2,'Coke'),
(2,'Coke'),
(2,'Sweets'),
(3,'Coke'),
(4,'Chips'),
(5,'Coke'),
(5,'Chips'),
(6,'Coke'),
(7,'Coke')

DECLARE clients CURSOR
READ_ONLY
FOR SELECT DISTINCT clientID from @demodata

DECLARE @clientID INT
OPEN clients

FETCH NEXT FROM clients INTO @clientID
WHILE (@@fetch_status <> -1)
BEGIN
       DECLARE @ThisClientCombination VARCHAR(MAX) --This is going to be used to find/update combinations of products sold from the data - pipe-delimiting them
       SET @ThisClientCombination  = '' --better to actually wipe variables in cursor loops, as cursors sometimes funny about things like that
       ;WITH thisClientsDistinctCoverages AS --CTE used because of the SQL-funnies caused when using DISTINCT and ORDER BY
       (
              SELECT DISTINCT TOP 100 PERCENT CoverageItems = ProductBought + '|' FROM @demodata WHERE ClientID = @clientID ORDER BY ProductBought + '|' --order alphabetically and do DISTINCT so that 2x cokes are changed into just 1 coke - for the sake of combos
       )
       SELECT @ThisClientCombination = @ThisClientCombination + CoverageItems FROM thisClientsDistinctCoverages
       SET @ThisClientCombination = LEFT(@ThisClientCombination,LEN(@ThisClientCombination)-1) --Drop trailing Pipe '|'

       IF EXISTS(SELECT * FROM @OutputCombination WHERE CombinationName = @ThisClientCombination)
              UPDATE @OutputCombination SET CombinationsFound = CombinationsFound + 1 WHERE CombinationName = @ThisClientCombination --found before, increase count by 1
       ELSE
              INSERT INTO @OutputCombination (CombinationName, CombinationsFound) VALUES (@ThisClientCombination, 1)--first time found, make entry 1

       FETCH NEXT FROM clients INTO @clientID
END

CLOSE clients
DEALLOCATE clients

--Show the results
SELECT * FROM @OutputCombination ORDER BY CombinationsFound DESC, CombinationName

This will yield the results:

CombinationName---------CombinationsFound
Coke--------------------------3
Chips|Coke------------------2
Chips--------------------------1
Coke|Sweets----------------1

Interpretation of above results:
Most often, 3 times, a customer only bought coke(s)
Next up, twice a customer bought chips and a coke
Once a customer only bought chips
And once a customer bought a coke & sweets