Is it possible to run apriori association rule in

2019-04-12 14:20发布

Database:

Transacation#   Items List
T1              butter
T1              jam
T2              butter
T3              bread
T3              ice cream
T4              butter
T4              jam

In the above table, Is it possible to run apriori association rule in mysql statement?

For example, the support of buys(T, butter) --> buys(T, jam) = 50%

because there are 4 transactions and T1, T4 satisfy "support" rule.

Can i just use a sql statement to find out such result?

标签: mysql apriori
2条回答
何必那么认真
2楼-- · 2019-04-12 15:10

Yes, you can use SQL to find the support of a single item. But if you want to find itemsets containing more than one item, it would be difficult.

For example, if you had transactions containing several items and you want to find the support of "jam" with "milk" and "bread" appearing together, then it is better to use an algorithm like Apriori, or a faster algorithm like FPGrowth.

查看更多
手持菜刀,她持情操
3楼-- · 2019-04-12 15:18

I arrive at 66% for the sample data you gave? There are 3 transactions for "butter", and out of those only 2 included "jam".

I used the following test table.

create table transactions(
   trans_no     varchar(5)  not null
  ,item         varchar(20) not null
  ,primary key(trans_no, item)
);

insert into transactions(trans_no, item)
values ('T1', 'butter')
      ,('T1', 'jam')
      ,('T2', 'butter')
      ,('T3', 'bread')
      ,('T3', 'ice cream')
      ,('T4', 'butter')
      ,('T4', 'jam');

The following is my attempt at an answer. The inner select find all transactions that included "butter". For each such transaction, it also sets a flag (bought_jam) saying whether that transaction also included "jam". (The having clause excludes transactions that includes "jam" but not "butter").
In the outer select, I basically count all rows (the count corresponds to the number of transactions including butter), and sums the jam flag, which corresponds to the number of transactions including both butter and jam.

select sum(bought_jam) as jams_bought
      ,count(*) as num_trans
      ,100 * sum(bought_jam) / count(*) as correlation_pct
  from (select trans_no
              ,max(case when item = 'jam' then 1 else 0 end) as bought_jam
          from transactions
         where item in('butter', 'jam')
         group 
            by trans_no
        having min(case when item = 'butter' then item end) = 'butter'
       ) butter_trans;

The query above gives the following result:

+-------------+-----------+-----------------+
| jams_bought | num_trans | correlation_pct |
+-------------+-----------+-----------------+
|           2 |         3 |         66.6667 |
+-------------+-----------+-----------------+
1 row in set (0.00 sec)

Let me know how this works out for you.

Edit:
The following query would give the same results, but is much easier to read. However, if the transactions table is very large, and the item = x isn't very selective (returns lots of rows), this query would almost certainly be slower.

select count(t2.trans_no) as jams_bought
      ,count(*) as num_trans
      ,count(t2.trans_no) / count(*) as correlation_pct
  from transactions t1
  left join transactions t2 on(t2.trans_no = t1.trans_no and t2.item = 'jam')
 where t1.item = 'butter';
查看更多
登录 后发表回答