generating bigram combinations from grouped data i

Given my input data in (userid, itemid) format:

raw: {userid: bytearray,itemid: bytearray}

dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)

grpd = GROUP raw BY userid;

dump grpd;

(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})

I'd like to generate all of the pairwise combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in each group.

Ideally the bigrams would be generated and then I'd FLATTEN the output to look like:

(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))

The letters A, B and C, which represent the userid, are not really necessary in the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard similarity. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.

I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.

Tags: hadoop apache-pig similarity recommendation-engine

1 answer

迷人小祖宗

Reply #2 · 2019-05-14 17:39

You are definitely going to have to write a UDF (in Python or Java, either would be fine). You would want it to take a bag and output a bag: if you FLATTEN a bag of tuples, you get one output row per tuple, which gives you the output that you want.

The UDF itself would not be terribly difficult. Something like:

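# input_tuples: the bag for one user, e.g. [("A", 1), ("A", 2), ...]
letters, numbers = zip(*input_tuples)
numbers = sorted(set(numbers))  # drop duplicate items and fix the order

res = []
for i in range(len(numbers)):
    for j in range(i + 1, len(numbers)):  # i + 1 skips self-pairs like (1,1)
        res.append((numbers[i], numbers[j]))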

And then just cast things and return them appropriately.
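As a rough sketch of the same idea in plain Python (not yet wired into Pig), the pair generation could look like the following. The function name `item_pairs` and the sample bag are made up for illustration, and this assumes `itertools.combinations` is available in whatever interpreter runs the UDF; if it is not, the nested loops above do the same job.

```python
from itertools import combinations

def item_pairs(input_tuples):
    """Return all unordered item pairs for one user's bag of (userid, itemid) tuples."""
    # Keep only the item ids, drop duplicates, and sort so each pair has one canonical order.
    items = sorted(set(item for _user, item in input_tuples))
    # combinations() never pairs an element with itself and yields each pair exactly once.
    return list(combinations(items, 2))

# Hypothetical bag for user A, matching the sample data in the question.
bag_a = [("A", 1), ("A", 2), ("A", 4), ("A", 5)]
print(item_pairs(bag_a))
# [(1, 2), (1, 4), (1, 5), (2, 4), (2, 5), (4, 5)]
```

Sorting before pairing means (1,2) and (2,1) can never both show up, which matters once you start counting occurrences of each pair across users.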

Writing a simple Python UDF is not too bad. Check here: http://pig.apache.org/docs/r0.8.0/udf.html

And of course feel free to ask for more help here.
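For the counting step mentioned in the question, here is a small plain-Python sketch of how the pair counts could turn into Jaccard scores. It assumes the usual item-to-item definition (users who have both items divided by users who have either one); `jaccard_by_pair` is a hypothetical name, and in Pig the counting would more likely be done with GROUP and COUNT than in a script like this.

```python
from collections import Counter
from itertools import combinations

def jaccard_by_pair(rows):
    """rows: iterable of (userid, itemid). Returns {(item_a, item_b): jaccard}."""
    # Collect the distinct items seen for each user.
    items_by_user = {}
    for user, item in rows:
        items_by_user.setdefault(user, set()).add(item)

    item_counts = Counter()  # number of users that have each item
    pair_counts = Counter()  # number of users that have both items of a pair
    for items in items_by_user.values():
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))

    # Jaccard = |both| / |either|, where |either| = |a| + |b| - |both|.
    return dict(
        ((a, b), both / float(item_counts[a] + item_counts[b] - both))
        for (a, b), both in pair_counts.items()
    )

# Sample data from the question.
raw = [("A", 1), ("A", 2), ("A", 4), ("A", 5),
       ("B", 2), ("B", 3), ("B", 5),
       ("C", 1), ("C", 5)]
sims = jaccard_by_pair(raw)
print(sims[(2, 3)])  # item 2 has users A and B, item 3 has only B
# 0.5
```

Only pairs that co-occur for at least one user get a score; pairs that never appear together (such as items 1 and 3 here) are simply absent, which corresponds to a similarity of 0.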
