Given my input data in userid,itemid format:
raw: {userid: bytearray,itemid: bytearray}
dump raw;
(A,1)
(A,2)
(A,4)
(A,5)
(B,2)
(B,3)
(B,5)
(C,1)
(C,5)
grpd = GROUP raw BY userid;
dump grpd;
(A,{(A,1),(A,2),(A,4),(A,5)})
(B,{(B,2),(B,3),(B,5)})
(C,{(C,1),(C,5)})
I'd like to generate all of the combinations (order not important) of items within each group. I eventually intend to compute Jaccard similarity on the items in my group.
Ideally the bigrams would be generated and then I'd FLATTEN the output to look like:
(A, (1,2))
(A, (1,4))
(A, (1,5))
(A, (2,4))
(A, (2,5))
(A, (4,5))
(B, (2,3))
(B, (2,5))
(B, (3,5))
(C, (1,5))
The letters A, B, and C, which represent the userid, are not really necessary for the output; I'm just showing them for illustrative purposes. From there, I would count the number of occurrences of each bigram in order to compute Jaccard. I'd love to know if anyone else is using Pig for similar similarity calcs (sorry!) and has encountered this already.
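For reference, the counting step described above can be sketched in plain Python 3 (outside Pig). This assumes Jaccard is meant between two items, computed over the sets of users who have each item, so the score for a pair is (users with both) / (users with either). The function name `jaccard_from_pairs` and the dict-shaped input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def jaccard_from_pairs(user_items):
    """user_items: dict mapping userid -> iterable of itemids (assumed shape)."""
    item_count = defaultdict(int)  # number of users that have each item
    pair_count = defaultdict(int)  # number of users that have both items of a pair
    for items in user_items.values():
        unique = sorted(set(items))
        for item in unique:
            item_count[item] += 1
        # Emit each unordered pair once: every item with every later item.
        for i in range(len(unique)):
            for j in range(i + 1, len(unique)):
                pair_count[(unique[i], unique[j])] += 1
    # |A and B| / |A or B|, where |A or B| = |A| + |B| - |A and B|
    return {
        pair: both / (item_count[pair[0]] + item_count[pair[1]] - both)
        for pair, both in pair_count.items()
    }

sample = {"A": [1, 2, 4, 5], "B": [2, 3, 5], "C": [1, 5]}
similarities = jaccard_from_pairs(sample)
```

With the sample data from the question, items 1 and 5 are shared by users A and C, item 1 appears for two users and item 5 for three, so their score is 2 / (2 + 3 - 2).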
I've looked at the NGramGenerator that's supplied with the Pig tutorials, but it doesn't really match what I'm trying to accomplish. I'm wondering if perhaps a Python streaming UDF is the way to go.
You are definitely going to have to write a UDF (in Python or Java; either would be fine). You would want it to take a bag and output a bag (if you FLATTEN a bag of tuples, you get one output row per tuple, which gives you the output that you want).
The UDF itself would not be terribly difficult: something like pairing each item in the bag with every item after it, and then just cast things and return them appropriately.
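A minimal sketch of what such a UDF might look like as a Jython UDF follows. It is not the code originally posted, and several things are assumptions: the function name `item_pairs`, the file name `pairs.py`, the output schema string, and that the UDF is called on the projection `raw.itemid`, so each tuple in the bag holds a single field. It uses plain nested loops so that it does not depend on newer `itertools` helpers.

```python
try:
    outputSchema  # assumed to be supplied by Pig when the script is registered
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the file also runs under plain Python for testing.
        def wrap(func):
            return func
        return wrap

@outputSchema("pairs:bag{t:tuple(item1:chararray,item2:chararray)}")
def item_pairs(bag):
    # The bag arrives as a list of tuples; with raw.itemid each tuple is (itemid,).
    items = sorted(set(t[0] for t in bag))
    pairs = []
    # Pair every item with every later item, so each unordered pair appears once.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            pairs.append((items[i], items[j]))
    return pairs

# Hypothetical usage from Pig (relation and alias names are assumptions):
#   REGISTER 'pairs.py' USING jython AS myudfs;
#   pairs = FOREACH grpd GENERATE group, FLATTEN(myudfs.item_pairs(raw.itemid));
```

Because the input fields are declared as bytearray, you may need to cast itemid to chararray before calling the UDF, depending on how the values arrive in Jython.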
If you need any help making a simple Python UDF, it's not too bad. Check here: http://pig.apache.org/docs/r0.8.0/udf.html
And of course, feel free to ask for more help here.