Pig UDF to calculate time difference in weblogs

Posted 2019-08-16 01:29

Question:

Is there a Pig UDF that calculates time difference in the weblogs?

Assume I have weblogs in the following format:

10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"

The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (the first entry in the weblogs). Next, the same user visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output can be:

10.171.100.10 someurl/page1 121 sec 
10.171.100.10 someurl/page2 148 sec etc ..

The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about it using Pig UDFs or any other technology?

Answer 1:

I don't know of any built-in function that uses the content of the following rows to compute a value, because the row sequence is variable and therefore highly unreliable.

You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER by IP and date, and GROUP your data set by IP, before starting a MapReduce job on each IP (or IP group). This ensures that all the rows corresponding to a particular IP are processed by the same node.
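To make the group-then-order idea concrete, here is a minimal sketch of the logic in plain Python rather than Pig. It is an illustration, not a Pig UDF: the names LOG_RE, parse and time_on_page are hypothetical, the regular expression assumes the log layout shown in the question, and the quoted field after the byte count is taken as the page because that is how the question reads it. Month names are parsed with %b, which assumes an English locale.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: IP, two ignored fields, [timestamp], "request",
# status, bytes, "page" (the someurl/pageN field in the question).
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "([^"]*)"')

def parse(line):
    """Return (ip, timestamp, page), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    ip, ts, page = m.groups()
    return ip, datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z"), page

def time_on_page(lines):
    """Group by IP, order each group by time, diff consecutive visits."""
    by_ip = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec is not None:
            ip, ts, page = rec
            by_ip[ip].append((ts, page))
    results = []
    for ip, visits in by_ip.items():
        visits.sort(key=lambda v: v[0])  # input order is not trusted
        # Pair each visit with the next one; the last visit has no successor
        # and is dropped, as the question asks.
        for (t0, page), (t1, _) in zip(visits, visits[1:]):
            results.append((ip, page, int((t1 - t0).total_seconds())))
    return results
```

In a Pig job the same steps would roughly correspond to a GROUP by IP followed by a UDF that receives the bag of rows for one IP, sorts it by timestamp, and emits the differences.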

Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active, and when is a user returning? Without such a rule, you may end up with very long time ranges.
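One possible rule, shown here only as a sketch, is to treat any gap longer than a fixed inactivity cutoff as the user leaving and returning later. The 30-minute value and the names SESSION_TIMEOUT and dwell_times are assumptions chosen for illustration; they do not come from the question, and the right cutoff depends on the site.

```python
from datetime import datetime, timedelta

# Assumed inactivity cutoff; pick a value that suits your traffic.
SESSION_TIMEOUT = timedelta(minutes=30)

def dwell_times(visits, timeout=SESSION_TIMEOUT):
    """visits: list of (timestamp, page) for ONE IP, already sorted by time.

    Returns (page, seconds) pairs, skipping gaps longer than `timeout`,
    which are treated as the user having left and come back.
    """
    result = []
    for (t0, page), (t1, _) in zip(visits, visits[1:]):
        gap = t1 - t0
        if gap <= timeout:
            result.append((page, int(gap.total_seconds())))
        # Otherwise the time spent on `page` is unknown, so nothing is emitted.
    return result
```

With this rule, a visit at 14:41:47 followed by the next request at 18:00:00 produces no duration for the earlier page, instead of a reading of more than three hours.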