Is there a Pig UDF that calculates the time difference between weblog entries?
Suppose I have weblogs in the following format:
10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (the first entry in the weblogs). The same user then visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output could be:
10.171.100.10 someurl/page1 121 sec
10.171.100.10 someurl/page2 148 sec
and so on.
The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about this using Pig UDFs or any other technology?
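To make the expected computation concrete, here is a minimal single-machine sketch in Python. It is not a Pig UDF and will not scale to millions of lines as written; it only illustrates the logic I am after: group entries by IP, sort each group by timestamp, and subtract consecutive timestamps. The names `LOG_PATTERN` and `time_on_page` are hypothetical, the regular expression assumes every line follows the format shown above, and the month parsing assumes an English locale.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed layout: ip, two ignored fields, [timestamp], "request",
# status, bytes, "page", then the user agent (ignored).
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "([^"]*)"'
)


def time_on_page(lines):
    """Return (ip, page, seconds) for every visit followed by a later
    visit from the same IP. The last visit per IP is dropped."""
    visits = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ip, stamp, page = match.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        visits[ip].append((when, page))

    result = []
    for ip, entries in visits.items():
        # Input is not sorted, so order each IP's visits by time first.
        entries.sort(key=lambda entry: entry[0])
        for (start, page), (end, _next_page) in zip(entries, entries[1:]):
            seconds = int((end - start).total_seconds())
            result.append((ip, page, seconds))
    return result
```

Calling `time_on_page` on the three sample lines should give one row for page1 and one for page2, with nothing for page3. What I am looking for is the equivalent of this grouping and per-group ordering in Pig (or another tool) that works at scale.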