I'm using Hive for batch-processing of my spatial database. My trace table looks something like this:
object | lat | long | timestamp
1      | X11 | X12  | T11
1      | X21 | X22  | T12
2      | X11 | X12  | T21
1      | X31 | X22  | T13
2      | X21 | X22  | T22
I want to map each lat/long point of each object to a number (think of map-matching, for example), but the algorithm needs to consider a number of adjacent data points to produce the result. For example, it needs all 3 data points of object 1 in order to map each of those 3 points to a number; I can't process them one by one.
I'm thinking of using map-reduce with Hive using transform, but I'm not sure how to do this. Can someone please help me out?
You can use the custom map reduce functionality in Hive.
With the following:
add file /some/path/identity.pl;
add file /some/path/collect.pl;
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id
) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as id, list;
trace_input contains your trace data as described above:
create table trace_input(id string, lat string, lon string, ts string)
  row format delimited
  fields terminated by '\t'
  stored as textfile;
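To experiment with the sample data, you could populate this table from a local tab-separated file; a sketch (the path is an example, not from the original answer):

```sql
-- Hypothetical: load tab-separated trace rows into the input table.
load data local inpath '/some/path/trace.tsv'
overwrite into table trace_input;
```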
identity.pl is a simple script that dumps out each line unchanged (it could also be a script that selects just the lat, lon fields):
#!/usr/bin/perl
while (<STDIN>) {
    print;
}
collect.pl is a simple script that collects consecutive lines with the same object id, saves the remainder of each line, and dumps out one line per id with a comma-separated list of the values (tab-separated from the id).
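The collect script can be written in any language that reads stdin and writes stdout; a minimal Python sketch of the behavior described above (function name is mine):

```python
import sys

def collect(lines):
    """Group consecutive tab-separated lines by their first field (the id),
    yielding one "id<TAB>comma,separated,values" line per group.
    Assumes the input arrives sorted by id, as CLUSTER BY guarantees."""
    cur_id, values = None, []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        obj_id, rest = fields[0], fields[1:]
        if obj_id != cur_id:
            if cur_id is not None:
                yield cur_id + "\t" + ",".join(values)
            cur_id, values = obj_id, []
        values.extend(rest)
    if cur_id is not None:
        yield cur_id + "\t" + ",".join(values)

# In the actual reduce script you would stream standard input:
#   for out in collect(sys.stdin): print(out)
```

Your map-matching logic would go where the group is emitted, since at that point all adjacent data points for one object are available together.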
The cluster by clause ensures that each reducer receives all rows for an id together, sorted as the collect script requires.
The output of the user scripts is a set of tab-separated STRING columns.
Running the query results in the following output:
1 X11,X12,T11,X21,X22,T12,X31,X22,T13
2 X11,X12,T21,X21,X22,T22
You can modify the map script to limit the columns, and/or modify the reduce script to add results or separate the lat, lon from the ts, etc.
If this form is sufficient, you could insert directly into a result table by adding an insert before the reduce:
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id
) map_output
INSERT overwrite table trace_res
REDUCE id, lat, lon, ts
USING './collect.pl';
The fields will be converted from string fields to match the schema of trace_res as necessary.
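For example, a result table matching the reduce output might be declared as (schema is assumed, not from the original answer):

```sql
-- Hypothetical result table for the INSERT above; Hive casts the
-- tab-separated string output of collect.pl to these column types.
create table trace_res(id int, list string);
```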
If you use collection types like I do, you can also do something like:
create table trace_res as
select sq.id, split(sq.list, ",") from
(
  from (
    from trace_input
    MAP id, lat, lon, ts
    USING './identity.pl'
    as id, lat, lon, ts
    CLUSTER BY id
  ) map_output
  REDUCE id, lat, lon, ts
  USING './collect.pl' as (id int, list string)
) sq;
The second field of the created table will then be an array of all the lat, lon, ts values; in practice you will probably want a more structured schema than that.
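If you instead keep the flat comma-separated string, a small helper can regroup it into points downstream; a sketch (helper name is mine, assuming the lat, lon, ts ordering shown above):

```python
def to_triples(flat_list):
    """Split collect.pl's flat "lat,lon,ts,lat,lon,ts,..." string back
    into (lat, lon, ts) tuples. Assumes the value count is a multiple of 3."""
    parts = flat_list.split(",")
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]
```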