I'm trying to use Pig to break text into lowercased words and then look up each word in a map. Here's my example map, which I have in mapping.txt (it is only one line long):
[this#1.9,is#2.5,my#3.3,vocabulary#4.1]
I load this like so:
M = LOAD 'mapping.txt' USING PigStorage AS (mp: map[float]);
which works just fine. Then I do the following to load the text and break it into lowercased words:
LINES = LOAD 'test.txt' USING TextLoader() AS (line:chararray);
TOKENS = FOREACH LINES GENERATE FLATTEN(TOKENIZE(LOWER(line))) as (word:chararray);
Now, I'd like to do something like this:
RESULTS = FOREACH TOKENS GENERATE M.mp#word;
so that if I have a line like "this my my vocabulary", I'd get the following output: 1.9 3.3 3.3 4.1. Instead, I keep getting various errors. How can I look up variable values in a map?
I've looked at "How can I use the map datatype in Apache Pig?" and http://pig.apache.org/docs/r0.10.0/basic.html#map-schema, but these only help if I'm looking up a fixed value in a map, for example M.mp#'this', which is not what I want to do here.
You can also FLATTEN M into (word, value) pairs and then JOIN the result with TOKENS on the word. You can make this a 'replicated' join, with M as the small relation, so that it is copied to each mapper.
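A rough sketch of that approach, not tested against your data. It assumes a Pig version whose FLATTEN accepts a map (0.17 or later, as far as I know). The relation names VOCAB and JOINED and the field name val are made up for illustration. On older versions you could instead store the vocabulary as a two-column file (word, value) and load it directly.

```pig
-- Assumption: FLATTEN on a map is supported (Pig 0.17+, as far as I know).
-- It turns each key/value pair of the map into its own (key, value) row.
M = LOAD 'mapping.txt' USING PigStorage() AS (mp: map[float]);
VOCAB = FOREACH M GENERATE FLATTEN(mp) AS (word:chararray, val:float);

-- Same tokenization as in the question.
LINES = LOAD 'test.txt' USING TextLoader() AS (line:chararray);
TOKENS = FOREACH LINES GENERATE FLATTEN(TOKENIZE(LOWER(line))) AS (word:chararray);

-- Replicated join: the small relation goes last and is held in memory on each mapper.
JOINED = JOIN TOKENS BY word, VOCAB BY word USING 'replicated';
RESULTS = FOREACH JOINED GENERATE TOKENS::word AS word, VOCAB::val AS val;
DUMP RESULTS;
```

Note that this is an inner join, so tokens that are not in the vocabulary are dropped, and the rows do not necessarily come back in the original word order. If you need to keep unknown words, a LEFT OUTER join would leave val null for them.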