strsplit issue - Pig

2020-07-06 01:58发布

问题:

I have following tuple H1 and I want to strsplit its $0 into tuple.However I always get an error message:

DUMP H1:
(item32;item31;,1)

m = FOREACH H1 GENERATE STRSPLIT($0, ";", 50);

ERROR 1000: Error during parsing. Lexical error at line 1, column 40. Encountered: after : "\";"

Anyone knows what's wrong with the script?

回答1:

There is an escaping problem in the pig parsing routines when it encounters this semicolon.

You can use a unicode escape sequence for a semicolon: \u003B. However this must also be slash escaped and put in a single quoted string. Alternatively, you can rewrite the command over multiple lines, as per Neil's answer. In all cases, this must be a single quoted string.

H1 = LOAD 'h1.txt' as (splitme:chararray, name);

A1 = FOREACH H1 GENERATE STRSPLIT(splitme,'\\u003B'); -- OK
B1 = FOREACH H1 GENERATE STRSPLIT(splitme,';');       -- ERROR
C1 = FOREACH H1 GENERATE STRSPLIT(splitme,':');       -- OK
D1 = FOREACH H1 {                                     -- OK
    splitup = STRSPLIT( splitme, ';' );
    GENERATE splitup;
}

A2 = FOREACH H1 GENERATE STRSPLIT(splitme,"\\u003B"); -- ERROR
B2 = FOREACH H1 GENERATE STRSPLIT(splitme,";");       -- ERROR
C2 = FOREACH H1 GENERATE STRSPLIT(splitme,":");       -- ERROR
D2 = FOREACH H1 {                                     -- ERROR
    splitup = STRSPLIT( splitme, ";" );
    GENERATE splitup;
}

Dump H1;
(item32;item31;,1)

Dump A1;
((item32,item31))

Dump C1;
((item32;item31;))

Dump D1;
((item32,item31))


回答2:

STRSPLIT on a semi-colon is tricky. I got it to work by putting it inside of a block.

raw = LOAD 'cname.txt' as (name,cname_string:chararray);

xx = FOREACH raw {
  cname_split = STRSPLIT(cname_string,';');
  GENERATE cname_split;
}

Funny enough, this is how I originally implemented my STRSPLIT() command. Only after trying to get it to split on a semicolon did I run into the same issue.



标签: apache-pig