PIG Script REPLACE with pipe symbol

Posted 2020-03-26 05:16

Question:

I want to strip characters outside of the curly brackets in rows that look like the following.

35|{......}|

That is, I want to strip the '35|' from the front and the trailing '|' from the end, leaving:

{.....}

Starting with the first three characters, I tried the following, but it removes everything.

 a = LOAD '/file' as (line1:chararray);

 b = FOREACH a generate REPLACE(line1, '35|','');

 dump b;

Any thoughts appreciated. Thanks.

Answer 1:

|, { and } are special characters in regular expressions, and the second parameter of REPLACE is a regular expression. An unescaped | means alternation ("or"), so '35|' does not match the literal text 35|. Escape the pipe with a double backslash:

b = FOREACH a generate REPLACE(line1, '35\\|','');
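
The effect of the escaped pipe can be checked outside Pig. The following is a minimal sketch in JavaScript rather than Pig, assuming the sample row from the question; the ^ and $ anchors and the second replacement for the trailing pipe are additions for illustration, not part of the answer above. Inside a Pig string literal the same patterns would need the doubled backslash, as in '35\\|'.

```javascript
// Sample row from the question.
var row = '35|{......}|';

// Escaped pipe: matches the literal text "35|" at the start of the row.
var noPrefix = row.replace(/^35\|/, '');

// Escaped pipe anchored at the end: removes the trailing "|".
var stripped = noPrefix.replace(/\|$/, '');

console.log(noPrefix); // {......}|
console.log(stripped); // {......}
```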


Answer 2:

Some more information: if you need to transform the data into a more complex form that REPLACE alone cannot produce, you can write a User Defined Function (UDF) in JavaScript, Java, Jython, Ruby, Groovy, or Python. The UDF takes your data as input and returns the processed data.

Example of a JavaScript UDF:

Pig Script:

 --including the js file containing the UDF
 register 'test.js' using javascript as myfuncs;

 a = LOAD '/file' as (line1:chararray);

 --Processing each line1 by calling the UDF
 b = FOREACH a generate myfuncs.processData(line1);
 dump b;

test.js

 processData.outputSchema = "word:chararray,num:long";

 function processData(word){
    return {word:word, num:word.length};
 }

To see how UDFs work, check the Pig documentation on UDFs.
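
Applied to the original question, a UDF along these lines could do the stripping. This is only a sketch: the function name stripOutsideBraces and the schema field name body are hypothetical, and it assumes each row holds a single {...} section whose outermost braces mark the part to keep.

```javascript
// Hypothetical UDF: keeps the text from the first "{" to the last "}".
stripOutsideBraces.outputSchema = "body:chararray";

function stripOutsideBraces(line) {
    // Pass missing values through as null.
    if (line === null || line === undefined) {
        return null;
    }
    var s = String(line);
    var start = s.indexOf('{');
    var end = s.lastIndexOf('}');
    // No usable pair of braces in this row.
    if (start === -1 || end < start) {
        return null;
    }
    return s.substring(start, end + 1);
}
```

If this function were added to test.js, the Pig script above would call it as myfuncs.stripOutsideBraces(line1) in place of myfuncs.processData(line1).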



Answer 3:

You could use REGEX_EXTRACT to capture the braces and everything between them:

REGEX_EXTRACT(line1, '.*(\\{.*\\}).*', 1);

http://pig.apache.org/docs/r0.12.1/func.html#regex-extract
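
One caveat with this pattern is that the leading .* is greedy, so if the content can contain nested braces, the capture group starts at the last { rather than the first. The sketch below illustrates this in JavaScript rather than Pig; the nested sample row, the helper extract, and the alternative pattern are assumptions for illustration and do not come from the answer. In a Pig string literal the backslashes would be doubled.

```javascript
// Pattern from the answer, written as a JavaScript regex literal.
var greedy = /.*(\{.*\}).*/;

// Hypothetical alternative: the prefix may not contain "{" and the suffix
// may not contain "}", so the group spans the outermost braces.
var anchored = /^[^{]*(\{.*\})[^}]*$/;

// Returns capture group 1, or null when the pattern does not match.
function extract(re, s) {
    var m = re.exec(s);
    return m ? m[1] : null;
}

console.log(extract(greedy, '35|{......}|'));   // {......}
console.log(extract(greedy, '1|{a:{b:1}}|'));   // {b:1}}
console.log(extract(anchored, '1|{a:{b:1}}|')); // {a:{b:1}}
```

For rows like the one in the question, with a single pair of braces, both patterns return the same result.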