I want to strip characters outside of the curly brackets in rows that look like the following.
35|{......}|
I want to remove the leading '35|' and the trailing '|', leaving:
{.....}
Starting with just the first 3 characters, I tried the following, but it removes everything.
a = LOAD '/file' as (line1:chararray);
b = FOREACH a generate REPLACE(line1, '35|','');
dump b;
Any thoughts appreciated. Thanks.
|, { and } are special characters in regular expressions, and the second parameter of REPLACE is a regular expression. An unescaped | is the alternation operator, so '35|' does not match the literal text 35|. Try escaping the character:
b = FOREACH a generate REPLACE(line1, '35\\|','');
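To see the difference between the escaped and unescaped pipe outside of Pig, here is a small JavaScript sketch. It is only an illustration: Pig uses Java regular expressions, but alternation, the escaped pipe, and the ^ and $ anchors behave the same way in both for this case. The sample line and the variable names are made up. The last pattern also removes the trailing '|' in the same pass; following the escaping shown above, it would presumably be written '^35\\||\\|$' in a Pig string literal, which I have not run.

```javascript
// Illustrative input in the shape from the question; the payload is made up.
const line = "35|{a:1}|";

// Unescaped: '|' is alternation, so the pattern means "35" OR the empty string.
// The literal pipes are therefore left in place.
const unescaped = line.replace(/35|/g, "");

// Escaped: '\|' matches a literal pipe, so the whole "35|" prefix is removed.
const escaped = line.replace(/35\|/g, "");

// Anchored alternation: remove the leading "35|" and the trailing "|" together.
const both = line.replace(/^35\||\|$/g, "");

console.log(unescaped); // |{a:1}|
console.log(escaped);   // {a:1}|
console.log(both);      // {a:1}
```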
Some more information: if you need to transform the data in a way that cannot be achieved with REPLACE alone, you can write a User Defined Function (UDF) in Java, JavaScript, Jython, Python, Ruby, or Groovy that takes your data as input and returns the processed data.
Example of a JavaScript UDF:
Pig Script:
--including the js file containing the UDF
register 'test.js' using javascript as myfuncs;
a = LOAD '/file' as (line1:chararray);
--Processing each line1 by calling UDF
b = FOREACH a generate myfuncs.processData(line1);
dump b;
test.js:
processData.outputSchema = "word:chararray,num:long";
function processData(word){
return {word:word, num:word.length};
}
To see how UDFs work, please check the Pig documentation for UDFs.
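Applied to the original question, a UDF along these lines could do the stripping without any regular expression. This is a minimal sketch, not tested inside Pig: the function name stripOutsideBraces and the output field name are my own, and it assumes you want everything from the first '{' to the last '}' on the line.

```javascript
// Hypothetical UDF (name and schema are illustrative, not from the Pig docs).
// Declares that the function returns a single chararray field.
stripOutsideBraces.outputSchema = "line:chararray";

function stripOutsideBraces(line) {
    // Pass missing input through as null rather than failing.
    if (line === null || line === undefined) {
        return null;
    }
    var start = line.indexOf("{");    // first opening brace
    var end = line.lastIndexOf("}");  // last closing brace
    // No well-formed {...} span on this line.
    if (start === -1 || end < start) {
        return null;
    }
    // Keep the braces themselves, drop everything outside them.
    return line.substring(start, end + 1);
}
```

If this were added to test.js, it would be called the same way as processData above, e.g. b = FOREACH a generate myfuncs.stripOutsideBraces(line1);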
You could use REGEX_EXTRACT:
b = FOREACH a generate REGEX_EXTRACT(line1, '.*(\\{.*\\}).*', 1);
http://pig.apache.org/docs/r0.12.1/func.html#regex-extract