Remove single quotes from data using Pig

2020-07-22 18:38发布

问题:

This is what my data looks like

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

I want to remove (, ) and ' from this data using Pig Script. I want my data to look like this-

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON

I am stuck on this from quite long time. Please help. Thanks in advance.

回答1:

Can you try REPLACE function with the below regex?

Explanation:
In Regex there are few characters have special meanings \ ^ $ . , | ? * + ( ) [ {. These special characters are called as "metacharacters". If you want to use any of these characters as part of your regex, then you need to escape them with a single backslash. In our case Pig uses Java based regex engine so all the specials characters needs be escaped with double backslash (Java uses \\ double backslash to differentiate the special characters ).

To remove '(' ')' and '(single quote) characters from your input.
1. Just Replace () with double backslash \\(\\).
2. '(single quote) is special character in Pig(default string literal), so this also required double backslash to remove the special meaning but double backslash doesn't convince pig parser(you will get error for double backslash) that is the reason i used three backslash for single quote \\\' to remove the special meaning.
3. [] is character class, this will match only one out of several characters. Simply place the characters inside the square bracket that you want to match ie. in our case its [()'].
4. + symbol is for matching one or more characters.

input

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

PigScript1:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
STORE B INTO 'output';

Pigscript2:

A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
STORE B into 'output1' USING PigStorage(',');

Output: will be stored in output/part-m-00000 file

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON


回答2:

When your data has double quote instead of the single quote following will work.

loaded = LOAD 'input' AS (row:chararray);
formatted = FOREACH loaded GENERATE REPLACE(row,'[\\"\\(\\)]+','');
STORE formatted INTO 'hdfs_output';

--And to use corrected data you can load it as usual

formatted_data = LOAD 'hdfs_output' USING PigStorage('$DELIMITER','-noschema') AS (column_name:datatype,..etc);


回答3:

When your data has double quotes here is potentially easier way as it doesn't use regex:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'"','');
STORE B INTO 'output';


回答4:

When you are aware that your first column is holding numeric values and the rest are holding characters, you can adopt the following approach also.

Loading the file with delimiter as ,

sam = load 'sample.txt' using PigStorage(',');

sam1 = foreach sam generate REPLACE($0,'([^0-9\\s]+)',''),REPLACE($1,'([^A-Za-z\\s]+)',''),REPLACE($2,'([^A-Za-z\\s]+)','');

You will get the following output:

(10, ACCOUNTING, NEW YORK )

(20, RESEARCH, DALLAS )

(30, SALES, CHICAGO )

(40, OPERATIONS, BOSTON)