My CSV files have a header on the first line. Loading them into Pig creates a mess in any subsequent function (like SUM). Currently, I first apply a filter to the loaded data to remove the rows containing the headers:
affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray) ;
affaires = filter affaires by date matches '../../..';
This method seems clumsy to me, and I am wondering whether there is a way to tell Pig not to load the first line of the CSV, such as an "as_header" boolean parameter to the load function.
I don't see one in the docs. What would be the best practice? How do you usually deal with this?
The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.
Example
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (with the SKIP_INPUT_HEADER option):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
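To tie this back to the SUM problem in the question, here is a minimal sketch that adds a schema to the same load and aggregates the Age column. The field names and types in the AS clause are my own choice for this sample file, and the jar path is the same placeholder as above, so adjust both to your setup.

```pig
REGISTER '/tmp/piggybank.jar';

-- Load input.csv and skip the header line; the schema below is an assumption for this sample.
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:long, location:chararray);

-- Put all rows into a single group so SUM runs over the whole column.
G = GROUP A ALL;
S = FOREACH G GENERATE SUM(A.age);

DUMP S;
```

With the two data rows shown above, this should print (30), since the header row never reaches the aggregation.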
Another simple option for Pig 0.9, which does not have the SKIP_INPUT_HEADER option, is to filter out the header row after loading:
Input file (input.csv):
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (without the SKIP_INPUT_HEADER option, as it is not available in Pig 0.9):
register '<Your location>/piggybank.jar';
d_with_headers = LOAD 'input.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage() AS (name:chararray, age:long, location:chararray);
d = FILTER d_with_headers BY name != 'Name';
dump d;
Output:
(a,10,chennai)
(b,20,banglore)