I am loading csv to dataframe using -
sqlContext.read.format("com.databricks.spark.csv").option("header", "true").
option("delimiter", ",").load("file.csv")
but my input file contains date in the first row and header from second row.
example
20160612
id,name,age
1,abc,12
2,bcd,33
How can i skip this first row while converting csv to dataframe?
Here are several options that I can think of since the data bricks module doesn't seem to provide a skip line option:
Option one: Add a "#" character in front of the first line, and the line will be automatically considered as comment and ignored by the data.bricks csv module;
Option two: Create your customized schema and specify the mode
option as DROPMALFORMED
which will drop the first line since it contains less token than expected in the customSchema:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val customSchema = StructType(Array(StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("age", IntegerType, true)))
val df = sqlContext.read.format("com.databricks.spark.csv").
option("header", "true").
option("mode", "DROPMALFORMED").
schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping
malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+
Note the warning message here which says dropped malformed line:
Option three: Write your own parser to drop the line that doesn't have length of three:
val file = sc.textFile("pathToYourCsvFile")
val df = file.map(line => line.split(",")).
filter(lines => lines.length == 3 && lines(0)!= "id").
map(row => (row(0), row(1), row(2))).
toDF("id", "name", "age")
df.show
+---+----+---+
| id|name|age|
+---+----+---+
| 1| abc| 12|
| 2| bcd| 33|
+---+----+---+