I have a CSV file with the data below:
1,2,5
2,4
2,3
I want to load it into a DataFrame whose schema is a single column of type array of strings.
The output should look like this:
[1, 2, 5]
[2, 4]
[2, 3]
This has been answered for Scala here:
Spark: Convert column of string to an array
I want to make it happen in Java.
Please help
Below is sample code in Java. Read the file with the spark.read().text(String path) method, which gives you one string column named value with one row per line, and then call the split function on that column.
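import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.split;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .master("local[*]")
                .getOrCreate();

        // Read each line of the file as a single string column named "value"
        Dataset<Row> ds = spark.read().text("c://tmp//sample.csv").toDF("value");
        ds.show(false);

        // Split each line on commas to get an array<string> column
        Dataset<Row> ds1 = ds.select(split(ds.col("value"), ",")).toDF("new_value");
        ds1.show(false);
        ds1.printSchema();
    }
}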
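To see what the split step does to each line without starting Spark, here is a minimal plain-Java sketch of the same per-row transformation. The class name SplitSketch and the method splitLines are made up for illustration and are not part of any Spark API. It uses String.split with a limit of -1 so that trailing empty fields are kept; as far as I know that matches how Spark's split treats them, but treat that as an assumption and check it against your Spark version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    // Hypothetical helper: splits every input line on commas.
    // The limit of -1 keeps trailing empty fields, e.g. "1,2," -> ["1", "2", ""].
    public static List<String[]> splitLines(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Same three lines as the sample file in the question
        List<String> lines = Arrays.asList("1,2,5", "2,4", "2,3");
        for (String[] row : splitLines(lines)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Running main prints [1, 2, 5], [2, 4] and [2, 3], one per line, which is the shape the question asks for. Rows may have different lengths, which is why an array column fits better here than a fixed set of CSV columns.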
You can also use the VectorAssembler class to combine several columns into a single vector of features, which is particularly useful with ML pipelines (example in Scala):
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("city", "status", "vendor"))
  .setOutputCol("features")
https://spark.apache.org/docs/2.2.0/ml-features.html#vectorassembler