Following this question, I now run this code:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);
fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);
finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));
Here's the output:
root
|-- A: long (nullable = true)
|-- B: double (nullable = true)
root
|-- B: double (nullable = true)
|-- A: long (nullable = true)
StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false
While the columns are not arranged in the same order, both datasets have exactly the same columns and column types. What comparison is required here in order to get true?
If the fields are in a different order, the schemas are not equal, even when both have the same number of columns and the same names. If you only want to check whether both schemas have the same column names, get the field names of both DataFrames as lists and write the comparison yourself. See the Java example below:
public static void main(String[] args)
{
List<String> firstSchema =Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());
if(schemasHaveTheSameColumnNames(firstSchema,secondSchema))
{
System.out.println("Yes, schemas have the same column names");
}else
{
System.out.println("No, schemas do not have the same column names");
}
}
private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
{
if(firstSchema.size() != secondSchema.size())
{
return false;
}else
{
for (String column : secondSchema)
{
if(!firstSchema.contains(column))
return false;
}
}
return true;
}
This assumes that the column order may differ, that columns with the same name have the same meaning, and that both schemas must have the same number of columns.
An example in Scala, which you should be able to adapt to Java:
import spark.implicits._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns
val df2 = sc.parallelize(Seq(
("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns
names.sortWith(_ < _) sameElements names2.sortWith(_ < _)
The last expression returns true or false; experiment with the input.
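For readers who want the same check in Java, here is a minimal sketch of the sort-and-compare idea from the Scala snippet. The class and method names (ColumnNames, sameColumnNames) are made up for illustration; the method only assumes it receives two arrays of column names, such as the String[] returned by Dataset.columns().

```java
import java.util.Arrays;

final class ColumnNames {
    // Compares two column-name arrays while ignoring order.
    // The inputs are copied first so the callers' arrays are not reordered.
    static boolean sameColumnNames(String[] first, String[] second) {
        String[] a = first.clone();
        String[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        // After sorting, equal name sets (including duplicates) line up element by element.
        return Arrays.equals(a, b);
    }
}
```

With the two DataFrames above, the call would look like ColumnNames.sameColumnNames(df.columns(), df2.columns()). Like the Scala version, this compares names only, not types.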
Following the previous answers, it seems the quickest way to compare the StructFields
(column names and types), and not just the names, is the following:
Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
boolean result = set1.equals(set2);
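One caveat with the set comparison is that a HashSet collapses duplicates, so two field arrays of different lengths could still compare as equal. If that matters, a possible variant is to count occurrences instead. The sketch below is generic, so it runs without Spark on the classpath; the names SchemaCompare and sameFieldsIgnoringOrder are hypothetical, and it assumes the element type has a proper equals and hashCode (StructField is a Scala case class, so it does) and that the arrays contain no null elements.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SchemaCompare {
    // Counts how often each element occurs, so duplicates are not collapsed.
    private static <T> Map<T, Long> counts(T[] items) {
        return Arrays.stream(items)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // True when both arrays hold the same elements the same number of times, in any order.
    static <T> boolean sameFieldsIgnoringOrder(T[] first, T[] second) {
        return first.length == second.length && counts(first).equals(counts(second));
    }
}
```

With the schemas from the question, the call would be SchemaCompare.sameFieldsIgnoringOrder(schema1.fields(), schema2.fields()). Since StructField equality also covers nullability and metadata, two fields with the same name and type but different nullable flags are treated as different, both here and in the HashSet version.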