Spark Java: how to compare schemas when columns ar

Posted 2019-09-21 21:55

Question:

Following this question, I now run this code:

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);

fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);

finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));

Here's the output:

root
 |-- A: long (nullable = true)
 |-- B: double (nullable = true)

root
 |-- B: double (nullable = true)
 |-- A: long (nullable = true)

StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false

Although the columns are not arranged in the same order, both datasets have exactly the same columns and column types. What comparison is required here in order to get true?

Answer 1:

If the columns are in a different order, the schemas are not equal, even if both have the same number of columns and the same names. If you want to check whether both schemas have the same column names, get the field names from both DataFrames as lists and write code to compare them. See the Java example below.

public static void main(String[] args)
{

    List<String> firstSchema =Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
    List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());


    if(schemasHaveTheSameColumnNames(firstSchema,secondSchema))
    {
        System.out.println("Yes, schemas have the same column names");
    }else
    {
        System.out.println("No, schemas do not have the same column names");
    }
}

private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
{
    if(firstSchema.size() != secondSchema.size())
    {
        return false;
    }else 
    {
        for (String column : secondSchema)
        {
            if(!firstSchema.contains(column))
                return false;
        }
    }
    return true;
}


Answer 2:

This assumes that the column order does not match, that columns with the same name have the same semantics, and that the same number of columns is required.

An example using Scala, which you should be able to adapt to Java:

import spark.implicits._
val df = sc.parallelize(Seq(
        ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
        ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
        )).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns

val df2 = sc.parallelize(Seq(
       ("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns

names.sortWith(_ < _) sameElements names2.sortWith(_ < _)

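The last expression returns true when both DataFrames have the same column names regardless of order, and false otherwise; experiment with the input.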
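
For Java, the same name-only check might be sketched with plain arrays, assuming the names come from Dataset.columns(), which returns a String[]. The class and method names below are hypothetical and chosen only for illustration, and the arrays in main are stand-ins for df.columns() and df2.columns().

```java
import java.util.Arrays;

public class ColumnNameCompare {

    // Returns true when both arrays hold the same column names, ignoring order.
    // Sorted copies are compared so the caller's arrays are left untouched.
    public static boolean sameColumnNames(String[] names1, String[] names2) {
        String[] sorted1 = names1.clone();
        String[] sorted2 = names2.clone();
        Arrays.sort(sorted1);
        Arrays.sort(sorted2);
        return Arrays.equals(sorted1, sorted2);
    }

    public static void main(String[] args) {
        // Stand-ins for df.columns() and df2.columns() (assumed inputs)
        String[] names = {"c1", "c2", "Val1", "Val2"};
        String[] names2 = {"Val2", "c2", "c1", "Val1"};
        System.out.println(sameColumnNames(names, names2));
    }
}
```

Because the arrays are sorted rather than turned into sets, a repeated column name on one side only still makes the comparison fail.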



Answer 3:

Following the previous answers, it seems that the fastest way to compare the StructFields (column names and types), and not just the names, is the following:

Set<StructField> set1 = new HashSet<>(Arrays.asList(schema1.fields()));
Set<StructField> set2 = new HashSet<>(Arrays.asList(schema2.fields()));
boolean result = set1.equals(set2);
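
One caveat: StructField equality also covers the nullable flag and metadata, so the set comparison above returns false when two schemas differ only in nullability. If only names and types matter, one possible sketch is to compare name-to-type maps and report which columns disagree. The example below uses plain Map<String, String> values as stand-ins; in Spark they could presumably be filled from each field's name() and dataType().typeName(). The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

public class NameTypeCompare {

    // Returns the names of columns that are missing on one side or whose
    // type names differ. An empty result means the two schemas match on
    // names and types, regardless of column order.
    // Note: a map cannot represent duplicate column names.
    public static Set<String> differingColumns(Map<String, String> schema1,
                                               Map<String, String> schema2) {
        Set<String> allNames = new TreeSet<>(schema1.keySet());
        allNames.addAll(schema2.keySet());

        Set<String> differing = new TreeSet<>();
        for (String name : allNames) {
            // get() returns null for a missing column, which never equals a type name
            if (!Objects.equals(schema1.get(name), schema2.get(name))) {
                differing.add(name);
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        // Stand-ins for schema1 and schema2 from the question (assumed type names)
        Map<String, String> first = new LinkedHashMap<>();
        first.put("A", "long");
        first.put("B", "double");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("B", "double");
        second.put("A", "long");

        System.out.println(differingColumns(first, second).isEmpty());
    }
}
```

Returning the differing names rather than a bare boolean makes it easier to see why two schemas fail to match.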