Suppose I am creating a data frame from a list without a schema:
data = [Row(c=0, b=1, a=2), Row(c=10, b=11, a=12)]
df = spark.createDataFrame(data)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 1| 0|
| 12| 11| 10|
+---+---+---+
Why are the columns reordered in alphabet order ?
Can I preserve the original order of columns without adding a schema ?
Why are the columns reordered in alphabet order ?
Because Row
created with **kwargs
sorts the arguments by name.
This design choice is required to address the issues described in PEP 468. Please check SPARK-12467 for a discussion.
Can I preserve the original order of columns without adding a schema ?
Not with **kwargs
. You can use plain tuples
:
df = spark.createDataFrame([(0, 1, 2), (10, 11, 12)], ["c", "b", "a"])
or namedtuple
:
from collections import namedtuple
CBA = namedtuple("CBA", ["c", "b", "a"])
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)])