How would you go about employing and/or implementing a case class equivalent in PySpark?
If you go to the sql-programming-guide, in the Inferring the Schema Using Reflection section, you will see a case class being defined, with an example. In the same section, if you switch to Python, i.e. PySpark, you will see Row being used instead, with an example along the lines of the sketch below. So the conclusion of the explanation is that Row can be used as the case class equivalent in PySpark.
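A minimal sketch of that Row-based reflection approach, assuming a local SparkSession; the data and column names are illustrative, not taken from the guide:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Stand-in for lines read from a text file.
lines = sc.parallelize(["Alice,30", "Bob,25"])
people = lines.map(lambda l: l.split(",")).map(
    lambda p: Row(name=p[0], age=int(p[1]))
)

# The schema (field names and types) is inferred from the Row objects.
df = spark.createDataFrame(people)
df.printSchema()
df.show()
```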
As mentioned by Alex Hall, a real equivalent of a named product type is a namedtuple.

Unlike Row, suggested in the other answer, it has a number of useful properties:

It has a well-defined shape and can be reliably used for structural pattern matching:
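For example (a minimal sketch; the Record name is made up for illustration):

```python
from collections import namedtuple

Record = namedtuple("Record", ["x", "y"])

# Unpacking follows the declared field order, so this is reliable:
(x1, x2), (y1, y2) = Record(1, 2), Record(3, 4)
assert (x1, x2, y1, y2) == (1, 2, 3, 4)
```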
In contrast, Rows are not reliable when used with keyword arguments, because the resulting field order is not guaranteed to match the call order (older Spark versions sort keyword-defined fields alphabetically); if defined with positional arguments, the order is preserved. A sketch of the difference follows.
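The keyword-argument behavior depends on the Spark version, so treat this as illustrative rather than definitive:

```python
from pyspark.sql import Row

# Keyword arguments: in Spark < 3.0 the fields are sorted alphabetically,
# so the resulting order may not match the call order.
r = Row(y=2, x=1)
print(r)  # may print Row(x=1, y=2) depending on the Spark version

# Positional arguments with an explicit field list preserve the order:
MyRow = Row("x", "y")
(x1, x2), (y1, y2) = MyRow(1, 2), MyRow(3, 4)
assert (x1, x2, y1, y2) == (1, 2, 3, 4)
```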
namedtuples define proper types and can be used whenever type handling is required, especially with single and multiple dispatch (a sketch follows), and combined with the first property they can be used in a wide range of pattern-matching scenarios.
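A sketch using the standard library's functools.singledispatch; the record names are invented for illustration. Multiple dispatch with third-party packages works analogously, because each namedtuple is a distinct class:

```python
from collections import namedtuple
from functools import singledispatch

FooRecord = namedtuple("FooRecord", ["x", "y"])
BarRecord = namedtuple("BarRecord", ["x", "y"])

@singledispatch
def describe(record):
    return "unknown record"

@describe.register(FooRecord)
def _(record):
    return f"foo with x={record.x}"

@describe.register(BarRecord)
def _(record):
    return f"bar with y={record.y}"

print(describe(FooRecord(1, 2)))  # foo with x=1
print(describe(BarRecord(3, 4)))  # bar with y=4
```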
namedtuples also support standard inheritance and type hints; Rows don't. A minimal sketch of both follows.
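This sketch uses typing.NamedTuple for the type-hinted variant; the class names are illustrative:

```python
from collections import namedtuple
from typing import NamedTuple

# Standard inheritance: subclass a namedtuple to add behaviour.
_PointBase = namedtuple("_PointBase", ["x", "y"])

class Point(_PointBase):
    def norm(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5

# Type hints: typing.NamedTuple carries per-field annotations.
class TypedPoint(NamedTuple):
    x: float
    y: float

print(Point(3, 4).norm())          # 5.0
print(TypedPoint.__annotations__)  # {'x': <class 'float'>, 'y': <class 'float'>}
```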
namedtuples also provide a highly optimized representation. Unlike Row objects, tuples don't use __dict__ or carry field names with each instance. As a result they can be an order of magnitude faster to initialize than the various Row constructors, and they are significantly more memory efficient (a very important property when working with large-scale data) than an equivalent Row. Finally, attribute access is an order of magnitude faster with a namedtuple than the equivalent operation on a Row object. A hedged benchmark sketch of all three comparisons follows.
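The sketch below uses timeit and sys.getsizeof; the absolute numbers depend on the Python and Spark versions, so only the relative differences are the point:

```python
import sys
import timeit
from collections import namedtuple
from pyspark.sql import Row

MyTuple = namedtuple("MyTuple", ["x", "y"])
MyRow = Row("x", "y")

# Initialization cost.
print(timeit.timeit(lambda: MyTuple(1, 2), number=100_000))
print(timeit.timeit(lambda: MyRow(1, 2), number=100_000))
print(timeit.timeit(lambda: Row(x=1, y=2), number=100_000))

# Memory footprint of a single instance.
print(sys.getsizeof(MyTuple(1, 2)))
print(sys.getsizeof(Row(x=1, y=2)))

# Attribute access cost.
nt, row = MyTuple(1, 2), Row(x=1, y=2)
print(timeit.timeit(lambda: nt.x, number=100_000))
print(timeit.timeit(lambda: row.x, number=100_000))
```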
Last but not least, namedtuples are properly supported in Spark SQL, as the sketch below shows.
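A sketch assuming a running SparkSession; the MyTuple name is illustrative:

```python
from collections import namedtuple
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
MyTuple = namedtuple("MyTuple", ["x", "y"])

# Column names and (sampled) types are picked up directly from the namedtuple.
df = spark.createDataFrame([MyTuple(1, 2.0), MyTuple(3, 4.0)])
df.printSchema()
df.show()
```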
Summary:

It should be clear that Row is a very poor substitute for an actual product type, and should be avoided unless enforced by the Spark API. It should also be clear that pyspark.sql.Row is not intended to be a replacement for a case class when you consider that it is the direct equivalent of org.apache.spark.sql.Row, a type which is pretty far from an actual product and behaves like Seq[Any] (depending on the subclass, with names added). Both the Python and Scala implementations were introduced as a useful, albeit awkward, interface between external code and the internal Spark SQL representation.

See also:
It would be a shame not to mention the awesome MacroPy developed by Li Haoyi and its port (MacroPy3) by Alberto Berti, which provide case classes via macros and come with a rich set of other features including, but not limited to, advanced pattern matching and a neat lambda expression syntax.
Python dataclasses (Python 3.7+), as sketched below.
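For completeness, a minimal dataclasses sketch (the Point name is illustrative); a frozen dataclass is the closest standard-library analogue to a case class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: int
    y: int

# __init__, __repr__, and __eq__ are generated, and frozen=True makes instances immutable.
p = Point(1, 2)
print(p)                 # Point(x=1, y=2)
print(p == Point(1, 2))  # True
```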