I have the following DataFrame in PySpark (this is the result of take(3); the DataFrame itself is very big):
sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
The same owner will have more rows. What I need to do is sum the values of the field a_d per owner after grouping, like this:
b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))
but this throws an error:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, the schema contains double values, not strings (this is the output of printSchema()):
root
|-- owner: string (nullable = true)
|-- a_d: double (nullable = true)
So what is happening here?
You are not using the correct sum function, but Python's built-in function sum (which is what the bare name resolves to by default).
The built-in function won't work here because it expects an iterable of numbers as its argument, whereas what gets passed is the column name as a string, and the built-in sum can't be applied to a string. Ref. Python Official Documentation.
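You can reproduce the question's error without Spark at all; this is a minimal sketch showing Python's built-in sum choking on a column name passed as a string:

# The built-in sum starts from 0 and iterates over the string's characters,
# so it attempts 0 + 'a' and raises the same TypeError as in the question.
sum('a_d')
# TypeError: unsupported operand type(s) for +: 'int' and 'str'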
You'll need to import the proper function from pyspark.sql.functions:
from pyspark.sql import Row, SQLContext
from pyspark.sql.functions import sum as _sum  # alias to avoid shadowing the built-in sum

sqlContext = SQLContext(sc)  # reuse the SparkContext created above

df = sqlContext.createDataFrame(
    [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)
df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()
# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# | u1| 0.4|
# | u2| 0.0|
# +-----+-------+
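A common alternative that avoids shadowing the built-in sum altogether is to import the functions module under an alias and reference the aggregate through it (df here is the same DataFrame created above):

import pyspark.sql.functions as F

# F.sum refers unambiguously to the Spark column aggregate, never to the built-in.
df2 = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))
df2.show()  # same result as above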