I have the following DataFrame:
val transactions_with_counts = sqlContext.sql(
"""SELECT user_id AS user_id, category_id AS category_id,
COUNT(category_id) FROM transactions GROUP BY user_id, category_id""")
I'm trying to convert the rows to Rating objects, but since x(0) returns Any, this fails:
val ratings = transactions_with_counts
.map(x => Rating(x(0).toInt, x(1).toInt, x(2).toInt))
error: value toInt is not a member of Any
Using Datasets, you can define a Rating class as follows:
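A minimal sketch of that case class, assuming the grouped column keeps Spark's generated name count:

case class Rating(user_id: Int, category_id: Int, count: Long)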
The Rating class here has the column name 'count' instead of the 'rating' that zero323 suggested. Thus the rating variable is assigned as follows:
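A sketch under that assumption; transactions here stands for the source DataFrame from the question, and the Dataset conversion needs the SQL implicits in scope for the Encoder:

import sqlContext.implicits._  // spark.implicits._ on Spark 2.x

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count()  // produces a column literally named "count"

val rating = transactions_with_counts.as[Rating]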
This way you will not run into runtime errors in Spark, because your Rating class's column name is identical to the 'count' column name that Spark generates at runtime.
To access the values of a DataFrame row, you can use rdd.collect on the DataFrame with a for loop. Consider the DataFrame from the question, with the user_id, category_id, and count columns.

Use rdd.collect on top of your DataFrame. The row variable will contain each row of the DataFrame as an rdd Row type. To get each element from a row, use row.mkString(","), which gives the values of the row as comma-separated values; using the split function (a built-in function), you can then access each column value of the rdd row by index, as in the sketch below.
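A minimal sketch of that loop, assuming df is the transactions_with_counts DataFrame from the question:

for (row <- df.rdd.collect) {
  // collect brings every row to the driver; row.mkString(",") renders
  // the Row as "user_id,category_id,count", and split(",") indexes into it
  val user_id = row.mkString(",").split(",")(0)
  val category_id = row.mkString(",").split(",")(1)
  val rating = row.mkString(",").split(",")(2)
}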
This code looks a little bigger compared with dataframe.foreach loops, but you get more control over your logic by using it.

Let's start with some dummy data:
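A sketch of such data, assuming a spark-shell session with the SQL implicits in scope:

val transactions = Seq((1, 2), (1, 4), (2, 3)).toDF("user_id", "category_id")

val transactions_with_counts = transactions
  .groupBy($"user_id", $"category_id")
  .count()

transactions_with_counts.printSchema
// root
//  |-- user_id: integer (nullable = false)
//  |-- category_id: integer (nullable = false)
//  |-- count: long (nullable = false)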
There are a few ways to access Row values and keep the expected types:

Pattern matching:
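A minimal sketch, assuming Rating is MLlib's Rating(user: Int, product: Int, rating: Double) from the question; the matched Long widens to Double, and .rdd keeps this a plain RDD map so no Encoder is needed:

import org.apache.spark.mllib.recommendation.Rating  // assumed from the question
import org.apache.spark.sql.Row

transactions_with_counts.rdd.map {
  case Row(user_id: Int, category_id: Int, rating: Long) =>
    Rating(user_id, category_id, rating)
}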
Typed get* methods like getInt, getLong:
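A sketch using the column positions from the schema above:

transactions_with_counts.rdd.map(r =>
  // positions: 0 = user_id, 1 = category_id, 2 = count
  Rating(r.getInt(0), r.getInt(1), r.getLong(2))
)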
The getAs method, which can use both names and indices. It can be used to properly extract user-defined types, including mllib.linalg.Vector, and obviously accessing by name requires a schema:
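A sketch mixing column names and a positional index:

transactions_with_counts.rdd.map(r => Rating(
  r.getAs[Int]("user_id"), r.getAs[Int]("category_id"), r.getAs[Long](2)
))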
Converting to a statically typed Dataset (Spark 1.6+ / 2.0+):
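A minimal sketch; MovieRatings is a hypothetical case class whose field names match the DataFrame columns (hence count rather than rating, as the earlier answer points out), and the conversion needs the SQL implicits in scope for the Encoder:

case class MovieRatings(user_id: Int, category_id: Int, count: Long)

transactions_with_counts.as[MovieRatings]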