I have a SparkSQL DataFrame like this:
name  gender  age  isActive  points
-----------------------------------
Bob   M       12   true      100
Hal   M       16   false     80
Pat   F       21   true      70
Lin   F       17   false     40
Zac   M       18   true      20
Mei   F       19   true      10
Sal   M       13   false     10
I have a few functions like so:
def isEligible(prog: String) (name: String, gender: String, age: Int, isActive: Boolean, points: Int): Boolean
Each one determines whether a person is eligible for a particular program. For instance, the following call would tell me whether Bob is eligible for Program1:
isEligible("Program1")("Bob", "M", 12, true, 100)
A person may be eligible for more than one program. I want to write a function that takes this DataFrame, and outputs a summary DataFrame like so:
prog1  prog2  prog3  prog4
--------------------------
7      3      2      5
which shows the number of people who are eligible for each program. What is the best way to do this in Spark? I know I can use the struct and agg functions, but I don't know how to incorporate my isEligible function into the SparkSQL query.
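For reference, one possible direction is to wrap the partially applied isEligible in a UDF and aggregate with sum(when(...)). This is only a rough sketch under assumptions, not a verified solution: the body of isEligible below is a hypothetical placeholder (the real rules are not shown here), summarize and the program names are made-up names for illustration, and the age and points columns are assumed to be integer-typed.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum, udf, when}

// Hypothetical placeholder rules; the real eligibility logic is not given.
def isEligible(prog: String)(name: String, gender: String, age: Int,
                             isActive: Boolean, points: Int): Boolean =
  prog match {
    case "prog1" => isActive   // assumed rule, for illustration only
    case "prog2" => age >= 18  // assumed rule, for illustration only
    case _       => false
  }

// Builds one aggregate column per program and returns a single-row DataFrame
// whose columns are named after the programs.
def summarize(df: DataFrame, programs: Seq[String]): DataFrame = {
  val counts = programs.map { prog =>
    // Fix the program name, then lift the remaining 5-argument function to a UDF.
    val eligible = udf(isEligible(prog) _)
    val flag = eligible(col("name"), col("gender"), col("age"),
                        col("isActive"), col("points"))
    // Count the rows for which the UDF returns true.
    sum(when(flag, 1).otherwise(0)).as(prog)
  }
  df.agg(counts.head, counts.tail: _*)
}
```

Calling summarize(df, Seq("prog1", "prog2")) should then give one row with one count column per program. Since a UDF is opaque to the optimizer, rewriting the rules as native column expressions may perform better if the logic is simple enough.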