In SQL, something like

SELECT count(id), sum(if(column1 = 1, 1, 0)) FROM groupedTable

can compute the total number of records as well as a filtered count in a single pass.
How can I do this with the Spark DataFrame API, i.e. without joining one of the counts back to the original DataFrame?
Just use count for both cases:

df.select(count($"id"), count(when($"column1" === 1, true)))
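A minimal runnable sketch of this approach (the SparkSession setup and toy data are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, when}

val spark = SparkSession.builder().appName("conditional-count").master("local[*]").getOrCreate()
import spark.implicits._

// toy data, assumed for illustration
val df = Seq((1, 1), (2, 0), (3, 1)).toDF("id", "column1")

df.select(
  count($"id").as("total"),                          // counts every row
  count(when($"column1" === 1, true)).as("matched")  // when(...) without otherwise is null when the condition fails; count skips nulls
).show()
// total = 3, matched = 2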
If the column is nullable, you should correct for that (for example with coalesce or IS NULL, depending on the desired output).
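For instance, continuing the sketch above with toy data where column1 is genuinely nullable (Option[Int]), coalesce makes the null handling explicit:

import org.apache.spark.sql.functions.{coalesce, lit, sum}

val nullable = Seq((1, Some(1)), (2, Some(0)), (3, None)).toDF("id", "column1")

nullable.select(
  count($"id").as("total"),
  // when(...) is null both for non-matching rows and for null input; coalesce folds both cases to 0
  sum(coalesce(when($"column1" === 1, 1), lit(0))).as("matched")
).show()
// total = 3, matched = 1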
You can try using Spark with Hive, since Hive supports the SUM(IF(...)) construct from SQL.
First, create a Hive table on top of your data using the code below:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Hive_Test")
val sc = new SparkContext(conf)
// Creation of the Hive context
val hsc = new HiveContext(sc)

hsc.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
hsc.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/test.txt' INTO TABLE emp")
hsc.sql("""select count(id), SUM(v)
from (
select id, IF(name=1, count(*), 0) AS v
from emp
where id>0
group by id,name
) t2""")
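If you are on Spark 2.x or later, note that HiveContext is deprecated in favor of SparkSession with Hive support enabled; an equivalent setup sketch under that assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive_Test")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/test.txt' INTO TABLE emp")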