In SQL, something like

SELECT count(id), sum(if(column1 = 1, 1, 0)) FROM groupedTable

can compute the total number of records as well as a filtered count in a single pass.
How can I do this with the Spark DataFrame API, i.e. without joining one of the counts back to the original DataFrame?
Just use count for both cases:

df.select(count($"id"), count(when($"column1" === 1, true)))
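A minimal runnable sketch of this approach (the SparkSession setup and toy data are assumptions for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, when}

val spark = SparkSession.builder().appName("conditional-count").master("local[*]").getOrCreate()
import spark.implicits._

// toy data, assumed for illustration
val df = Seq((1, 1), (2, 0), (3, 1)).toDF("id", "column1")

df.select(
  count($"id").as("total"),                          // counts every row
  count(when($"column1" === 1, true)).as("matched")  // when(...) without otherwise is null when the condition fails; count skips nulls
).show()
// total = 3, matched = 2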
If the column is nullable, you should correct for that (for example with coalesce or IS NULL, depending on the desired output).
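For instance, continuing the sketch above with toy data where column1 is genuinely nullable (Option[Int]), coalesce makes the null handling explicit:

import org.apache.spark.sql.functions.{coalesce, lit, sum}

val nullable = Seq((1, Some(1)), (2, Some(0)), (3, None)).toDF("id", "column1")

nullable.select(
  count($"id").as("total"),
  // when(...) is null both for non-matching rows and for null input; coalesce folds both cases to 0
  sum(coalesce(when($"column1" === 1, 1), lit(0))).as("matched")
).show()
// total = 3, matched = 1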
You can try using Spark with Hive, since Hive supports the SUM(IF(...)) construct from SQL.
First, create a Hive table on top of your data using the code below:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Hive_Test")
val sc = new SparkContext(conf)
// Creation of the Hive context
val hsc = new HiveContext(sc)

hsc.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
hsc.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/test.txt' INTO TABLE emp")
hsc.sql("""select count(id), SUM(v)
from (
select id, IF(name=1, count(*), 0) AS v
from emp
where id>0
group by id,name
) t2""")
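If you are on Spark 2.x or later, note that HiveContext is deprecated in favor of SparkSession with Hive support enabled; an equivalent setup sketch under that assumption:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive_Test")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING)")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/test.txt' INTO TABLE emp")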