How to compute summary statistics on a Cassandra table

Posted 2020-05-10 07:38

Question:

I'm trying to get the min, max, and mean of some Cassandra/Spark data, but I need to do it in Java.

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table",  "someTable")
        .option("keyspace", "someKeyspace")
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

EDITED to show the working version: make sure to put quotes around someTable and someKeyspace so they are passed as string literals.

Answer 1:

Just import your data as a DataFrame and apply the required aggregations:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

// Read the Cassandra table into a DataFrame via the spark-cassandra-connector
DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", someTable)
        .option("keyspace", someKeyspace)
        .load();

// Compute the min, max, and mean of valueColumn for each keyColumn group
df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

where someTable and someKeyspace are String variables holding the table name and the keyspace, respectively.
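To see what the groupBy/agg step computes, here is the same per-key min/max/mean sketched in plain Java with streams, with no Spark or Cassandra involved (the class name and the (key, value) sample rows below are made up for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class SummaryStatsSketch {
    public static void main(String[] args) {
        // Hypothetical (keyColumn, valueColumn) rows standing in for the table
        List<Map.Entry<String, Double>> rows = List.of(
                Map.entry("a", 1.0), Map.entry("a", 3.0),
                Map.entry("b", 2.0), Map.entry("b", 4.0));

        // Group by key and summarize the values per group, mirroring
        // df.groupBy(col("keyColumn")).agg(min(...), max(...), avg(...))
        Map<String, DoubleSummaryStatistics> stats = rows.stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summarizingDouble(Map.Entry::getValue)));

        stats.forEach((k, s) -> System.out.println(
                k + " min=" + s.getMin()
                  + " max=" + s.getMax()
                  + " avg=" + s.getAverage()));
    }
}
```

The difference is that Spark evaluates the same aggregation in a distributed fashion across the cluster instead of in local memory.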



Answer 2:

I suggest checking out https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector-demos, which contains demos in both Scala and the equivalent Java.

You can also check out http://spark.apache.org/documentation.html, which has tons of examples that let you flip between the Scala, Java, and Python versions.

I'm almost 100% certain that between those two links, you'll find exactly what you're looking for.

If there's anything you're having trouble with after that, feel free to update your question with a more specific error/problem.



Answer 3:

In general,

compile the Scala file: $ scalac Main.scala

inspect the compiled Main.class file: $ javap Main (note that javap prints the class's Java-level signatures; it does not produce compilable Java source — a decompiler is needed for that)

More info is available at the following URL: http://alvinalexander.com/scala/scala-class-to-decompiled-java-source-code-classes