I am trying to figure out how to test Spark SQL queries against a Cassandra database, much as you would in SQL Server Management Studio. Currently I have to open the Spark console and type Scala commands, which is really tedious and error-prone.
Something like:
scala> val query = csc.sql("select * from users")
scala> query.collect().foreach(println)
Especially with longer queries this can be a real pain.
This seems like a terribly inefficient way to check whether a query is correct and what data it returns. The other issue is that when the query is wrong, you get back a mile-long error message and have to scroll up the console to find it. How do I test my Spark queries without using the console or writing my own application?
You could use bin/spark-sql to avoid constructing a Scala program and just write SQL.
In order to use bin/spark-sql, you may need to rebuild Spark with -Phive and -Phive-thriftserver.
More information is available in Building Spark. Note: do not build against Scala 2.11; the Thrift server dependencies do not seem to be ready for it at the moment.
You can write the SQL in a file, read it into a variable in your test script, and pass it to ssc.sql(file.read()) (the Python way).
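A minimal sketch of that idea, in case it helps. The helper names read_statements and run_sql_file are made up for illustration, and the splitting on ";" is naive (it would break on a semicolon inside a string literal). The run_sql argument is assumed to be whatever callable executes one SQL string, for example the sql method of your context.

```python
def read_statements(path):
    """Return the SQL statements found in a file, split naively on ';'."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Drop empty fragments such as the one after the final ';'.
    return [stmt.strip() for stmt in text.split(";") if stmt.strip()]


def run_sql_file(path, run_sql):
    """Run every statement in the file through run_sql and collect the results.

    run_sql is any callable that takes one SQL string (an assumption here:
    something like ssc.sql from the answer above).
    """
    return [run_sql(stmt) for stmt in read_statements(path)]
```

Assuming ssc is a SQL-capable context as above, usage could look like `for df in run_sql_file("queries.sql", ssc.sql): df.show()`. Keeping the queries in a file means you edit them in a normal editor instead of retyping them at the prompt.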
But it seems you are looking for something else. A testing approach, maybe?
Here is one example:
[donghua@vmxdb01 ~]$ $SPARK_HOME/bin/spark-sql --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11 --conf spark.cassandra.connection.host=127.0.0.1
spark-sql> select * from kv where value > 2;
Error in query: Table or view not found: kv; line 1 pos 14
spark-sql> create TEMPORARY TABLE kv USING org.apache.spark.sql.cassandra OPTIONS (table "kv",keyspace "mykeyspace", cluster "Test Cluster",pushdown "true");
16/10/12 08:28:09 WARN SparkStrategies$DDLStrategy: CREATE TEMPORARY TABLE kv USING... is deprecated, please use CREATE TEMPORARY VIEW viewName USING... instead
Time taken: 4.008 seconds
spark-sql> select * from kv;
key1 1
key4 4
key3 3
key2 2
Time taken: 2.253 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3) from kv;
key
key
key
key
Time taken: 1.328 seconds, Fetched 4 row(s)
spark-sql> select substring(key,1,3),count(*) from kv group by substring(key,1,3);
key 4
Time taken: 3.518 seconds, Fetched 1 row(s)
spark-sql>
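The WARN line in the session above says that CREATE TEMPORARY TABLE ... USING is deprecated in favor of CREATE TEMPORARY VIEW ... USING. If you register several Cassandra tables, a small script can generate the newer form for you. This is only a sketch: the function name cassandra_view_ddl is hypothetical, and the option names (table, keyspace, cluster, pushdown) are simply copied from the session above.

```python
def cassandra_view_ddl(view, table, keyspace, cluster=None, pushdown=True):
    """Build a CREATE TEMPORARY VIEW statement for a Cassandra-backed table.

    Option names mirror the ones used in the spark-sql session above.
    """
    options = {"table": table, "keyspace": keyspace}
    if cluster is not None:
        options["cluster"] = cluster
    options["pushdown"] = "true" if pushdown else "false"
    # Render each option as: name "value"
    rendered = ", ".join('{} "{}"'.format(k, v) for k, v in options.items())
    return (
        "CREATE TEMPORARY VIEW {} USING org.apache.spark.sql.cassandra "
        "OPTIONS ({})".format(view, rendered)
    )


if __name__ == "__main__":
    print(cassandra_view_ddl("kv", "kv", "mykeyspace", cluster="Test Cluster"))
```

The printed statement can be pasted at the spark-sql> prompt in place of the deprecated CREATE TEMPORARY TABLE line.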