I am using Zeppelin 0.5.5. I found this code/sample for Python here, since I couldn't get my own to work with %pyspark: http://www.makedatauseful.com/python-spark-sql-zeppelin-tutorial/. I have a feeling his %pyspark example worked because, if you use the original %spark Zeppelin tutorial, the "bank" table is already created.
This code is in a notebook.
%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# sqlContext = SQLContext(sc) # Removed with latest version I tested
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")
bankSchema = StructType([
    StructField("age", IntegerType(), False),
    StructField("job", StringType(), False),
    StructField("marital", StringType(), False),
    StructField("education", StringType(), False),
    StructField("balance", IntegerType(), False)])
# Split on ";", drop the header row, strip quotes, and take balance from column 5.
bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5])))
bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerAsTable("bank")
This code is in the same notebook but in a different paragraph.
%sql
SELECT count(1) FROM bank
org.apache.spark.sql.AnalysisException: no such table bank; line 1 pos 21
...
I found the cause of this issue. Prior to Zeppelin 0.6.0, the SQLContext variable in %pyspark is named sqlc, not sqlContext.
Defect can be found here: https://issues.apache.org/jira/browse/ZEPPELIN-134
In Pyspark, the SQLContext is currently available in the variable name sqlc. This is inconsistent with the documentation and with the variable name in Scala, which is sqlContext.
sqlContext can be used as a variable for the SQLContext, in addition to sqlc (for backward compatibility)
Related code:
https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/resources/python/zeppelin_pyspark.py#L66
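If you are unsure which binding your Zeppelin build injects, a quick check from a %pyspark paragraph settles it. This is just a sketch; the only assumption is the two candidate names from the JIRA above:
%pyspark
# Sketch: report which SQLContext name this Zeppelin build injected.
# On 0.5.x only sqlc should resolve; on 0.6.0+ both names are bound.
for name in ("sqlc", "sqlContext"):
    print("%s -> %s" % (name, type(globals().get(name)).__name__))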
The suggested workaround is simply to add the following at the top of your %pyspark script:
sqlContext = sqlc
Found here:
https://mail-archives.apache.org/mod_mbox/incubator-zeppelin-users/201506.mbox/%3CCALf24sazkTxVd3EpLKTWo7yfE4NvW032j346N+6AuB7KKZS_AQ@mail.gmail.com%3E
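With the alias in place, the rest of the original paragraph runs unchanged; only the lines that touch the context differ. A sketch, continuing from the bank RDD and bankSchema defined in the question's paragraph (registerTempTable is the non-deprecated Spark 1.3+ name, swapped in here for the deprecated registerAsTable):
%pyspark
sqlContext = sqlc  # alias the 0.5.x binding to the documented name
bankdf = sqlContext.createDataFrame(bank, bankSchema)
bankdf.registerTempTable("bank")  # Spark 1.3+ replacement for registerAsTable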
Instead of sqlContext, use sqlc, and replace registerAsTable with sqlc.registerDataFrameAsTable:
%pyspark
from os import getcwd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome + "/data/bank-full.csv")
bankSchema = StructType([
    StructField("age", IntegerType(), False),
    StructField("job", StringType(), False),
    StructField("marital", StringType(), False),
    StructField("education", StringType(), False),
    StructField("balance", IntegerType(), False)])
bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s: (int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5])))
# sqlc is the SQLContext variable Zeppelin 0.5.x actually injects.
bankdf = sqlc.createDataFrame(bank, bankSchema)
sqlc.registerDataFrameAsTable(bankdf, "bank")
%sql
SELECT count(1) FROM bank
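To confirm the registration before switching to the %sql paragraph, the same check can be run through sqlc directly (SQLContext.tableNames() and SQLContext.sql() are both available from Spark 1.3 on):
%pyspark
# Sanity check: "bank" should be listed, and the count should match the %sql result.
print(sqlc.tableNames())
print(sqlc.sql("SELECT count(1) FROM bank").collect())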