I tried a simple example like:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("2014 Population estimate", "2015 median sales price").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
It works well, but when I try something very similar:
data = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load('/mnt/%s/OnlineNewsTrainingAndValidation.csv' % MOUNT_NAME)
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
data = data.select("timedelta", "shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
display(data)
It raises an error:
AnalysisException: u"cannot resolve 'timedelta' given input columns: [ data_channel_is_tech,...
Of course I imported LabeledPoint and LinearRegression.
What could be wrong?
Even the simpler case
df_cleaned = df_cleaned.select("shares")
raises the same AnalysisException.
Please note: df_cleaned.printSchema() works fine.
I found the issue: some of the column names contain whitespace before the name itself.
So
data = data.select(" timedelta", " shares").map(lambda r: LabeledPoint(r[1], [r[0]])).toDF()
worked.
I could detect the whitespace using
assert " " not in ''.join(df.columns)
Now I am looking for a way to remove the whitespace. Any idea is much appreciated!
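A sketch of one way to do it, using toDF to rename every column to its stripped name:
# Strip leading/trailing whitespace from every column name
df = df.toDF(*[c.strip() for c in df.columns])
# After this, df.select("timedelta", "shares") resolves as expected.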
Because the header contains spaces or tabs, remove them and try again:
1) My example script:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Read a caret-separated file with the first line as the header
df = spark.read.csv(r'test.csv', header=True, sep='^')
print("#################################################################")
df.printSchema()  # printSchema() prints the schema itself and returns None
df.createOrReplaceTempView("test")
re = spark.sql("select max_seq from test")
re.show()  # show() prints the rows itself and returns None
print("################################################################")
2) Input file; here 'max_seq ' contains a trailing space, so we get the exception below:
Trx_ID^max_seq ^Trx_Type^Trx_Record_Type^Trx_Date
Traceback (most recent call last):
File "D:/spark-2.1.0-bin-hadoop2.7/bin/test.py", line 14, in <module>
re=spark.sql("select max_seq from test")
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\session.py", line 541, in sql
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "D:\spark-2.1.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u"cannot resolve '`max_seq`' given input columns: [Venue_City_Name, Trx_Type, Trx_Booking_Status_Committed, Payment_Reference1, Trx_Date, max_seq , Event_ItemVariable_Name, Amount_CurrentPrice, cinema_screen_count, Payment_IsMyPayment, r
3) Remove the space after the 'max_seq' column and it works fine:
Trx_ID^max_seq^Trx_Type^Trx_Record_Type^Trx_Date
17/03/20 12:16:25 INFO DAGScheduler: Job 3 finished: showString at <unknown>:0, took 0.047602 s
17/03/20 12:16:25 INFO CodeGenerator: Code generated in 8.494073 ms
+-------+
|max_seq|
+-------+
|     10|
|     23|
|     22|
|     22|
...
only showing top 20 rows
################################################################
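If editing the file is not an option, Spark SQL can also reference the exact name, trailing space included, by quoting it with backticks (a sketch):
re = spark.sql("select `max_seq ` as max_seq from test")  # backticks preserve the space; alias to a clean name
re.show()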
There were tabs in my input file, and removing the tabs and spaces from the header fixed the error.
My example:
saledf = spark.read.csv("SalesLTProduct.txt", header=True, inferSchema=True, sep='\t')
saledf.printSchema()
root
|-- ProductID: string (nullable = true)
|-- Name: string (nullable = true)
|-- ProductNumber: string (nullable = true)
saledf.describe('ProductNumber').show()
+-------+-------------+
|summary|ProductNumber|
+-------+-------------+
| count| 295|
| mean| null|
| stddev| null|
| min| BB-7421|
| max| WB-H098|
+-------+-------------+
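If you would rather not edit the source file, stripping the header programmatically after the read should work as well (a sketch using the saledf DataFrame above):
# Rename each column to its stripped name, removing stray tabs and spaces
for c in saledf.columns:
    saledf = saledf.withColumnRenamed(c, c.strip())
saledf.describe('ProductNumber').show()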