I've started using pyspark in one of my projects. While testing different commands to explore the library's functionality, I found something I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

sc = SparkContext()           # create the Spark context
hc = HiveContext(sc)          # Hive-aware SQL context
hc.sql("use test_schema")     # switch to the schema containing the table
hc.table("diamonds").count()  # count the rows of the Hive table
The last count() operation returns 53941 records. If I instead run a select count(*) from diamonds in Hive, I get 53940.
Is the pyspark count including the header?
I've tried to look into:
df = hc.sql("select * from diamonds").collect()  # collect() returns a list of Row objects
df[0]
df[1]
to see if the header was included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
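A quick check I could also run (just a sketch, assuming the column names shown in the output above) is to count rows that literally contain a column name as their value:
hc.sql("select count(*) from diamonds where cut = 'cut'").show()  # rows that look like a stray header stored as data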
Does anyone have an explanation for this?
Thanks! Ale
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
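One way to check (a sketch; it assumes the hive.compute.query.using.stats property is available in your Hive version) is to stop Hive from answering the count out of the stored statistics and re-run the query:
-- force Hive to actually scan the table instead of answering from stored stats
set hive.compute.query.using.stats=false;
select count(*) from diamonds;
If the two counts now agree, stale statistics were the cause.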
Alternatively, refresh the statistics. If your table is not partitioned:
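For example, assuming the diamonds table from the question:
analyze table diamonds compute statistics;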
If it is partitioned:
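Here partition_col is just a placeholder for your actual partition column(s):
-- partition_col is a placeholder; substitute your real partition column(s)
analyze table diamonds partition(partition_col) compute statistics;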
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.