I was just wondering what people's thoughts were on reading from Hive vs. reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table that has the same file format, would you rather read from a Hive table or from the underlying file itself, and why?
Mike
tl;dr: I would read it straight from the Parquet files.
I am using Spark 1.5.2 and Hive 1.2.1. For a 5 million row x 100 column table, some timings I've recorded are
Note that these were measured with older versions of both Hive and Spark, so I can't comment on how the relative speed of the two reading mechanisms may have changed since then.
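For anyone who wants to repeat this kind of comparison on their own data, here is a minimal sketch of a timing harness. It is not the code behind the timings above. It assumes the newer SparkSession API (Spark 2.x and later; on 1.5.2 the equivalent entry point would be a HiveContext), and the function names `time_action`, `compare_reads` and `make_session` are made up for illustration.

```python
import time
from typing import Callable, Dict


def time_action(action: Callable[[], object], repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs of `action`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best


def compare_reads(spark, table_name: str, parquet_path: str,
                  repeats: int = 3) -> Dict[str, float]:
    """Time a row count via the Hive table and via the Parquet files directly.

    `table_name` and `parquet_path` are assumed to point at the same data,
    i.e. an external Hive table stored as Parquet and its underlying directory.
    """
    return {
        # Goes through the Hive metastore to resolve the table.
        "hive_table": time_action(
            lambda: spark.table(table_name).count(), repeats),
        # Bypasses the metastore and reads the files at the given path.
        "parquet_files": time_action(
            lambda: spark.read.parquet(parquet_path).count(), repeats),
    }


def make_session():
    """Build a Hive-enabled SparkSession (requires pyspark to be installed)."""
    # Imported lazily so the helpers above can be loaded without Spark.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("hive-vs-parquet")
            .enableHiveSupport()
            .getOrCreate())
```

A call would look like `compare_reads(make_session(), "mydb.events", "hdfs:///warehouse/mydb.db/events")`, where both names are placeholders. Keep in mind that a plain `count()` may not need to read every column, so for a fairer test you may want to swap in an action that touches all of the data.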
From what I understand, even though ORC is in general better suited to flat structures and Parquet to nested ones, Spark is optimised towards Parquet. Therefore, it is advised to use that format with Spark.
Furthermore, metadata for all the tables you read from Parquet will be stored in Hive anyway. This is from the Spark doc:
"Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata."
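The manual refresh mentioned in that quote can be sketched as follows. This assumes the SparkSession API, where the call is `spark.catalog.refreshTable`; as far as I know, on Spark 1.x the equivalent was `refreshTable` on the HiveContext. The helper name `read_fresh` is made up for illustration.

```python
def read_fresh(spark, table_name: str):
    """Refresh Spark's cached metadata for a table, then return it as a DataFrame.

    Intended for tables that Hive or another external tool may have
    rewritten since Spark last read them.
    """
    # Invalidate the cached metadata so newly written or removed files are seen.
    spark.catalog.refreshTable(table_name)
    return spark.table(table_name)
```

The same thing can be done in SQL with `REFRESH TABLE <name>`. The conversion the doc refers to is controlled by the `spark.sql.hive.convertMetastoreParquet` setting.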
I tend to transform data into Parquet format as soon as possible and store it in Alluxio backed by HDFS. This gives me better performance for read/write operations and limits how much I need to use cache.
I hope it helps.
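As a rough sketch of that workflow, assuming a CSV source, the SparkSession API, and an Alluxio client jar already on the Spark classpath: the host name, the paths, and the helper names `alluxio_uri` and `csv_to_parquet` are placeholders, and 19998 is assumed to be the Alluxio master port.

```python
def alluxio_uri(host: str, port: int, path: str) -> str:
    """Build an alluxio:// URI from a master host, port and path."""
    return f"alluxio://{host}:{port}/{path.lstrip('/')}"


def csv_to_parquet(spark, source_path: str, target_uri: str) -> str:
    """Read a CSV file and rewrite it as Parquet at `target_uri`."""
    # Schema inference costs an extra pass over the data; pass an explicit
    # schema instead if you already know the column types.
    df = spark.read.csv(source_path, header=True, inferSchema=True)
    # Overwrite mode replaces any existing data at the target location.
    df.write.mode("overwrite").parquet(target_uri)
    return target_uri
```

For example, `csv_to_parquet(spark, "hdfs:///raw/events.csv", alluxio_uri("alluxio-master", 19998, "/data/events"))` would write the Parquet files to Alluxio, and later jobs could read them back with `spark.read.parquet` on the same URI.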