OK! So I decided to use Parquet as storage format for hive tables and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests as against the general notion that it is faster then plain text files.
Please be noted that I am using Hive-0.13 on MapR
Follows the flow of my operations
Table A
Format - Text Format
Table size - 2.5 Gb
Table B
Format - Parquet
Table size - 1.9 Gb
[Create table B stored as parquet as select * from A]
Table C
Format - Parquet with snappy compression
Table size - 1.9 Gb
[Create table C stored as parquet tblproperties ("parquet.compression"="SNAPPY") as select * from A]
Now I ran some tests on above mentioned tables and follows the details.
- Row count operation
Table A
Map - 15
Reduce - 1
Cumulative CPU - 123.33 sec
Time taken - 59.057 seconds
Table B
Map - 8
Reduce - 1
Cumulative CPU - 204.92 sec
Time taken - 50.33 seconds
- Single Row Selection
Table A
Map - 15
Reduce - 0
Cumulative CPU - 51.18 sec
Time taken - 25.296 seconds
Table B
Map - 8
Reduce - 0
Cumulative CPU - 117.08 sec
Time taken - 27.448 seconds
- Multi Row Selection Using Where clause [1000 rows fetched]
Table A
Map - 15
Reduce - 0
Cumulative CPU - 57.55 sec
Time taken - 20.254 seconds
Table B
Map - 8
Reduce - 0
Cumulative CPU - 113.97 sec
Time taken - 27.678 seconds
- Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]
Table A
Map - 15
Reduce - 0
Cumulative CPU - 57.55 sec
Time taken - 20.254 seconds
Table B
Map - 8
Reduce - 0
Cumulative CPU - 113.97 sec
Time taken - 27.678 seconds
- Aggregation operation [Using sum function on a given column]
Table A
Map - 15
Reduce - 1
Cumulative CPU - 127.85 sec
Time taken - 29.68 seconds
Table B
Map - 8
Reduce - 1
Cumulative CPU - 255.2 sec
Time taken - 41.025 seconds
You can see that in almost all the operations that I have applied on both the tables, Parquet is lagging behind in terms of time taken to execute the query with an exception of row count operation.
I also used table C to perform the aforementioned operations but the results were almost on similar lines with TextFile format again was snappier of the two.
Can some one please let me know what I am doing wrong?
Thanks!
EDIT
I added ORC to the list of storage formats and ran the tests again. Follows the details.
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
Does that mean ORC is faster then Parquet? Or there is something that I can do to make it work better with query response time and compression ratio?
Thanks!