Why is parquet slower for me against text file for

2019-04-30 04:38发布

问题:

OK! So I decided to use Parquet as storage format for hive tables and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests as against the general notion that it is faster then plain text files.

Please be noted that I am using Hive-0.13 on MapR

Follows the flow of my operations

Table A

Format - Text Format

Table size - 2.5 Gb

Table B

Format - Parquet

Table size - 1.9 Gb

[Create table B stored as parquet as select * from A]

Table C

Format - Parquet with snappy compression

Table size - 1.9 Gb

[Create table C stored as parquet tblproperties ("parquet.compression"="SNAPPY") as select * from A]

Now I ran some tests on above mentioned tables and follows the details.

  • Row count operation

Table A

Map - 15

Reduce - 1

Cumulative CPU - 123.33 sec

Time taken - 59.057 seconds

Table B

Map - 8

Reduce - 1

Cumulative CPU - 204.92 sec

Time taken - 50.33 seconds

  • Single Row Selection

Table A

Map - 15

Reduce - 0

Cumulative CPU - 51.18 sec

Time taken - 25.296 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 117.08 sec

Time taken - 27.448 seconds

  • Multi Row Selection Using Where clause [1000 rows fetched]

Table A

Map - 15

Reduce - 0

Cumulative CPU - 57.55 sec

Time taken - 20.254 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 113.97 sec

Time taken - 27.678 seconds

  • Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]

Table A

Map - 15

Reduce - 0

Cumulative CPU - 57.55 sec

Time taken - 20.254 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 113.97 sec

Time taken - 27.678 seconds

  • Aggregation operation [Using sum function on a given column]

Table A

Map - 15

Reduce - 1

Cumulative CPU - 127.85 sec

Time taken - 29.68 seconds

Table B

Map - 8

Reduce - 1

Cumulative CPU - 255.2 sec

Time taken - 41.025 seconds

You can see that in almost all the operations that I have applied on both the tables, Parquet is lagging behind in terms of time taken to execute the query with an exception of row count operation.

I also used table C to perform the aforementioned operations but the results were almost on similar lines with TextFile format again was snappier of the two.

Can some one please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. Follows the details.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec

Does that mean ORC is faster then Parquet? Or there is something that I can do to make it work better with query response time and compression ratio?

Thanks!