Why is parquet slower for me against text file for

OK! So I decided to use Parquet as storage format for hive tables and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests as against the general notion that it is faster then plain text files.

Please be noted that I am using Hive-0.13 on MapR

Follows the flow of my operations

Table A

Format - Text Format

Table size - 2.5 Gb

Table B

Format - Parquet

Table size - 1.9 Gb

[Create table B stored as parquet as select * from A]

Table C

Format - Parquet with snappy compression

Table size - 1.9 Gb

[Create table C stored as parquet tblproperties ("parquet.compression"="SNAPPY") as select * from A]

Now I ran some tests on above mentioned tables and follows the details.

Row count operation

Table A

Map - 15

Reduce - 1

Cumulative CPU - 123.33 sec

Time taken - 59.057 seconds

Table B

Map - 8

Reduce - 1

Cumulative CPU - 204.92 sec

Time taken - 50.33 seconds

Single Row Selection

Table A

Map - 15

Reduce - 0

Cumulative CPU - 51.18 sec

Time taken - 25.296 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 117.08 sec

Time taken - 27.448 seconds

Multi Row Selection Using Where clause [1000 rows fetched]

Table A

Map - 15

Reduce - 0

Cumulative CPU - 57.55 sec

Time taken - 20.254 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 113.97 sec

Time taken - 27.678 seconds

Multi Row Selection [with only 4 columns] Using Where clause [1000 rows fetched]

Table A

Map - 15

Reduce - 0

Cumulative CPU - 57.55 sec

Time taken - 20.254 seconds

Table B

Map - 8

Reduce - 0

Cumulative CPU - 113.97 sec

Time taken - 27.678 seconds

Aggregation operation [Using sum function on a given column]

Table A

Map - 15

Reduce - 1

Cumulative CPU - 127.85 sec

Time taken - 29.68 seconds

Table B

Map - 8

Reduce - 1

Cumulative CPU - 255.2 sec

Time taken - 41.025 seconds

You can see that in almost all the operations that I have applied on both the tables, Parquet is lagging behind in terms of time taken to execute the query with an exception of row count operation.

I also used table C to perform the aforementioned operations but the results were almost on similar lines with TextFile format again was snappier of the two.

Can some one please let me know what I am doing wrong?

Thanks!

EDIT

I added ORC to the list of storage formats and ran the tests again. Follows the details.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec

Does that mean ORC is faster then Parquet? Or there is something that I can do to make it work better with query response time and compression ratio?

Thanks!

标签： hadoop hive parquet mapr snappy

0条回答

Why is parquet slower for me against text file for

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间