Hadoop ORC file - How it works - How to fetch meta

2020-06-04 13:27发布

问题:

I am new to ORC file. I went through many blogs, but didn't get clear understanding. Please help and clarify below questions.

  1. Can I fetch schema from ORC file? I know in Avro, schema can fetched.

  2. How it actually provides schema evolution? I know that few columns can be added. But how to do it. The only I know, creating orc file is by loading data into hive table which store data in orc format.

  3. How ORC files index works? What I know is for every stripe index will be maintained. But as file is not sorted how it helps looking up data in list of stripes. How it helps in skipping stripes while looking up for the data?

  4. Is index maintained for every column. If yes, then is it not going to consume more memory?

  5. How columnar format ORC file can fit into hive table, where values of each columns are stored together. whereas hive table is made to fetch record by record. How both will fit together?

回答1:

1. and 2. Use Hive and/or HCatalog to create, read, update ORC table structure in the Hive metastore (HCatalog is just a side door than enables Pig/Sqoop/Spark/whatever to access the metastore directly)

2. ALTER TABLE command allows to add/drop columns whatever the storage type, ORC included. But beware of a nasty bug that may crash vectorized reads after that (at least in V0.13 and V0.14)

3. and 4. The term "index" is rather inappropriate. Basically it's just min/max information persisted in the stripe footer at write time, then used at read time for skipping all stripes that are clearly not meeting the WHERE requirements, drastically reducing I/O in some cases (a trick that has become popular in columns stores e.g. InfoBright on MySQL, but also in Oracle Exadata appliances [dubbed "smart scan" by Oracle marketing])

5. Hive works with "row store" formats (Text, SequenceFile, AVRO) and "column store" formats (ORC, Parquet) alike. The optimizer just uses specific strategies and shortcuts on the initial Map phase -- e.g. stripe elimination, vectorized operators -- and of course the serialization/deserialization phases are a bit more elaborate with column stores.



回答2:

Hey i can not help you with all of your questions but i'll give it a try

  1. you can use the filedump utility to read out the metadata of an ORC-file see here

  2. I am very unsure about the schema evolution but as far as i know ORC does not support evolution.

  3. ORC index stores sum min and max so if your data is totally unstructured you probably would still have to read a lot of data. But since the latest release of ORC you can anable an additional Bloom-Filter which is more accurate in row group elimination. Maybe this could be helpful too orc-user mailing list

  4. ORC provides an index for every column but it's just a light weight index. You store information about min/max and sum on numeric columns in the filefooter, stripefooter and by default every 10000 rows. so it does not take that much space

  5. If you store your table in Orc Fileformat Hive will use an specific ORC Recordreader to extract the rows from the columns. The advantage of columnar storage is that you do not have to read the whole row