How to load xls data from multiple xls file into h

Posted 2019-04-15 00:14

Question:

I am learning to use Hadoop for performing Big Data related operations.

I need to perform some queries on a collection of data sets split across 8 xls files. Each xls file has multiple sheets and the query concerns only one of the sheets.

The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html

I am not using any commercial distro of Hadoop for my tasks; I just have one master and one slave VM set up in VMware, with Hadoop, Hive and Pig installed on them.

I am a novice with Hadoop and Big Data, so if anyone could guide me on how to proceed I'd be very grateful.

If you need information on the queries or anything else let me know.

Thanks.

Answer 1:

In Hive you cannot load data into tables directly from xls files the way you can from txt or csv files.

You have two options:

  1. Write an application (e.g. in Java) that reads the xls files and converts them into text or csv files, which can then be loaded directly into Hive.

OR

  2. Create your own SerDe (Serializer/Deserializer) that parses the xls data so it can be loaded into a table.
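For the first option, the conversion step could look roughly like the sketch below. It is written in Python rather than Java, and it assumes the third-party xlrd package is installed for reading legacy .xls files. The function names, the directory layout and the sheet name are all hypothetical.

```python
import csv
from pathlib import Path


def read_sheet_rows(xls_path, sheet_name):
    """Yield each row of one sheet as a list of cell values.

    Assumption: the third-party xlrd package is available.
    """
    import xlrd  # imported lazily so the CSV helper below works without it

    book = xlrd.open_workbook(str(xls_path))
    sheet = book.sheet_by_name(sheet_name)
    for i in range(sheet.nrows):
        yield sheet.row_values(i)


def write_rows_as_csv(rows, csv_path):
    """Write rows to csv_path and return the number of rows written."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def convert_all(xls_dir, sheet_name, out_dir):
    """Convert the same sheet of every .xls file in xls_dir to one CSV each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for xls_path in sorted(Path(xls_dir).glob("*.xls")):
        rows = read_sheet_rows(xls_path, sheet_name)
        write_rows_as_csv(rows, out_dir / (xls_path.stem + ".csv"))
```

Note that csv.writer quotes any field containing a comma, and a plain comma-delimited Hive text table does not interpret those quotes, so such fields may need cleaning or a different delimiter. Title and header rows in the sheet are also copied as-is and would end up as data unless they are dropped during conversion.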

Both have their pros and cons. If you intend to use an application that interacts with Hive for loading, querying, transforming and so on, you can go with option 1. If you intend to work through scripts or batch jobs, you can go with option 2.
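Once the data is in csv form, loading it is ordinary Hive. As a rough illustration, the helper below builds the HiveQL statements for a comma-delimited text table and one LOAD DATA statement per file. The helper itself, the table name, the column names and the paths are hypothetical; the real columns depend on the sheet you are querying.

```python
def hive_load_statements(table, columns, csv_paths):
    """Return HiveQL strings: one CREATE TABLE, then one LOAD DATA per file.

    columns is a list of (name, hive_type) pairs chosen by the caller.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    statements = [
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE"
    ]
    for path in csv_paths:
        # LOCAL reads from the local filesystem of the machine running the Hive client
        statements.append(f"LOAD DATA LOCAL INPATH '{path}' INTO TABLE {table}")
    return statements


if __name__ == "__main__":
    for stmt in hive_load_statements(
        "utilization",
        [("category", "STRING"), ("visits", "DOUBLE")],
        ["/tmp/csv/table1.csv", "/tmp/csv/table2.csv"],
    ):
        print(stmt + ";")
```

The printed statements can be saved to a file and run with the hive command-line client, or pasted into its shell.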