I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 xls files. Each xls file has multiple sheets and the query concerns only one of the sheets.
The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
I am not using any commercial distro of Hadoop for my tasks; I just have one master and one slave VM set up in VMware, with Hadoop, Hive, and Pig installed on them.
I am a novice with Hadoop and Big Data, so if anyone could guide me on how to proceed I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks.
In Hive you cannot load data into tables directly from xls files the way you can from txt or csv files.
You have two options:
OR
Both have their pros and cons. If you intend to use an application that interacts with Hive for loading, querying, transforming, etc., you can go with option 1. If you intend to work through scripts or batch jobs, you can go with option 2.
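Since Hive reads delimited text but not xls, one possible script-based approach is to export the one relevant sheet of each workbook to CSV first. The sketch below is only an illustration of that idea and not necessarily either option meant above. It assumes the third-party xlrd package is installed for reading .xls files; the function names are hypothetical.

```python
import csv


def rows_to_csv(rows, out):
    """Write an iterable of row sequences to the file-like object `out` as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row)


def sheet_rows(xls_path, sheet_name):
    """Yield each row of one sheet of an .xls workbook as a list of cell values.

    Assumption: the xlrd package is available (pip install xlrd).
    """
    import xlrd  # imported lazily so rows_to_csv works without xlrd installed

    book = xlrd.open_workbook(xls_path)
    sheet = book.sheet_by_name(sheet_name)
    for i in range(sheet.nrows):
        yield sheet.row_values(i)


def convert(xls_path, sheet_name, csv_path):
    """Export a single named sheet of an .xls file to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        rows_to_csv(sheet_rows(xls_path, sheet_name), out)
```

You would call `convert` once per workbook with the path, the sheet name, and an output path of your choosing. The resulting files can then be loaded into a table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` using `LOAD DATA LOCAL INPATH`. Note that the Census sheets may contain title rows, merged headers, or cells with embedded commas, so some cleanup (or a different delimiter) may be needed before the data is usable in Hive.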