How to load multi-line column data in hive table?

2019-07-26 04:45发布

问题:

I have a column (not the last column) in Excel file that contains data which is spanning over few lines.

Some cells of column is blank and some have single lines entries.

When saving as .CSV file or a tab separated .txt from excel, all the multi-line data and few single line entries are getting generated in double quotes, None of the blank fields are in quotes. Some of the single line entries are not within quotes.

Is it possible to store the data with this same structure in a hive table? If Yes, how can this be done? I understand I need to escape all the LF within double-quotes and take care of the last LF only as the actual EOL. But the moment a '\n' is encountered, Hive takes data to a new row.

The format of data in excel is like as below:

|------+------+--------+------------------+-------+------|
|row1: | col1 | col2   | col3(multi-line) | col4  | col5 |
|------+------+--------+------------------+-------+------|
|      |      |        | line 1 of 3      |       |      |
|row2: | abc  | defsa  | line 2 of 3      | bcde  | hft  |
|      |      |        | line 3 of 3      |       |      |
|------+------+--------+------------------+-------+------|
|row3: | abc2 | defsa2 | (blank)          | bcde2 | hft2 |
|------+------+--------+------------------+-------+------|
|row4: | abc3 | defsa3 | single-line1     | bcde3 | hft3 |
|------+------+--------+------------------+-------+------|
|row5: | abc4 | defsa4 | single-line2     | bcde4 | hft4 |
|------+------+--------+------------------+-------+------|

When saved as CVS it outputs to the following:

row1--col1,col2,col3(multi-line),col4,col5
row2--abc,defsa,line 1 of 3",,,,,,
row3--line 2 of 3,,,,,,
row4--line 3 of 3,,,,,,
row5--",bcde,hft
row6--abc2,defsa2,,bcde2,hft2
row7--abc3,defsa3,single-line1,bcde3,hft3
row8--abc4,defsa4,single-line2",,,,,,
row9--",bcde4,hft4

5 rows of excel to 9 rows of csv.

Appreciate inputs to store from this .csv file into a hive table, if possible without changing the structure and maintaining the multi-line column.

回答1:

From this link, the provided SerDe cannot handle embedded new lines. My guess is that if you want embedded new lines, you will have to create a custom SerDe. Without looking too deeply into it, this is a good resource that might help in creating a custom SerDe.

Have you tried using Pig to process the data before loading it to Hive, e.g. you could substitute the \n char with something else before moving it to Hive. But you might run into the same problem of not being able to load it into Pig accurately since it's probably using the same SerDe.

Ultimately, a custom SerDe WILL solve your problem, but there might be another easier way I'm not seeing.