I have been working on a project that includes a Hive query:
INSERT OVERWRITE TABLE .... SELECT TRANSFORM (....) USING 'python script.py' FROM .... LEFT OUTER JOIN ... LEFT OUTER JOIN ... LEFT OUTER JOIN ...
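For context, the full query has roughly this shape (table, column, and output names below are placeholders, not the real ones, and the real query has more joins):

ADD FILE script.py;

INSERT OVERWRITE TABLE output_table
SELECT TRANSFORM (a.id, b.val, c.val)
USING 'python script.py'
AS (id STRING, result STRING)
FROM table_a a
LEFT OUTER JOIN table_b b ON (a.id = b.id)
LEFT OUTER JOIN table_c c ON (a.id = c.id);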
At the beginning everything worked fine, until we loaded a large amount of dummy data (we simply wrote the same records many times, with small variations in some fields). Since then, running the query fails with a Broken pipe error and not much more information: there is no log about the error, just the IOException: Broken pipe. (My understanding is that a broken pipe means the script stopped reading its input before Hive finished streaming rows to it, for example because the script died, but nothing in the logs says why.)
To simplify the script and isolate the error, we reduced it to a plain pass-through:
import sys

for line in sys.stdin.readlines():
    sys.stdout.write(line)    # pass each row through unchanged
just to rule out any error at that level. We still get the same error.
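A way to sanity-check the script independently of the joins is to stream a single source table through it (placeholder names again, and the target table is assumed to exist):

-- Run the same TRANSFORM with no joins involved
INSERT OVERWRITE TABLE debug_output
SELECT TRANSFORM (id, val)
USING 'python script.py'
AS (id STRING, val STRING)
FROM table_a;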
Another workaround is to take the TRANSFORM out of this query: first insert the joined data into an intermediate table, then run the transformation in a second query over that table. I'm not 100% sure why this helps, since the script itself is correct; my guess is that the amount of data streamed through the script is just too large because of the many joins.
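In other words, something like this (placeholder names once more):

-- Step 1: materialize the joins, with no TRANSFORM involved
INSERT OVERWRITE TABLE joined_staging
SELECT a.id, b.val AS val_b, c.val AS val_c
FROM table_a a
LEFT OUTER JOIN table_b b ON (a.id = b.id)
LEFT OUTER JOIN table_c c ON (a.id = c.id);

-- Step 2: stream only the pre-joined rows through the script
INSERT OVERWRITE TABLE output_table
SELECT TRANSFORM (id, val_b, val_c)
USING 'python script.py'
AS (id STRING, result STRING)
FROM joined_staging;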
The problem seems to be solved by splitting the joins across several queries with intermediate tables, and then adding a final query with one last join that brings all the previous results together. As I understand it, this means there is no error at the script level; there is just too much data for Hive to handle in a single query.
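Roughly like this (still with placeholder names):

-- One join per query, each result materialized in an intermediate table
INSERT OVERWRITE TABLE step1
SELECT a.id, b.val AS val_b
FROM table_a a
LEFT OUTER JOIN table_b b ON (a.id = b.id);

INSERT OVERWRITE TABLE step2
SELECT s.id, s.val_b, c.val AS val_c
FROM step1 s
LEFT OUTER JOIN table_c c ON (s.id = c.id);

-- Final query: the last join, then the TRANSFORM over the combined result
INSERT OVERWRITE TABLE output_table
SELECT TRANSFORM (s.id, s.val_b, s.val_c, d.val)
USING 'python script.py'
AS (id STRING, result STRING)
FROM step2 s
LEFT OUTER JOIN table_d d ON (s.id = d.id);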