In Hive, you can stream a table through a script to transform the data:
ADD FILE replace-nan-with-zeros.py;
SELECT
TRANSFORM (...)
USING 'python replace-nan-with-zeros.py'
AS (...)
FROM some_table;
I have a simple Python script:
#!/usr/bin/env python
import sys

kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')
        # replace NaNs with zeros
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1
        print('\t'.join(outputs))

if __name__ == "__main__":
    main(sys.argv[1:])
How can I make kFirstColumns a command-line parameter (or some other kind of parameter) to this Python script?
Thank you!
The solution is trivial. Use
ADD FILE replace-nan-with-zeros.py;
SELECT
TRANSFORM (...)
USING 'python replace-nan-with-zeros.py 7'
AS (...)
FROM some_table;
instead of just
...
USING 'python replace-nan-with-zeros.py'
...
It works fine for me.
The Python script should be changed to read:
kFirstColumns = int(sys.argv[1])
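Putting it together, a complete version of the script with the threshold taken from the command line might look like this (a sketch; the transform logic is pulled into a helper function for clarity, but matches the original behavior):

```python
#!/usr/bin/env python
import sys

def transform(line, first_columns):
    """Replace NaN with 0.0 in every column after the first `first_columns`."""
    inputs = line.strip().split('\t')
    outputs = []
    for index, value in enumerate(inputs, start=1):
        if index > first_columns:
            value = value.replace('NaN', '0.0')
        outputs.append(value)
    return '\t'.join(outputs)

def main(first_columns):
    for line in sys.stdin:
        print(transform(line, first_columns))

if __name__ == "__main__":
    main(int(sys.argv[1]))
```

Invoked as `python replace-nan-with-zeros.py 7`, this leaves the first 7 columns untouched and replaces NaN everywhere else.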
Well, you are already sort of doing it. You are grabbing sys.argv[1:] and passing it to main, but you never use the arguments. The easiest change would be to modify your script as follows:
def main(kFirstColumns):
    ...

if __name__ == "__main__":
    main(int(sys.argv[1]))
Then run your script like
$ python myScript.py 7
Then, you can look at argparse when you want to do more complicated command line options.
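For instance, a minimal argparse version might look like this (the argument and help text are illustrative, not from the original script):

```python
import argparse

def parse_args(argv=None):
    # argv=None makes argparse read sys.argv[1:] when run as a script,
    # while still allowing an explicit list for testing.
    parser = argparse.ArgumentParser(
        description='Replace NaN with 0.0 after the first N columns.')
    parser.add_argument('first_columns', type=int,
                        help='number of leading columns to leave untouched')
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    kFirstColumns = args.first_columns
```

Running `python replace-nan-with-zeros.py 7` then sets args.first_columns to 7, and argparse gives you usage/error messages for free.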
A bit of a hack, but you could pass the parameter by including it as an additional column in your query.
SELECT
TRANSFORM (...)
USING 'python replace-nan-with-zeros.py'
AS (...)
FROM (SELECT 7 AS kFirstColumns, * FROM some_table);
Then, when you parse the row in your script, the first column value will be the parameter you are looking for. Simply pop it into your local variable to remove it from the list of column values.
line = line.strip()
inputs = line.split('\t')
kFirstColumns = int(inputs.pop(0))
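A sketch of the full per-line handling under this approach (the helper name is hypothetical), where the injected parameter column is popped off before the usual NaN replacement:

```python
import sys

def process_line(line):
    """First column carries the parameter; pop it before transforming."""
    inputs = line.strip().split('\t')
    first_columns = int(inputs.pop(0))  # the injected kFirstColumns value
    outputs = []
    for index, value in enumerate(inputs, start=1):
        if index > first_columns:
            value = value.replace('NaN', '0.0')
        outputs.append(value)
    return '\t'.join(outputs)

if __name__ == "__main__":
    for line in sys.stdin:
        print(process_line(line))
```

Note that the parameter column is consumed and not emitted, so the AS (...) column list in the Hive query should not include it.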
Hope that helps.