How to pass parameters to a Python streaming script

Posted 2019-06-05 23:34

Question:

A Hive user can stream a table through a script to transform its data:

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py'
  AS (...)
FROM some_table;

I have a simple Python script:

#!/usr/bin/env python
import sys


kFirstColumns = 7  # number of leading columns to leave untouched

def main(argv):

    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')

        # replace NaNs with zeros
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1

        # the parenthesized form runs under both Python 2 and Python 3
        print('\t'.join(outputs))

if __name__ == "__main__":
    main(sys.argv[1:])

How can I make kFirstColumns a command-line argument, or some other kind of parameter, to this Python script?

Thank you!

Answer 1:

The solution is really trivial. Use

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py 7'
  AS (...)
FROM some_table;

instead of just

  ...
  USING 'python replace-nan-with-zeros.py'
  ...

It works fine for me.

The Python script should be changed to read the value from its first argument:

kFirstColumns = int(sys.argv[1])
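
As a minimal sketch of the whole script with this change applied, the version below reads the column count from the first command-line argument. The helper names parse_first_columns and replace_nans, and the fallback to 7 when no argument is given, are assumptions made for illustration; they are not part of the original script.

```python
#!/usr/bin/env python
import sys


def parse_first_columns(argv, default=7):
    """Return the column count from argv[1], or a default if it is absent."""
    return int(argv[1]) if len(argv) > 1 else default


def replace_nans(line, first_columns):
    """Replace 'NaN' with '0.0' in every column after the first N columns."""
    # Strip only the newline; unlike the original strip(), this keeps
    # empty leading or trailing columns.
    fields = line.rstrip("\n").split("\t")
    # Columns are counted from 1, as in the original script.
    cleaned = [
        value if index <= first_columns else value.replace("NaN", "0.0")
        for index, value in enumerate(fields, start=1)
    ]
    return "\t".join(cleaned)


def main(argv, stdin=sys.stdin, stdout=sys.stdout):
    first_columns = parse_first_columns(argv)
    for line in stdin:
        stdout.write(replace_nans(line, first_columns) + "\n")


if __name__ == "__main__":
    main(sys.argv)
```

With this layout the Hive side stays exactly as shown above: the number after the script name in the USING clause becomes sys.argv[1].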


Answer 2:

Well, you are already sort of doing it.

You are grabbing sys.argv[1:] and passing it to main, but main never uses the arguments. The easiest route would be to change your script as follows:

def main(kFirstColumns):
    ...

if __name__ == "__main__":
    main(int(sys.argv[1]))

Then run your script like this:

$ python myScript.py 7

Later, you can look at argparse when you need more complicated command-line options.
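
For when that time comes, here is a rough sketch of what an argparse version might look like. The positional name first_columns, its default of 7, and the --replacement option are all hypothetical choices for this example and do not come from the original script.

```python
#!/usr/bin/env python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        description="Replace NaN in the trailing columns of tab-separated input."
    )
    # Hypothetical positional argument; optional, defaulting to 7.
    parser.add_argument("first_columns", type=int, nargs="?", default=7,
                        help="number of leading columns to leave untouched")
    # Hypothetical option, added only to show a second parameter.
    parser.add_argument("--replacement", default="0.0",
                        help="text that replaces NaN")
    return parser


def main(argv=None, stdin=sys.stdin, stdout=sys.stdout):
    # argv=None makes argparse read sys.argv[1:].
    args = build_parser().parse_args(argv)
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        head = fields[:args.first_columns]
        tail = [value.replace("NaN", args.replacement)
                for value in fields[args.first_columns:]]
        stdout.write("\t".join(head + tail) + "\n")


if __name__ == "__main__":
    main()
```

It would be run the same way as before, for example `python myScript.py 7 --replacement 0`.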



Answer 3:

A bit of a hack, but you could pass the parameter by including it as an additional column in your query.

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py'
  AS (...)
FROM (SELECT 7 AS kFirstColumns, * FROM some_table) t;

Then, when you parse the row in your script, the first column value will be the parameter you are looking for. Pop it into a local variable, which also removes it from the list of column values.

line = line.strip()
inputs = line.split('\t')
kFirstColumns = int(inputs.pop(0))  # the value arrives as text, so convert it
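
To show how this fits into the per-row processing, here is a small sketch under the assumption that the parameter is always the first tab-separated field of every row. The function names split_parameter and transform are made up for this example. Note that the value arrives as text, so it has to be converted with int() before it is compared with a column index.

```python
def split_parameter(line):
    """Split one streamed row into (first_columns, remaining_fields)."""
    fields = line.rstrip("\n").split("\t")
    # The query placed the parameter in the first column.
    first_columns = int(fields.pop(0))
    return first_columns, fields


def transform(line):
    """Replace 'NaN' with '0.0' after the first N real columns of a row."""
    first_columns, fields = split_parameter(line)
    head = fields[:first_columns]
    tail = [value.replace("NaN", "0.0") for value in fields[first_columns:]]
    return "\t".join(head + tail)
```

Because the parameter is removed before the row is written back out, the output columns line up with the original table.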

Hope that helps.