DataProcHiveOperator not running hql file stored in Cloud Storage

Posted 2019-02-24 09:09

Question:

I am trying to run an hql file stored in Cloud Storage through an Airflow script. There are two parameters through which the path can be passed to DataProcHiveOperator:

  1. query: 'gs://bucketpath/filename.q'

Error occurring: cannot recognize input near 'gs' ':' '/'

  2. query_uri: 'gs://bucketpath/filename.q'

Error occurring: PendingDeprecationWarning: Invalid arguments were passed to DataProcHiveOperator. Support for passing such arguments will be dropped in Airflow 2.0. Invalid arguments were: *args: () **kwargs: {'query_uri': 'gs://poonamp_pcloud/hive_file1.q'}

Using the query param, I have successfully run inline Hive queries (e.g. select * from my_table).
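
For reference, the inline form that worked looks roughly like this (a sketch; the cluster, region, and table names below are placeholders):

from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

# Inline HQL passed directly through the query parameter.
# Cluster, region, and table names are placeholders.
hive_inline = DataProcHiveOperator(task_id='hive_inline',
    gcp_conn_id='google_cloud_default',
    query='select * from my_table',
    cluster_name='cluster-name',
    region='us-central1',
    dag=dag)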

Is there any way to run an hql file stored in a Cloud Storage bucket through DataProcHiveOperator?

Answer 1:

That's because you are using both query and query_uri.

If you are querying from a file, you have to use query_uri and set query=None (or simply leave query out).

If you are running an inline query, use query instead.

Here is a sample for querying through a file.

from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

HiveInsertingTable = DataProcHiveOperator(task_id='HiveInsertingTable',
    gcp_conn_id='google_cloud_default',
    # query_uri points at the hql file in Cloud Storage; query is left unset
    query_uri="gs://us-central1-bucket/data/sample_hql.sql",
    cluster_name='cluster-name',
    region='us-central1',
    dag=dag)
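
The key point is that the two parameters are mutually exclusive: when query is set, the operator treats its value as inline HQL, which is why passing a gs:// path through query produces the "cannot recognize input near 'gs'" parse error, while query_uri tells the Dataproc job to fetch and run the file itself.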


Answer 2:

query_uri is indeed the correct parameter for running an hql file from Cloud Storage. However, it was only added to the DataProcHiveOperator in https://github.com/apache/incubator-airflow/pull/2402. Based on the warning message you got, I don't think you're running code that supports the parameter. The change is not in the latest release (1.8.2), so you'll need to wait for another release or grab it off the master branch.
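
Until you are on a version with that change, one possible workaround is to read the file contents yourself and pass them through the supported query parameter. A minimal sketch, assuming the contrib GoogleCloudStorageHook is available and reading the file at DAG-parse time is acceptable (bucket, object, and cluster names are placeholders):

from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
from airflow.contrib.operators.dataproc_operator import DataProcHiveOperator

# Fetch the hql file contents when the DAG file is parsed.
# download() returns the raw contents; decode if your version returns bytes.
gcs = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
hql_text = gcs.download(bucket='bucketpath', object='filename.q')

# Pass the file contents as an inline query, which older versions support.
hive_from_file = DataProcHiveOperator(task_id='hive_from_file',
    gcp_conn_id='google_cloud_default',
    query=hql_text,
    cluster_name='cluster-name',
    dag=dag)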