Python package errors while running GCP Dataflow


Question:

This error appeared suddenly; nothing was changed in the Dataflow pipeline. We are seeing the error "NameError: global name 'firestore' is not defined [while running 'generatedPtransform-12478']". It looks like there is a problem installing the packages on the worker nodes.

I ran the same pipeline locally with the DirectRunner and it worked fine. We also consulted the "NameError" guidance in the Dataflow documentation ("https://cloud.google.com/dataflow/docs/resources/faq#how-can-i-tell-what-version-of-the-cloud-dataflow-sdk-is-installedrunning-in-my-environment") and tried the two things below:

1. Set the 'save_main_session': True pipeline option.

2. Moved all package 'import' statements from global scope into function scope (sketched below).
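
For fix (2), the imports were moved roughly as sketched below. This is a minimal illustration only; the use of setup() and the collection name 'my_collection' are assumptions, not the exact production code:

    import apache_beam as beam

    class WriteToFS(beam.DoFn):
        def setup(self):
            # Import on the worker so the module is resolved where the DoFn
            # actually runs, not only in the process that launches the job.
            from google.cloud import firestore
            self.client = firestore.Client()

        def process(self, element):
            # 'my_collection' is a hypothetical collection name for illustration.
            self.client.collection('my_collection').add(element)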

We have the following packages in requirements.txt:

  • apache-beam[gcp]

  • google-cloud-firestore

  • python-dateutil
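
The file itself is a plain pip requirements list; the pinned versions shown here are illustrative assumptions, not the exact versions in use:

    apache-beam[gcp]==2.14.0
    google-cloud-firestore==1.4.0
    python-dateutil==2.8.0

The relevant pipeline code is shown below: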
    import datetime
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from google.cloud import firestore
    import yaml
    from functools import reduce
    from dateutil.parser import parse

    class PubSubToDict(beam.DoFn):
        ...  # body elided: processes elements

    class WriteToFS(beam.DoFn):
        ...  # body elided: writes data to Firestore

    pipeline_options = {
        'project': PROJECT,
        'staging_location': 'gs://' + BUCKET + '/staging',
        'temp_location': 'gs://' + BUCKET + '/temp',
        'runner': 'DataflowRunner',
        'job_name': JOB_NAME,
        'disk_size_gb': 100,
        'save_main_session': True,
        'region': 'europe-west1',
        'requirements_file': 'requirements.txt',
        'streaming': True
    }

    # Build PipelineOptions from the dict defined above
    options = PipelineOptions.from_dictionary(pipeline_options)

    with beam.Pipeline(options=options) as p:

        lines = (p | "Read from PubSub" >> beam.io.ReadFromPubSub(topic=TOPIC).with_output_types(bytes)
                   | "Transformation" >> beam.ParDo(PubSubToDict()))

        FSWrite = (lines | 'Write To Firestore' >> beam.ParDo(WriteToFS()))