Executing a Dataflow job with multiple inputs/outputs

Posted 2019-08-05 06:14

Question:

I've designed a data transformation in Dataprep and am now attempting to run it using the template in Dataflow. My flow has several inputs and outputs, and the Dataflow template provides them as a JSON object with a key/value pair for each input and location. They look like this (line breaks added for easy reading):

{
    "location1": "project:bq_dataset.bq_table1",
    #...
    "location10": "project:bq_dataset.bq_table10",
    "location17": "project:bq_dataset.bq_table17"
}

I have 17 inputs (mostly lookups) and 2 outputs (one csv, one bigquery). I'm passing these to the gcloud CLI like this:

gcloud dataflow jobs run job-201807301630 \
    --gcs-location=gs://bucketname/dataprep/dataprep_template \
    --parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}

But I'm getting an error:

ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv

From the error message, it looks as though the inputs and outputs are being merged: because I have two outputs, the inputs are paired with them in alternation:

input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...

I've tried quoting the input/output objects (single and double quotes, plus removing the quotes inside the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?

Answer 1:

I finally found a solution for this via a huge process of trial and error. There are several steps involved.

Format of --parameters

The --parameters argument is a dictionary-type argument. These are described in a document you can read by running gcloud topic escaping in the CLI. In short, you need an = between --parameters and its arguments, and the arguments are key=value pairs with each value enclosed in double quote marks ("):

--parameters=inputLocations="object",outputLocations="object"

Escape the objects

Then, the objects need the quotes escaping to avoid ending the value prematurely, so

{"location1":"gcs://bucket/whatever"...

Becomes

{\"location1\":\"gcs://bucket/whatever\"...
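To see what the escaping achieves, here is a small sketch that assumes a POSIX shell. It compares the backslash-escaped form with a single-quoted form. The variable names are made up, the object is cut down to one complete entry for illustration, and the single-quote variant is an alternative that the steps above don't use.

```shell
# Backslash-escaped double quotes, as described above.
escaped="{\"location1\":\"gcs://bucket/whatever\"}"

# Alternative (an assumption, not from the answer): single quotes keep the
# inner double quotes literal in a POSIX shell, so no backslashes are needed.
single='{"location1":"gcs://bucket/whatever"}'

# Both variables hold the same text with the inner quote marks intact.
printf '%s\n' "$escaped"
printf '%s\n' "$single"
```

Either way, the shell strips the outer quotes and hands the object on with its inner quote marks still in place.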

Choose a different separator

Next, the CLI gets confused because, while the key=value pairs are separated by commas, the objects in the values also contain commas. You can define a different separator by putting it between carets (^) at the start of the argument and then using it between the key=value pairs:

--parameters=^*^inputLocations="{\"location1\":\"...\"}"*outputLocations="{\"location1\":\"...\"}"

I used * because ; didn't work - probably because the shell treats an unquoted ; as the end of a command.
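A minimal sketch of assembling the value with the custom separator, again assuming a POSIX shell. The table and bucket names are placeholders borrowed from the question, and the variable names are hypothetical.

```shell
# Placeholder objects; each one contains commas of its own.
inputs='{"location1":"project:bq_dataset.bq_table1","location2":"project:bq_dataset.bq_table2"}'
outputs='{"location1":"gs://bucketname/output.csv","location2":"project:bq_dataset.bq_output1"}'

# ^*^ at the start declares * as the separator between key=value pairs,
# so the commas inside the objects are left alone.
params="^*^inputLocations=${inputs}*outputLocations=${outputs}"

# Show the argument as it would be handed to gcloud.
printf '%s\n' "--parameters=${params}"
```

Keeping each object in its own single-quoted variable means the only * characters in the final string are the separators.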

Note also that the gcloud topic escaping info says:

In cmd.exe and PowerShell on Windows, ^ is a special character and you must escape it by repeating it. In the following examples, every time you see ^, replace it with ^^^^.

Don't forget customGcsTempLocation

After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again:

...}*customGcsTempLocation="gs://bucket/whatever"
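Putting the steps together, the full call might look something like the sketch below (POSIX shell assumed). The job name, template path and locations are the placeholders from the question, the temp path is invented, and the command is only printed here rather than executed.

```shell
# Placeholder objects with one entry each; real jobs would list every location.
inputs='{"location1":"project:bq_dataset.bq_table1"}'
outputs='{"location1":"gs://bucketname/output.csv"}'
temp='gs://bucketname/tmp'   # hypothetical temp path

# All three key=value pairs, separated by * as declared by the leading ^*^.
params="^*^inputLocations=${inputs}*outputLocations=${outputs}*customGcsTempLocation=${temp}"

# Store the command as positional parameters so each argument stays intact.
set -- gcloud dataflow jobs run job-201807301630 \
    --gcs-location=gs://bucketname/dataprep/dataprep_template \
    --parameters="$params"

# Print one argument per line; replace this line with "$@" to run it for real.
printf '%s\n' "$@"
```

Printing the arguments first is a cheap way to check the quoting before submitting a job.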

Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.