I've designed a data transformation in Dataprep and am now attempting to run it by using the template in Dataflow. My flow has several inputs and outputs, and the Dataflow template provides them as a JSON object with key/value pairs for each input and location. They look like this (line breaks added for easy reading):
{
"location1": "project:bq_dataset.bq_table1",
#...
"location10": "project:bq_dataset.bq_table10",
"location17": "project:bq_dataset.bq_table17"
}
I have 17 inputs (mostly lookups) and 2 outputs (one CSV, one BigQuery). I'm passing these to the gcloud CLI like this:
gcloud dataflow jobs run job-201807301630 \
--gcs-location=gs://bucketname/dataprep/dataprep_template \
--parameters inputLocations={"location1":"project..."},outputLocations={"location1":"gs://bucketname/output.csv"}
But I'm getting an error:
ERROR: (gcloud.dataflow.jobs.run) unrecognized arguments:
inputLocations=location1:project:bq_dataset.bq_table1,outputLocations=location2:project:bq_dataset.bq_output1
inputLocations=location10:project:bq_dataset.bq_table10,outputLocations=location1:gs://bucketname/output.csv
From the error message, it looks as though the inputs and outputs are being merged: since I have two outputs, each input is paired with one of the two outputs in turn:
input1:output1
input2:output2
input3:output1
input4:output2
input5:output1
input6:output2
...
I've tried quoting the input/output objects (single and double quotes, plus removing the quotes inside the object), wrapping them in [], and using tildes, but no joy. Has anyone managed to execute a Dataflow job with multiple inputs?
I finally found a solution for this via a huge process of trial and error. There are several steps involved.
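For what it's worth, the pairing in the error message looks like shell brace expansion rather than anything gcloud itself does. Assuming a bash-like shell, an unquoted {...,...} is expanded into one word per comma-separated item before gcloud ever sees it, and two such groups in the same word give every combination. Here is a small sketch with made-up keys and values (in, out, a, b, x and y are placeholders, and bash is called explicitly because plain sh does not do brace expansion):

```shell
# Placeholder keys and values, standing in for the real inputLocations/outputLocations.
# The command is single-quoted so that only the inner bash expands the braces.
expanded=$(bash -c 'echo in={"a":"1","b":"2"},out={"x":"3","y":"4"}')

# One word per input/output combination, with the quotes and braces gone:
echo "$expanded"
# in=a:1,out=x:3 in=a:1,out=y:4 in=b:2,out=x:3 in=b:2,out=y:4
```

If that is what is happening, enclosing the objects in quote marks, as in the steps below, is what stops the expansion.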
Format of --parameters
The --parameters argument is a dictionary-type argument. There are details on these in a document you can read by typing gcloud topic escaping in the CLI, but in short it means you'll need an = between --parameters and the arguments, and then the format is key=value pairs with the value enclosed in quote marks (").
Escape the objects
Then, the objects need their quotes escaping to avoid ending the value prematurely, so each " inside an object becomes \".
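A minimal sketch of the escaping, assuming a bash-like shell and two placeholder table names (INPUTS is just a variable name chosen for this example, not something gcloud requires):

```shell
# Table names are placeholders. Each " inside the object is written as \" so that
# it does not close the surrounding quote marks early.
INPUTS="{\"location1\":\"project:bq_dataset.bq_table1\",\"location2\":\"project:bq_dataset.bq_table2\"}"

# The shell removes the outer quotes and the backslashes, leaving plain JSON:
printf '%s\n' "$INPUTS"
# {"location1":"project:bq_dataset.bq_table1","location2":"project:bq_dataset.bq_table2"}
```

The same escaped object is what goes after inputLocations= inside the --parameters argument.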
Choose a different separator
Next, the CLI gets confused because while the key=value pairs are separated by a comma, the values also have commas in the objects. So you can define a different separator by putting it between carets (^) at the start of the argument and then using it between the key=value pairs.
I used * because ; didn't work - maybe because it marks the end of the CLI command? Who knows.
Note also that the gcloud topic escaping info covers this alternate-separator syntax.
Don't forget customGcsTempLocation
After all that, I'd forgotten that customGcsTempLocation needs adding to the key=value pairs in the --parameters argument. Don't forget to separate it from the others with a * and enclose it in quote marks again.
Pretty much none of this is explained in the online documentation, so that's several days of my life I won't get back - hopefully I've helped someone else with this.
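Putting the steps together, the final command might look something like the sketch below. It assumes a bash-like shell, and everything in it is a placeholder: the table names, the second output, and especially the temp path gs://bucketname/dataprep/temp, which is invented for illustration. The variable names are also just for this example. The command is printed with echo rather than run, so nothing is submitted.

```shell
# Placeholder values throughout; substitute your own locations.
INPUTS="{\"location1\":\"project:bq_dataset.bq_table1\",\"location2\":\"project:bq_dataset.bq_table2\"}"
OUTPUTS="{\"location1\":\"gs://bucketname/output.csv\",\"location2\":\"project:bq_dataset.bq_output1\"}"
TEMP="gs://bucketname/dataprep/temp"  # hypothetical temp location

# ^*^ at the start switches the key=value separator from "," to "*",
# so the commas inside the JSON objects are left alone.
PARAMS="^*^inputLocations=${INPUTS}*outputLocations=${OUTPUTS}*customGcsTempLocation=${TEMP}"

# Print the command instead of running it; remove the leading "echo" to submit the job.
echo gcloud dataflow jobs run job-201807301630 \
  --gcs-location=gs://bucketname/dataprep/dataprep_template \
  --parameters="$PARAMS"
```

Building the value in variables first keeps each escaped object readable and makes it easy to check the final string before handing it to gcloud.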