As the title says: can Dataflow use VM instances that were already created, rather than temporarily created ones?
Question:
Answer 1:
After asking the OP about the reason for the request (which was then answered in a reply), I am going to offer the following as a potential answer:
The power behind Dataflow is achieving a high degree of parallelism when processing data pipelines. The back-story to the original request was that "something" worked when run with the local runner but not as desired when using Dataflow as the runner. This appears to have led the OP to think "we'll just run Dataflow using the local runner". In my opinion, that isn't a great idea. One uses the local runner for development and unit testing. A local runner doesn't provide any form of horizontal scaling ... it literally runs on just one machine.
When one runs a pipeline job on distributed Dataflow, it creates as many workers as needed to sensibly distribute the job across many machines. If the job then wishes to produce a result as file output, the question becomes "Where will that data be written?". The answer can't be a local file relative to where the Dataflow job was run because, by definition, the job was run across multiple machines and there is no notion of one machine being the "output". To solve this problem, data should be written to Google Cloud Storage, which is a common storage area visible to all machines.

The related question posed by the OP describes a potential problem with writing data to GCS as opposed to a local file (as with the local runner), but I believe that is the problem to be solved (i.e. how to write to centralized GCS storage correctly) rather than trying to use a single VM. Dataflow provides zero control over the nature of the Dataflow processing engines (workers). They are logically ephemeral and are "just there" to process Dataflow work.
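To make the distinction concrete, here is a minimal sketch of launching the same Apache Beam pipeline with the two runners. The script name, project, region, and bucket are hypothetical placeholders; the flags themselves (`--runner`, `--project`, `--region`, `--temp_location`) are standard Beam/Dataflow pipeline options, and `--output` is assumed to be an option the pipeline defines for its sink.

```shell
# Development / unit testing: DirectRunner runs on one machine,
# so a local output path works.
python my_pipeline.py \
  --runner=DirectRunner \
  --output=/tmp/results

# Production: DataflowRunner fans the work out across ephemeral workers,
# so output must go somewhere all workers can see -- i.e. GCS.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-gcp-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp \
  --output=gs://my-bucket/results
```

Note that even `--temp_location` must be a GCS path when using the Dataflow runner, for the same reason: there is no single local filesystem shared by the workers.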