Returning a large data structure from a Dataflow worker

Published 2019-07-31 16:04

Question:

I have a large graph (~100k vertices, ~1 million edges) being constructed in a DoFn. When I try to output that graph from the DoFn, execution gets stuck at c.output(graph);.

    public static class Prep extends DoFn<TableRow, MyGraph> {

        @Override
        public void processElement(ProcessContext c) {
            MyGraph graph = new MyGraph();
            // Graph creation logic runs very fast, no problem here

            LOG.info("Starting Graph Output");  // can see this in the logs
            c.output(graph); // outputs the graph from the DoFn
            LOG.info("Ending Graph Output"); // never see this in the logs
        }
    }

My graph class is just a Map of vertices, serialized with AvroCoder.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.avro.reflect.Nullable;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import com.X.Prep;
    import com.google.cloud.dataflow.sdk.coders.AvroCoder;
    import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

    // Class that creates the Graph data structure for custom seg definitions
    @DefaultCoder(AvroCoder.class)
    public class MyGraph {
      @Nullable
      public Map<String, GraphVertex> vertexList = new HashMap<String, GraphVertex>();
    }

I have tried json-simple, Gson, and Jackson JSON serialization; all of them take too long to serialize this graph.

Answer 1:

The graph object is likely too large to be encoded and passed around as a single element. You should explore other mechanisms for getting the graph to the workers, for example a multimap-valued side input keyed by vertex. That way the graph lives in a PCollection of key-value pairs, which can be built and consumed in parallel, instead of in one enormous element. A sketch of that idea follows.

Alternatively, since the graph creation logic runs very fast, just run that logic on each worker rather than trying to serialize the entire graph and ship it between them, as sketched below.