Returning a large data structure from Dataflow wor

2019-07-31 16:09发布

I have large graph ~100k vertices and ~1 million edges being constructed in a DoFn function. When I try to output that graph in DoFn function execution gets stuck at c.output(graph);.

    public static class Prep extends DoFn<TableRow, TableRows> {

        @Override
        public void processElement(ProcessContext c) {
            //Graph creation logic runs very fast, no problem here

            LOG.info("Starting Graph Output");  // can see this in logs
            c.output(graph); //outputs data from DoFn function
            LOG.info("Ending Graph Output"); // never see this logs
    }
  }

My graph class is just a Map of vertices being serialized with AvroCoder.

import org.apache.avro.reflect.Nullable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.X.Prep;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.coders.DefaultCoder;

//Class that creates Graph data structure for custom seg definitions 
@DefaultCoder(AvroCoder.class)
public class MyGraph {
  @Nullable
  public Map<String,GraphVertex> vertexList = new HashMap<String,GraphVertex>(); 
}

I have tried json-simple, gson, jackson json serialization all of them take too long to serialize this graph.

1条回答
▲ chillily
2楼-- · 2019-07-31 16:38

The graph object is likely too large to be encoded and passed around as an element. You should explore other mechanisms for getting the graph to workers. For example, creating a multi-map-valued side input (keyed by vertex). This would allow you to have a PCollection (processed in parallel).

Alternatively, since the graph creation logic runs very fast just run that logic on each worker, rather than trying to serialize the entire graph.

查看更多
登录 后发表回答