I am moving from MPI-based systems to Apache Spark, and I need to do the following in Spark.

Suppose I have n vertices. I want to create an edge list from these n vertices. An edge is just a tuple of two integers (u, v); no attributes are required. However, I want to create the edges in parallel, independently in each executor. Therefore, I want to create P edge arrays independently for P Spark executors.
Each array may have a different size and depends on the vertices, so I also need the executor id, from 0 to P-1. Next, I want to have a single global RDD of edges.
In MPI, I would create an array in each processor using the processor rank. How do I do that in Spark, especially using the GraphX library? My primary goal, therefore, is to create an array of edges in each executor and combine them into one single RDD.
I am first trying a modified version of the Erdős–Rényi model. As parameters I have only the number of nodes n and a probability p.
Suppose executor i has to process nodes from 101 to 200. For any node, say node 101, it will create an edge from 101 to each of the nodes 102 through n, each with probability p. After each executor has created its allocated edges, I would instantiate the GraphX EdgeRDD and VertexRDD. Therefore, my plan is to create the edge lists independently in each executor and merge them into a single RDD.
Let's start with some imports and variables that will be required for downstream processing:
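A minimal sketch, assuming the Scala API with an active `SparkContext` in scope as `sc`; the concrete values of `nPartitions`, `n`, and `p` below are placeholders:

```scala
import scala.util.Random

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph}

// Placeholder parameters; substitute your own values.
val nPartitions: Int = 4  // number of partitions ("P" in the question)
val n: Long = 1000L       // number of vertices
val p: Double = 0.01      // edge probability
```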
Next we'll need an RDD of seed IDs which can be used to generate edges. A naive way to handle this would be something like this:
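For example (names here are illustrative):

```scala
// Distribute the seed ids [0, n) across partitions in contiguous blocks.
val naiveIds: RDD[Long] = sc.range(0L, n, numSlices = nPartitions)
```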
Since the number of generated edges depends on the node id, this approach would give a highly skewed load. We can do a little better with repartitioning:
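```scala
// A full shuffle mixes cheap and expensive ids across partitions,
// at the cost of moving every id over the network.
val shuffledIds: RDD[Long] = naiveIds.repartition(nPartitions)
```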
but a much better approach is to start with an empty RDD and generate the ids in place. We'll need a small helper:
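One way to write such a helper (a sketch; `genNodeIds` is an illustrative name):

```scala
// For partition index i, emit every id in [0, n) congruent to i modulo nPartitions.
// Striding the ids interleaves cheap and expensive nodes within every partition.
def genNodeIds(nPartitions: Int, n: Long)(i: Int): Iterator[Long] =
  (i.toLong until n by nPartitions.toLong).iterator
```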
which can be used as follows:
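```scala
// Start from an empty RDD with the desired number of partitions and
// generate the ids locally inside each partition; no shuffle is involved.
val ids: RDD[Long] = sc
  .parallelize(Seq.empty[Int], nPartitions)
  .mapPartitionsWithIndex((i, _) => genNodeIds(nPartitions, n)(i))
```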
Just a quick sanity check (it is quite expensive, so don't use it in production):
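```scala
// Every id in [0, n) should be present exactly once.
require(ids.distinct().count() == n)
```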
and we can generate actual edges using another helper:
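A sketch of such a helper, together with its per-partition application (again, the names are illustrative):

```scala
// For a source node i, keep each candidate target j in (i, n) with probability p.
def genEdgesForId(p: Double, n: Long, random: Random)(i: Long): Iterator[Edge[Int]] =
  (i + 1 until n).iterator
    .filter(_ => random.nextDouble() < p)
    .map(j => Edge(i, j, 0))

// Seed the RNG with the partition index so each partition draws an
// independent yet reproducible stream of random numbers.
val edges: RDD[Edge[Int]] = ids.mapPartitionsWithIndex { (i, iter) =>
  val random = new Random(i)
  iter.flatMap(genEdgesForId(p, n, random))
}
```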
Finally we can create a graph:
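For instance, letting GraphX derive the vertices from the edge endpoints:

```scala
// fromEdges materializes a vertex for every id that occurs in some edge,
// assigning it the given default attribute.
val graph: Graph[Int, Int] = Graph.fromEdges(edges, defaultValue = 0)
```

Note that `Graph.fromEdges` only creates vertices that appear in at least one edge; if isolated vertices are needed as well, build a vertex RDD for all n ids and use the `Graph(vertices, edges)` constructor instead.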