I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.
City | SRC | DEST
Houston Kyle -> Benny
Houston Benny -> Charles
Houston Charles -> Denny
Omaha Carol -> Brian
etc.
I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?
edit:
I thought I could do something like
select connected_components(*) from dataframe
groupby city
where connected_components generates a list of items.
Suppose your data is like this
import org.apache.spark._
import org.graphframes._
val l = List(("Houston","Kyle","Benny"),("Houston","Benny","charles"),
("Houston","Charles","Denny"),("Omaha","carol","Brian"),
("Omaha","Brian","Daniel"),("Omaha","Sara","Marry"))
var df = spark.createDataFrame(l).toDF("city","src","dst")
Create a list of cities for which you want to run connected components
cities = List("Houston","Omaha")
Now run a filter on city column for every city in cities list, then create an edge and vertex dataframes from the resulting dataframe. Create a graphframe from these edge and vertices dataframes and run connected components algorithm
val cities = List("Houston","Omaha")
for(city <- cities){
val edges = df.filter(df("city") === city).drop("city")
val vert = edges.select("src").union(edges.select("dst")).
distinct.select(col("src").alias("id"))
val g = GraphFrame(vert,edges)
val res = g.connectedComponents.run()
res.select("id", "component").orderBy("component").show()
}
Output
| id| component|
+-------+------------+
| Kyle|249108103168|
|charles|249108103168|
| Benny|249108103168|
|Charles|721554505728|
| Denny|721554505728|
+-------+------------+
+------+------------+
| id| component|
+------+------------+
| Marry|858993459200|
| Sara|858993459200|
| Brian|944892805120|
| carol|944892805120|
|Daniel|944892805120|
+------+------------+