I'm trying to find the connected components for friends in a city. My data is a list of edges with an attribute of city.
City | SRC | DEST
Houston Kyle -> Benny
Houston Benny -> Charles
Houston Charles -> Denny
Omaha Carol -> Brian
etc.
I know the connectedComponents function of pyspark's GraphX library will iterate over all the edges of a graph to find the connected components and I'd like to avoid that. How would I do so?
edit: I thought I could do something like
select connected_components(*) from dataframe groupby city
where connected_components generates a list of items.
Suppose your data is like this
Create a list of cities for which you want to run connected components
cities = List("Houston","Omaha")
Now run a filter on city column for every city in cities list, then create an edge and vertex dataframes from the resulting dataframe. Create a graphframe from these edge and vertices dataframes and run connected components algorithm
Output