How to find the membership of vertices using GraphFrames

Posted 2019-08-17 08:49

Question:

My input dataframe is df:

    valx      valy 
1: 600060     09283744
2: 600131     96733110 
3: 600194     01700001

I want to create a graph, treating the two columns above as an edge list, and my output should list every vertex of the graph together with its membership.

I have tried GraphFrames in PySpark and the networkx library too, but I am not getting the desired results.

My output should look like the table below: all valx and valy values appear under V1 (as vertices), and their membership information appears under V2.

V1               V2
600060           1
96733110         1
01700001         3

I tried the following:

import networkx as nx
import pandas as pd

filelocation = r'Pathtodataframe df csv'

Panda_edgelist = pd.read_csv(filelocation)

g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
g2 = g.to_undirected()  # from_pandas_edgelist already returns an undirected Graph by default
list(g.nodes)

Answer 1:

I'm not sure whether asking the same question twice violates any rules here.

To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:

from graphframes import GraphFrame

#connectedComponents() needs a checkpoint directory; sc is the active SparkContext
sc.setCheckpointDir("/tmp/connectedComponents")


l = [
(  '600060'  , '09283744'),
(  '600131'  , '96733110'),
(  '600194'  , '01700001')
]

columns = ['valx', 'valy']

#this is your input dataframe 
edges = spark.createDataFrame(l, columns)

#GraphFrames requires two dataframes: an edge dataframe and a vertex dataframe.
#the edge dataframe needs at least two columns, labeled src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()

#the vertex dataframe requires at least one column labeled id
#(append .distinct() if a vertex can appear in more than one edge)
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()

g = GraphFrame(vertices, edges)

Output:

+------+--------+ 
|   src|     dst| 
+------+--------+ 
|600060|09283744| 
|600131|96733110| 
|600194|01700001| 
+------+--------+ 
+--------+ 
|      id| 
+--------+ 
|  600060| 
|  600131| 
|  600194| 
|09283744| 
|96733110| 
|01700001| 
+--------+

You wrote in the comments of your other question that the choice of community detection algorithm doesn't matter to you for now. Therefore I will pick connected components:

result = g.connectedComponents()
result.show()

Output:

+--------+------------+ 
|      id|   component| 
+--------+------------+ 
|  600060|163208757248| 
|  600131| 34359738368| 
|  600194|884763262976| 
|09283744|163208757248| 
|96733110| 34359738368| 
|01700001|884763262976| 
+--------+------------+
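
The component values above are just long identifiers; what matters is which vertices share the same value. For an edge list small enough to fit in memory, the same membership can be sketched in plain Python with a union-find structure. This is only an illustration, not what GraphFrames does internally: the function name component_membership is hypothetical, and components are numbered 1, 2, 3, ... in order of first appearance, which is an assumption about the desired labels.

```python
def component_membership(edges):
    """Map every vertex in an edge list to a small integer component label."""
    parent = {}

    def find(v):
        # Walk up to the root of v's tree, shortening the path along the way.
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components

    labels = {}      # root vertex -> 1, 2, 3, ... (order of first appearance)
    membership = {}  # vertex -> component label
    for v in parent:
        root = find(v)
        labels.setdefault(root, len(labels) + 1)
        membership[v] = labels[root]
    return membership


# Vertex ids are kept as strings so that leading zeros are preserved.
edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]
membership = component_membership(edges)
for vertex, component in membership.items():
    print(vertex, component)
```

Both endpoints of an edge always receive the same label, so the resulting dictionary has the V1/V2 shape the question asks for.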

Other community detection algorithms, such as label propagation (LPA), are described in the GraphFrames user guide.