I have a dataset with more than 50k nodes, and I am trying to extract possible edges and communities from it. I tried graph tools such as Gephi, Cytoscape, socnet and NodeXL to visualize the data and identify the edges and communities, but the node list is too large for those tools. Hence I am trying to write a script to extract the edges and communities. Besides the node Id, the columns are the connection start datetime, the end datetime and the GPS location.
Input:
Id,starttime,endtime,gps1,gps2
0022d9064bc,1073260801,1073260803,819251,440006
00022d9064bc,1073260803,1073260810,819213,439954
00904b4557d3,1073260803,1073261920,817526,439458
00022de73863,1073260804,1073265410,817558,439525
00904b14b494,1073260804,1073262625,817558,439525
00904b14b494,1073260804,1073265163,817558,439525
00904b14b494,1073260804,1073263786,817558,439525
00022d1406df,1073260807,1073260809,820428,438735
00022d1406df,1073260807,1073260878,820428,438735
00022d623dfe,1073260810,1073276346,819251,440006
00022d7317d7,1073260810,1073276155,819251,440006
00022d9064bc,1073260810,1073272525,819251,440006
00022d9064bc,1073260810,1073260999,819251,440006
00022d9064bc,1073260810,1073260857,819251,440006
0030650c9eda,1073260811,1073260813,820356,439224
00022d0e0cec,1073260813,1073262843,820187,439271
00022d176cf3,1073260813,1073260962,817721,439564
000c30d8d2e8,1073260813,1073260902,817721,439564
00904b243bc4,1073260813,1073260962,817721,439564
00904b2fc34d,1073260813,1073260962,817721,439564
00904b52b839,1073260813,1073260962,817721,439564
00904b9a5a51,1073260813,1073260962,817721,439564
00904ba8b682,1073260813,1073260962,817721,439564
00022d3be9cd,1073260815,1073261114,819269,439403
00022d80381f,1073260815,1073261114,819269,439403
00022dc1b09c,1073260815,1073261114,819269,439403
00022d36a6df,1073260817,1073260836,820761,438607
00022d36a6df,1073260817,1073260845,820761,438607
003065d2d8b6,1073260817,1073267560,817735,439757
00904b0c7856,1073260817,1073265149,817735,439757
00022de73863,1073260825,1073260879,817558,439525
00904b14b494,1073260825,1073260879,817558,439525
00904b312d9e,1073260825,1073260879,817558,439525
00022d15b1c7,1073260826,1073260966,820353,439280
00022dcbe817,1073260826,1073260966,820353,439280
I am trying to build an undirected graph, either weighted or unweighted. Any help with suggestions for coding is highly appreciated.
Thanks in advance
Use pandas to get the data into a pairwise node listing, where each row represents an edge that meets your edge criteria. Then load that listing into a networkx object for graph analysis.
The criteria for two nodes sharing an edge include:
- Same location. Assuming this means same gps1 AND gps2.
- "Near same start and end time". This is a little ambiguous. For the purposes of this answer I've reduced this criterion to "start time in the same 5-second interval". It shouldn't be too hard to extend the groupby approach I've taken here if you want to apply additional temporal conditions on edges.
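For completeness, here is a minimal sketch of how the data could be loaded before any of the steps in this answer. It assumes the file has the header shown in the question (Id, starttime, endtime, gps1, gps2); the in-memory sample and the renaming to ID, start and end are my own choices so that the names match the rest of this answer. Reading the Id column as a string is a precaution so that leading zeros are never lost.

```python
import io

import pandas as pd

# Three rows copied from the question's sample; a real run would pass a
# file path to read_csv instead of this in-memory buffer.
raw = io.StringIO(
    "Id,starttime,endtime,gps1,gps2\n"
    "00022d9064bc,1073260803,1073260810,819213,439954\n"
    "00904b4557d3,1073260803,1073261920,817526,439458\n"
    "00022de73863,1073260804,1073265410,817558,439525\n"
)

# Force the identifier column to str so IDs keep their leading zeros.
df = pd.read_csv(raw, dtype={"Id": str})

# Rename to the shorter column names used in the rest of this answer.
df = df.rename(columns={"Id": "ID", "starttime": "start", "endtime": "end"})
print(df.dtypes)
```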
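Since we want to manipulate data based on timestamps, convert start and end to datetime dtype. (This answer refers to the question's Id, starttime and endtime columns as ID, start and end.)
import pandas as pd
# the raw values are Unix timestamps in seconds
df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")
df.start.describe()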
count 35
unique 11
top 2004-01-05 00:00:13
freq 8
first 2004-01-05 00:00:01
last 2004-01-05 00:00:26
Name: start, dtype: object
df.head()
ID start end gps1 gps2
0 0022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03 819251 440006
1 00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10 819213 439954
2 00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40 817526 439458
3 00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50 817558 439525
4 00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25 817558 439525
The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:
near = "5s"
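To make the effect of this frequency concrete, here is a small sketch of how pd.Grouper buckets timestamps with closed="right" and label="right", the same settings used in the groupby in this answer. The first two timestamps come from the sample; the last two are made up to show where a bin boundary falls.

```python
import pandas as pd

near = "5s"

# 00:00:01 and 00:00:03 are from the sample; 00:00:05 and 00:00:06 are
# invented to show what happens on and just after a bin boundary.
demo = pd.DataFrame({
    "start": pd.to_datetime(
        [1073260801, 1073260803, 1073260805, 1073260806], unit="s"
    ),
})

# closed="right" makes each bin (t - 5s, t]; label="right" names it by t.
grouper = pd.Grouper(key="start", freq=near, closed="right", label="right")
bin_sizes = demo.groupby(grouper).size()
print(bin_sizes)
```

With these settings a start time that falls exactly on a boundary belongs to the bin that ends there, so two connections one second apart can still land in different bins. If that matters for your data, a wider frequency is the simplest adjustment.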
Now groupby location and start time to find connected nodes:
edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start",
                                freq=near,
                                closed="right",
                                label="right")],
                    as_index=False)
           .agg({"ID": ",".join,
                 "start": "min",
                 "end": "max"})
           .reset_index()
           .rename(columns={"index": "edge",
                            "start": "start_min",
                            "end": "end_max"})
        )
edges.ID = edges.ID.str.split(",")
edges.head()
edge gps1 gps2 ID \
0 0 817526 439458 [00904b4557d3]
1 1 817558 439525 [00022de73863, 00904b14b494, 00904b14b494, 009...
2 2 817558 439525 [00022de73863, 00904b14b494, 00904b312d9e]
3 3 817721 439564 [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...
4 4 817735 439757 [003065d2d8b6, 00904b0c7856]
start_min end_max
0 2004-01-05 00:00:03 2004-01-05 00:18:40
1 2004-01-05 00:00:04 2004-01-05 01:16:50
2 2004-01-05 00:00:25 2004-01-05 00:01:19
3 2004-01-05 00:00:13 2004-01-05 00:02:42
4 2004-01-05 00:00:17 2004-01-05 01:52:40
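If you also want the end time to count towards an edge, one possible extension is to bin both timestamps and group on both bins. The sketch below is only an illustration under my own assumptions: the node names a, b, c, the 60-second tolerance on the end time, and the start_bin and end_bin column names are all made up. It uses dt.ceil, which rounds each timestamp up to the next multiple of the frequency and so gives right-closed, right-labelled bins.

```python
import pandas as pd

# Hypothetical nodes a, b, c at one location; a and b disconnect together,
# c starts a second later but stays connected much longer.
df = pd.DataFrame({
    "ID": ["a", "b", "c"],
    "start": pd.to_datetime([1073260813, 1073260813, 1073260814], unit="s"),
    "end": pd.to_datetime([1073260962, 1073260962, 1073265000], unit="s"),
    "gps1": [817721, 817721, 817721],
    "gps2": [439564, 439564, 439564],
})

# Round each timestamp up to its bin edge; the tolerances are assumptions.
binned = df.assign(
    start_bin=df["start"].dt.ceil("5s"),
    end_bin=df["end"].dt.ceil("60s"),
)

# Nodes share an edge only if location, start bin and end bin all match.
groups = (
    binned.groupby(["gps1", "gps2", "start_bin", "end_bin"])["ID"]
    .agg(list)
    .reset_index()
)
print(groups)
```

Here a and b end up in one group while c gets a group of its own, even though all three start within the same 5-second bin.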
Each row now represents a unique edge category. ID is a list of the nodes that all share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for-loops. There's likely some Pandas-fu that can improve efficiency here.
Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.
from itertools import combinations
pairs = []
for e in edges.edge.values:
    # node list and shared attributes for this edge category
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton: keep the node, with None as its partner
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
    else:
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)
pairs_df.head()
edge nodeA nodeB gps1 gps2 start_min \
0 0 00904b4557d3 None 817526 439458 2004-01-05 00:00:03
1 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
2 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
3 1 00022de73863 00904b14b494 817558 439525 2004-01-05 00:00:04
4 1 00904b14b494 00904b14b494 817558 439525 2004-01-05 00:00:04
end_max
0 2004-01-05 00:18:40
1 2004-01-05 01:16:50
2 2004-01-05 01:16:50
3 2004-01-05 01:16:50
4 2004-01-05 01:16:50
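Notice that the same ID can appear several times within one group, which is why the listing above repeats a pair and even pairs a node with itself. If you want a weighted graph, one option (a sketch of my own, not part of the approach above) is to deduplicate the nodes in each group and count how many groups each unordered pair shares. The count_pairs helper is hypothetical; the two larger groups below are taken from the sample rows.

```python
from collections import Counter
from itertools import combinations


def count_pairs(groups):
    """Count, for every unordered node pair, how many groups contain both."""
    weights = Counter()
    for nodes in groups:
        # A set removes repeated IDs, so a node is never paired with itself;
        # sorting gives every pair one canonical orientation.
        unique_nodes = sorted(set(nodes))
        weights.update(combinations(unique_nodes, 2))
    return weights


groups = [
    ["00022de73863", "00904b14b494", "00904b14b494", "00904b14b494"],
    ["00022de73863", "00904b14b494", "00904b312d9e"],
    ["00904b4557d3"],  # singleton: contributes no pairs
]
weights = count_pairs(groups)
for (node_a, node_b), weight in weights.items():
    print(node_a, node_b, weight)
```

The resulting counts can serve directly as a weight edge attribute, and dropping them gives the unweighted version of the same graph.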
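Now the data can be fit to a networkx object:
import networkx as nx
# from_pandas_edgelist is the networkx 2.x name for from_pandas_dataframe (1.x).
# Singleton rows (nodeB is None) are left out, since newer releases reject None as a node.
g = nx.from_pandas_edgelist(pairs_df.dropna(subset=["nodeB"]), "nodeA", "nodeB", edge_attr=True)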
# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
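If you want singleton nodes to appear in the graph, a None partner is not a good fit: as far as I know, recent networkx releases refuse None as a node. A minimal sketch of one way around that is to build the edges from the rows that have a partner and then add the singletons as isolated nodes. The tiny pairs_df and its weight column are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Made-up pair listing: two real edges and one singleton (nodeB is None).
pairs_df = pd.DataFrame({
    "nodeA": ["00022de73863", "00022de73863", "00904b4557d3"],
    "nodeB": ["00904b14b494", "00904b312d9e", None],
    "weight": [2, 1, 0],
})

has_partner = pairs_df["nodeB"].notna()

# Build the undirected graph from rows that describe a real pair.
g = nx.from_pandas_edgelist(
    pairs_df[has_partner], "nodeA", "nodeB", edge_attr="weight"
)

# Add singletons as isolated nodes so they are not lost.
g.add_nodes_from(pairs_df.loc[~has_partner, "nodeA"])

print(g.number_of_nodes(), g.number_of_edges())
```

Keep in mind that a plain Graph stores one edge per node pair, so if the same pair occurs in several rows only the attributes of the last row survive. That is why the lookup above returns the later start_min.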
For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.
I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
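As a starting point only, here is a sketch of one of the out-of-the-box options, greedy_modularity_communities from networkx.algorithms.community. The toy graph (two triangles joined by a single edge, with made-up node names) is an assumption chosen so the result is easy to check by eye. It is not a recommendation of this particular algorithm for your data.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph with made-up nodes: two triangles joined by one bridge edge.
g = nx.Graph()
g.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first triangle
    ("d", "e"), ("e", "f"), ("d", "f"),   # second triangle
    ("c", "d"),                           # bridge between them
])

# Returns a list of frozensets, one per detected community.
communities = greedy_modularity_communities(g)
for number, members in enumerate(communities):
    print(number, sorted(members))
```

On a graph built from the pair listing you would pass g in the same way. If you stored pair counts as a weight attribute, check the documentation of your networkx version for the weight parameter.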