Real-world problem:
I have data on directors across many firms, but sometimes "John Smith, director of XYZ" and "John Smith, director of ABC" are the same person, sometimes they're not. Also "John J. Smith, director of XYZ" and "John Smith, director of ABC" might be the same person, or might not be. Often examination of additional information (e.g., comparison of biographical data on "John Smith, director of XYZ" and "John Smith, director of ABC") makes it possible to resolve whether two observations are the same person or not.
Conceptual version of the problem:
In that spirit, I am collecting data that will identify matching pairs. For example, suppose I have the following matching pairs: {(a, b), (b, c), (c, d), (d, e), (f, g)}. I want to use the transitivity of the relation "is the same person as" to generate the "connected components" {{a, b, c, d, e}, {f, g}}. That is, {a, b, c, d, e} is one person and {f, g} is another. (An earlier version of this question referred to "cliques", which are apparently something else; that would explain why find_cliques in networkx was giving the "wrong" results for my purposes.)
The following Python code does the job. But I wonder: is there a better (less computationally costly) approach (e.g., using standard or available libraries)?
There are examples here and there that seem related (e.g., Cliques in python), but these are incomplete, so I am not sure what libraries they are referring to or how to set up my data to use them.
Sample Python 2 code:
def get_cliques(pairs):
    from sets import Set  # Python 2's Set type (the sets module does not exist in Python 3)
    set_list = [Set(pairs[0])]
    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:
            # If either element is already in this group, absorb the pair.
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(Set(pair))
    # Caveat: a pair that bridges two already-created groups does not
    # merge them, so the ordering of the input pairs matters here.
    return set_list
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
print(get_cliques(pairs))
This produces the desired output: [Set(['a', 'c', 'b', 'e', 'd']), Set(['g', 'f'])].
Sample Python 3 code (this produces [{'a', 'c', 'b', 'e', 'd'}, {'g', 'f'}], with set elements shown in arbitrary order):
def get_cliques(pairs):
    set_list = [set(pairs[0])]
    for pair in pairs[1:]:
        matched = False
        for a_set in set_list:
            if pair[0] in a_set or pair[1] in a_set:
                a_set.update(pair)
                matched = True
                break
        if not matched:
            set_list.append(set(pair))
    return set_list
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
print(get_cliques(pairs))
DSM's comment made me look for set consolidation algorithms in Python. Rosetta Code has two versions of the same algorithm. Example use (the non-recursive version):
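The Rosetta Code listing isn't reproduced above; a sketch of the non-recursive consolidation idea (this is my reconstruction in that spirit, not the Rosetta Code text verbatim) is:

```python
def consolidate(sets):
    # Non-recursive set consolidation: repeatedly sweep the list, merging
    # any sets that share at least one element, until a full sweep makes
    # no merges. Unlike the scan-once approach above, this also merges
    # groups that a later pair bridges.
    sets = [set(s) for s in sets]
    merged = True
    while merged:
        merged = False
        results = []
        while sets:
            common, rest = sets[0], sets[1:]
            sets = []
            for s in rest:
                if s.isdisjoint(common):
                    sets.append(s)   # no overlap: keep for a later sweep
                else:
                    common |= s      # overlap: absorb into the group
                    merged = True
            results.append(common)
        sets = results
    return sets

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
print(consolidate(pairs))
```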
I don't believe (correct me if I'm wrong) that this is directly related to the maximum clique problem. Wikipedia's definition says that a clique "in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge". In this case, we want to find which nodes can reach each other (even indirectly).
I made a little sample. It builds a graph and traverses it looking for neighbors. This should be pretty efficient since each node is only traversed once when groups are formed.
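The answer's sample isn't reproduced above; a breadth-first traversal along the lines it describes (the names here are mine) could look like:

```python
from collections import defaultdict

def connected_components(pairs):
    # Build an undirected adjacency list, then grow each group by
    # traversing neighbors; every node is visited only once.
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen = set()
    components = []
    for node in graph:
        if node in seen:
            continue
        group = set()
        frontier = [node]
        while frontier:
            current = frontier.pop()
            if current in group:
                continue
            group.add(current)
            # Only follow neighbors we have not already grouped.
            frontier.extend(graph[current] - group)
        seen |= group
        components.append(group)
    return components

pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
print(connected_components(pairs))
```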
If your data set is best modeled as a graph and is really big, maybe a graph database such as Neo4j would be appropriate?
With networkx, building a graph from the pairs and asking for its connected components gives the same grouping. You'll have to benchmark the approaches to see which is fastest for your data.
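A minimal sketch of the networkx approach, using nx.connected_components (which yields the node set of each component):

```python
import networkx as nx

# Build an undirected graph from the matching pairs, then ask for its
# connected components; each component comes back as a set of nodes.
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e'), ('f', 'g')]
G = nx.Graph()
G.add_edges_from(pairs)
components = list(nx.connected_components(G))
print(components)
```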
OP: This works great! I have this in my PostgreSQL database now: just organize the pairs into a two-column table, then use array_agg() to pass them to the PL/Python function get_connected(). Thanks. (Note: I edited the answer, as I thought showing this step might be a helpful addendum, but it was too long for a comment.)
I tried an alternate implementation using dictionaries as lookups and may have gotten a small reduction in computational latency.
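The answer's exact listing isn't reproduced above; one plausible dictionary-lookup variant along those lines (the name get_cliques2 and the details are my reconstruction, not the answer's verbatim code) is:

```python
def get_cliques2(pairs):
    # Map each element directly to the set it belongs to, so membership
    # checks are O(1) dict lookups instead of a scan over all sets.
    element_to_set = {}
    for a, b in pairs:
        set_a = element_to_set.get(a)
        set_b = element_to_set.get(b)
        if set_a is None and set_b is None:
            new_set = {a, b}
            element_to_set[a] = element_to_set[b] = new_set
        elif set_a is None:
            set_b.add(a)
            element_to_set[a] = set_b
        elif set_b is None:
            set_a.add(b)
            element_to_set[b] = set_a
        elif set_a is not set_b:
            # The pair bridges two existing groups: merge the smaller
            # into the larger and repoint its members.
            if len(set_a) < len(set_b):
                set_a, set_b = set_b, set_a
            set_a |= set_b
            for element in set_b:
                element_to_set[element] = set_a
    # Deduplicate by identity to recover the distinct groups.
    seen_ids = set()
    groups = []
    for s in element_to_set.values():
        if id(s) not in seen_ids:
            seen_ids.add(id(s))
            groups.append(s)
    return groups
```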
And just to convince myself that it returns the right result (get_cliques1 here is your original Python 2 solution):

Timing info in seconds (with 10 million repetitions):
For the sake of completeness and reference, this is the full listing of both cliques.py and the get_times.py timing script:

So at least in this contrived scenario, there is a measurable speedup. It's admittedly not groundbreaking, and I'm sure I left some performance bits on the table in my implementation, but maybe it will help get you thinking about other alternatives?