So I have a matrix (list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.
What I'd like to know is how to determine all the possible combinations of words and documents where more than one word is in common with more than one document. The result might look something like:
[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']],
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ... ]
and so on for each possible combination. Would love a solution that provides the complete set of combinations, and also one that yields only the combinations that are not a subset of another -- so, from the example, not [[Docid_3, Docid_5], ['word1', 'word17']], since it's a complete subset of the first entry.
I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.
Thanks.
First, build a mapping from document ID to set of words -- your matrix of 0s and 1s is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus, presumably, the first item) and the "row headings" (docids) are the first items of each row (minus, presumably, the first row). Then (assuming Python 2.6 or better):
def makemap(matrix):
    im = iter(matrix)
    words = next(im)[1:]
    themap = {}
    for row in im:
        mapent = set()
        docid = row[0]
        for w, bit in zip(words, row[1:]):
            try:
                if bit: mapent.add(w)
            except:
                print 'w is %r' % (w,)
                raise
        themap[docid] = mapent
    return themap
Now you need to check all feasible subsets of documents -- the total number of subsets is huge, so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning, of course.
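Just to make the contrast concrete, here's roughly what that unpruned brute force would look like -- a sketch only, workable just for tiny inputs, reusing makemap's output (bruteforce is a name I'm making up here):

import itertools

def bruteforce(mapping):
    """ naive version for contrast: try every docid subset of size >= 2 """
    docids = sorted(mapping)
    results = {}
    for n in range(2, len(docids) + 1):
        for ds in itertools.combinations(docids, n):
            # intersect the word sets of every doc in the candidate subset
            commonw = set.intersection(*[mapping[d] for d in ds])
            if len(commonw) >= 2:
                results[ds] = commonw
    return results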
I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this, of course) and make the first batch (those pairs which have 2+ words in common) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.
Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking that the total intersection is still 2+ long, of course). This at least does some pruning rather than blindly trying all 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).
It might be possible to do even better by recording, for each N-length subset, all docids that proved unable to extend it -- there is no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
There may be further tweaks for yet more pruning, but offhand none immediately comes to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal given those sequences are all O(N) vs the exponential size of the computed structures, although of course worth trying in the tuning/optimizing phase.)
def allfeasiblesubsets(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            for d, w in mapping.items():
                if d <= md or any(d in dont_bother[d1] for d1 in ds):
                    continue
                commonw = w.intersection(ws)
                if len(commonw) >= 2:
                    feasible_Np1[ds+(d,)] = commonw
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]
You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others. This may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also trying to keep all keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.
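One possible (untested) sketch of that harder "full" alternative -- recording, for each feasible subset, the docids that failed to extend it, so they're never retried on any superset derived from it (allfeasiblesubsets_fullprune is just a name I'm making up; for brevity it drops the pairwise dont_bother check, which could be layered back on top):

def allfeasiblesubsets_fullprune(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    failed = {}   # sorted docid tuple -> docids known not to extend it
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
            failed[d1, d2] = set()
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            extensions = {}
            for d in docids:
                if d <= md or d in failed[ds]:
                    continue
                commonw = mapping[d].intersection(ws)
                if len(commonw) >= 2:
                    extensions[d] = commonw
                else:
                    # d has <2 words in common with ds's intersection, so it
                    # can't extend any superset of ds either
                    failed[ds].add(d)
            for d, commonw in extensions.items():
                newds = ds + (d,)
                feasible_Np1[newds] = commonw
                failed[newds] = set(failed[ds])   # supersets inherit known failures
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]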
Here's the small example I've tried (in the same file as these functions, after, of course, the import for itertools and collections):
mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]
mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs
The results, which appear OK, are:
{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]
but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
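For example, a rough (untested) sketch of such post-processing -- producing the nested-list shape from the question and keeping only the combinations that are not a subset of another one (maximal_nested is just a name I'm making up):

def maximal_nested(all_feasible):
    """ list of dicts from allfeasiblesubsets -> [[docids], [words]] pairs,
        dropping any entry contained in another (both docids and words) """
    flat = [(frozenset(ds), ws) for level in all_feasible
                                for ds, ws in level.items()]
    result = []
    for ds, ws in flat:
        dominated = any(ds < ds2 and ws <= ws2 for ds2, ws2 in flat)
        if not dominated:
            result.append([sorted(ds), sorted(ws)])
    return result

On the small example above this should keep, e.g., ('due', 'qua', 'tre') with set(['lo', 'ci']) and drop ('qua', 'tre') with the same two words, since the latter is a subset of the former.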
(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).
Take a look at the SO question "What tried and true algorithms for suggesting related articles are out there?":
https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there
For real problem sizes, say > 100 docs and 10000 words, get the nice bitarray module
(which says, by the way, that "the same algorithm in Python ... is about 20 times slower than in C").
On "only the combinations that are not a subset of another":
define a hit22 as a 2x2 submatrix with 11 11,
a hit23 as a 2x3 submatrix with 111 111 (2 docs, 3 words in common), and so on.
A given hit22 may be in many hit2n s — 2 docs, n words,
and also in many hitn2 s — n docs, 2 words. Looks fun.
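A quick (untested) sketch of enumerating hit22s that way, assuming docs maps docid -> bitarray as in the sketch above:

import itertools

def hit22s( docs ):
    """ -> ((docid, docid), (word index, word index)) for every all-ones 2x2 """
    ids = sorted(docs)
    for j, k in itertools.combinations(ids, 2):
        both = docs[j] & docs[k]
        common = [ w for w in range(len(both)) if both[w] ]   # shared word columns
        # every pair of shared word columns of a doc pair is one hit22
        for w1, w2 in itertools.combinations(common, 2):
            yield (j, k), (w1, w2)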
Added Monday 14Jun: little functions using bitarray.
(An intro / Python modules for real doc classification? Dunno.)
""" docs-words with bitarray, randombits """
# google "document classification" (tutorial | python) ...
# https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there
from __future__ import division
import random
import sys
from bitarray import bitarray # http://pypi.python.org/pypi/bitarray
__date__ = "14jun 2010 denis"
ndoc = 100
nbits = 1000
exec "\n".join( sys.argv[1:] ) # run this.py ndoc= ...
random.seed(1)
me = __file__.split('/') [-1]
print "%s ndoc=%d nbits=%d" % (me, ndoc, nbits)
# bitarray stuff --
def bitslist( bits ):
    """ 011001 -> [1,2,5] """
    return [ j for j in range(len(bits)) if bits[j] ]

hex_01 = {
    "0": "0000", "1": "0001", "2": "0010", "3": "0011",
    "4": "0100", "5": "0101", "6": "0110", "7": "0111",
    "8": "1000", "9": "1001", "a": "1010", "b": "1011",
    "c": "1100", "d": "1101", "e": "1110", "f": "1111",
    }

def to01( x, len_ ):
    x = "%x" % x
    s = "".join( hex_01[c] for c in x )
    return (len_ - len(s)) * "0" + s

def randombits( nbits ):
    """ -> bitarray 1/16 1, 15/16 0 """
    hibit = 1 << (nbits - 1)
    r = (random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 ))  # prob 1/16
    return bitarray( to01( r, nbits ))
#...............................................................................
doc = [ randombits(nbits) for j in range(ndoc) ] # ndoc x nbits
def mostsimilarpair():
    """ -> (sim, j, k) most similar pair of docs """
    mostsim = (-1,-1,-1)
    for j in range(ndoc):
        for k in range(j+1, ndoc):
            # allpairs[j,k] -> scipy.cluster.hier ?
            sim = (doc[j] & doc[k]).count()  # nr bits (words) in common, crude
            mostsim = max( mostsim, (sim,j,k) )
    return mostsim
sim, jdoc, kdoc = mostsimilarpair()
print "The 2 most similar docs:" ,
print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) ,
print "doc %d has %d," % ( kdoc, doc[kdoc].count() )
print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] ))
print ""
#...............................................................................
def docslike( jdoc, thresh ):
    """ -> (doc index, sim >= thresh) ... """
    for j in range(ndoc):
        if j == jdoc: continue
        sim = (doc[j] & doc[jdoc]).count()
        if sim >= thresh:
            yield (j, sim)
thresh = sim // 2
print "Docs like %d, with >= %d words in common:" % (jdoc, thresh)
for j, sim in docslike( jdoc, thresh ):
    print "%3d %s" % ( j, bitslist( doc[j] & doc[jdoc] ))
"""
The 2 most similar docs: doc 72 has 66 words, doc 84 has 60,
12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872]
Docs like 72, with >= 6 words in common:
2 [3, 171, 258, 309, 592, 962]
...
"""