So I have a matrix (list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.
What I'd like to know is how to determine all the possible combinations of words and documents where more than one word is in common with more than one document. The result might look something like:
[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']],
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ... ]
and so on for each possible combination. Would love a solution that provides the complete set of combinations, and also one that yields only the combinations that are not a subset of another -- so, from the example, not [[Docid_3, Docid_5], ['word1', 'word17']], since it's a complete subset of the first entry.
I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.
Thanks.
First, build a mapping from document ID to set of words -- your matrix of 0s and 1s is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus, presumably, the first item) and the "row headings" (docids) are the first items of each row (minus, presumably, the first row). Then (assuming Python 2.6 or better):
def makemap(matrix):
    im = iter(matrix)
    words = next(im)[1:]
    themap = {}
    for row in im:
        mapent = set()
        docid = row[0]
        for w, bit in zip(words, row[1:]):
            try:
                if bit: mapent.add(w)
            except:
                print 'w is %r' % (w,)
                raise
        themap[docid] = mapent
    return themap
Now you need to check all feasible subsets of documents -- the total number of subsets is huge, so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning, of course.
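Just to make the contrast concrete, here's roughly what that unpruned brute force would look like -- a sketch only, workable just for tiny inputs, reusing makemap's output (bruteforce is a name I'm making up here):

import itertools

def bruteforce(mapping):
    """ naive version for contrast: try every docid subset of size >= 2 """
    docids = sorted(mapping)
    results = {}
    for n in range(2, len(docids) + 1):
        for ds in itertools.combinations(docids, n):
            # intersect the word sets of every doc in the candidate subset
            commonw = set.intersection(*[mapping[d] for d in ds])
            if len(commonw) >= 2:
                results[ds] = commonw
    return results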
I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this, of course) and make the first batch (those pairs which have 2+ words in common) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.
Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking that the total intersection is still 2+ long, of course). This at least does some pruning rather than blindly trying all 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).
It might be possible to do even better by recording, for each N-length subset, all docids that proved unable to extend it -- there is no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
There may be further tweaks for yet more pruning, but offhand none immediately comes to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal given those sequences are all O(N) vs the exponential size of the computed structures, although of course worth trying in the tuning/optimizing phase.)
def allfeasiblesubsets(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    dont_bother = dict((d, set([d])) for d in docids)
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
        else:
            dont_bother[d1].add(d2)
            dont_bother[d2].add(d1)
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            for d, w in mapping.items():
                if d <= md or any(d in dont_bother[d1] for d1 in ds):
                    continue
                commonw = w.intersection(ws)
                if len(commonw) >= 2:
                    feasible_Np1[ds+(d,)] = commonw
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]
You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others. This may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also trying to keep all keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.
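One possible (untested) sketch of that harder "full" alternative -- recording, for each feasible subset, the docids that failed to extend it, so they're never retried on any superset derived from it (allfeasiblesubsets_fullprune is just a name I'm making up; for brevity it drops the pairwise dont_bother check, which could be layered back on top):

def allfeasiblesubsets_fullprune(matrix):
    mapping = makemap(matrix)
    docids = sorted(mapping.keys())
    feasible_len2 = {}
    failed = {}   # sorted docid tuple -> docids known not to extend it
    for d1, d2 in itertools.combinations(docids, 2):
        commonw = mapping[d1].intersection(mapping[d2])
        if len(commonw) >= 2:
            feasible_len2[d1, d2] = commonw
            failed[d1, d2] = set()
    all_feasible = [feasible_len2]
    while all_feasible[-1]:
        feasible_Np1 = {}
        for ds, ws in all_feasible[-1].items():
            md = max(ds)
            extensions = {}
            for d in docids:
                if d <= md or d in failed[ds]:
                    continue
                commonw = mapping[d].intersection(ws)
                if len(commonw) >= 2:
                    extensions[d] = commonw
                else:
                    # d has <2 words in common with ds's intersection, so it
                    # can't extend any superset of ds either
                    failed[ds].add(d)
            for d, commonw in extensions.items():
                newds = ds + (d,)
                feasible_Np1[newds] = commonw
                failed[newds] = set(failed[ds])   # supersets inherit known failures
        all_feasible.append(feasible_Np1)
    return all_feasible[:-1]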
Here's the small example I've tried (in the same file as these functions, after, of course, the import for itertools and collections):
mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]
mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs
The results, which appear OK, are:
{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]
but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
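For example, a rough (untested) sketch of such post-processing -- producing the nested-list shape from the question and keeping only the combinations that are not a subset of another one (maximal_nested is just a name I'm making up):

def maximal_nested(all_feasible):
    """ list of dicts from allfeasiblesubsets -> [[docids], [words]] pairs,
        dropping any entry contained in another (both docids and words) """
    flat = [(frozenset(ds), ws) for level in all_feasible
                                for ds, ws in level.items()]
    result = []
    for ds, ws in flat:
        dominated = any(ds < ds2 and ws <= ws2 for ds2, ws2 in flat)
        if not dominated:
            result.append([sorted(ds), sorted(ws)])
    return result

On the small example above this should keep, e.g., ('due', 'qua', 'tre') with set(['lo', 'ci']) and drop ('qua', 'tre') with the same two words, since the latter is a subset of the former.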
(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).
Take a look at the SO question "What tried and true algorithms for suggesting related articles are out there?":
https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there
For real problem sizes, say > 100 docs and 10000 words, get the nice bitarray module
(which says, by the way, that "the same algorithm in Python ... is about 20 times slower than in C").
On "only the combinations that are not a subset of another":
define a hit22 as a 2x2 submatrix with 11 11,
a hit23 as a 2x3 submatrix with 111 111 (2 docs, 3 words in common), and so on.
A given hit22 may be in many hit2n s — 2 docs, n words,
and also in many hitn2 s — n docs, 2 words. Looks fun.
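A quick (untested) sketch of enumerating hit22s that way, assuming docs maps docid -> bitarray as in the sketch above:

import itertools

def hit22s( docs ):
    """ -> ((docid, docid), (word index, word index)) for every all-ones 2x2 """
    ids = sorted(docs)
    for j, k in itertools.combinations(ids, 2):
        both = docs[j] & docs[k]
        common = [ w for w in range(len(both)) if both[w] ]   # shared word columns
        # every pair of shared word columns of a doc pair is one hit22
        for w1, w2 in itertools.combinations(common, 2):
            yield (j, k), (w1, w2)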
Added Monday 14Jun: little functions using bitarray.
(An intro / Python modules for real doc classification? Dunno.)
""" docs-words with bitarray, randombits """
# google "document classification" (tutorial | python) ...
# https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there
from __future__ import division
import random
import sys
from bitarray import bitarray # http://pypi.python.org/pypi/bitarray
__date__ = "14jun 2010 denis"
ndoc = 100
nbits = 1000
exec "\n".join( sys.argv[1:] ) # run this.py ndoc= ...
random.seed(1)
me = __file__.split('/') [-1]
print "%s ndoc=%d nbits=%d" % (me, ndoc, nbits)
# bitarray stuff --
def bitslist( bits ):
    """ 011001 -> [1,2,5] """
    return [ j for j in range(len(bits)) if bits[j] ]

hex_01 = {
    "0": "0000", "1": "0001", "2": "0010", "3": "0011",
    "4": "0100", "5": "0101", "6": "0110", "7": "0111",
    "8": "1000", "9": "1001", "a": "1010", "b": "1011",
    "c": "1100", "d": "1101", "e": "1110", "f": "1111",
    }

def to01( x, len_ ):
    x = "%x" % x
    s = "".join( hex_01[c] for c in x )
    return (len_ - len(s)) * "0" + s

def randombits( nbits ):
    """ -> bitarray 1/16 1, 15/16 0 """
    hibit = 1 << (nbits - 1)
    r = (random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 ))  # prob 1/16
    return bitarray( to01( r, nbits ))
#...............................................................................
doc = [ randombits(nbits) for j in range(ndoc) ] # ndoc x nbits
def mostsimilarpair():
    """ -> (sim, j, k) most similar pair of docs """
    mostsim = (-1,-1,-1)
    for j in range(ndoc):
        for k in range(j+1, ndoc):
            # allpairs[j,k] -> scipy.cluster.hier ?
            sim = (doc[j] & doc[k]).count()  # nr bits (words) in common, crude
            mostsim = max( mostsim, (sim,j,k) )
    return mostsim
sim, jdoc, kdoc = mostsimilarpair()
print "The 2 most similar docs:" ,
print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) ,
print "doc %d has %d," % ( kdoc, doc[kdoc].count() )
print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] ))
print ""
#...............................................................................
def docslike( jdoc, thresh ):
    """ -> (doc index, sim >= thresh) ... """
    for j in range(ndoc):
        if j == jdoc: continue
        sim = (doc[j] & doc[jdoc]).count()
        if sim >= thresh:
            yield (j, sim)
thresh = sim // 2
print "Docs like %d, with >= %d words in common:" % (jdoc, thresh)
for j, sim in docslike( jdoc, thresh ):
    print "%3d %s" % ( j, bitslist( doc[j] & doc[jdoc] ))
"""
The 2 most similar docs: doc 72 has 66 words, doc 84 has 60,
12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872]
Docs like 72, with >= 6 words in common:
2 [3, 171, 258, 309, 592, 962]
...
"""