I have a sparse matrix that I obtained by using Sklearn's TfidfVectorizer object:
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', vocabulary=my_vocab, stop_words='english')
tfidf = vect.fit_transform([my_docs])
The sparse matrix is (taking out the numbers for generality):
<sparse matrix of type '<type 'numpy.float64'>'
with stored elements in Compressed Sparse Row format>]
I am trying to get a single numeric value for each row that tells me how strongly a document matched the terms I am looking for. I don't really care which words it contained, I just want to know how many it contained. So I want to get the norm of each row, or row*row.T. However, I am having a very hard time working with numpy to obtain this.
My first approach was to just simply do:
tfidf[i] * numpy.transpose(tfidf[i])
However, numpy will apparently not transpose an array with fewer than two dimensions, so that will just square the vector. So I tried doing:
tfidf[i] * numpy.transpose(numpy.atleast_2d(tfidf[0]))
But numpy.transpose(numpy.atleast_2d(tfidf[0])) still would not transpose the row.
I moved on to trying to get the norm of the row (that approach is probably better anyway). My initial approach was using numpy.linalg:
numpy.linalg.norm(tfidf[0])
But that gave me a "dimension mismatch" error. So I tried to calculate the norm manually. I started by just setting a variable equal to a numpy array version of the sparse matrix and printing out the len of the first row:
my_array = numpy.array(tfidf)
print my_array
print len(my_array[0])
It prints out my_array correctly, but when I try to access the len it tells me:
IndexError: 0-d arrays can't be indexed
I simply want to get a numeric value for each row in the sparse matrix returned by fit_transform. Getting the norm would be best. Any help here would be much appreciated.
Some simple fake data:
import numpy as np
from scipy import sparse
a = np.arange(9.).reshape(3,3)
s = sparse.csr_matrix(a)
To get the norm of each row from the sparse, you can use:
np.sqrt(s.multiply(s).sum(1))
And the renormalized s would be:
s.multiply(1/np.sqrt(s.multiply(s).sum(1)))
or to keep it sparse before renormalizing:
s.multiply(sparse.csr_matrix(1/np.sqrt(s.multiply(s).sum(1))))
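As a rough end-to-end sketch of the sparse approach above, the snippet below computes the row norms as a flat 1-D array and checks them against scipy.sparse.linalg.norm, which recent SciPy releases provide with an axis argument. The diagonal-matrix scaling is a swapped-in alternative to the multiply call, chosen here only because it reliably keeps the result sparse; it assumes no row is entirely zero. The variable names are just for illustration.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import norm as sparse_norm

a = np.arange(9.).reshape(3, 3)
s = sparse.csr_matrix(a)

# .sum(axis=1) on a sparse matrix returns a (3, 1) np.matrix, so flatten it.
row_norms = np.sqrt(np.asarray(s.multiply(s).sum(axis=1)).ravel())

# Cross-check against SciPy's own sparse norm (row-wise L2 norm).
row_norms_check = sparse_norm(s, axis=1)

# Scale every row by 1/norm with a diagonal matrix; the product stays sparse.
# Assumption: no all-zero rows, otherwise 1/0 produces inf here.
s_normalized = sparse.diags(1.0 / row_norms).dot(s).tocsr()
```

If the matrix can contain all-zero rows, the reciprocal would need to be guarded before building the diagonal matrix.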
To get an ordinary matrix or array from the sparse matrix, use:
m = s.todense()
a = s.toarray()
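This also explains the errors in the question. Wrapping a sparse matrix in numpy.array does not convert it; it only produces a 0-d object array holding the matrix, which is why indexing it fails. The sketch below, on the same fake data, contrasts that with toarray() and shows that row*row.T works if the transpose and product are done with the sparse matrix's own methods rather than numpy.transpose. The variable names are illustrative only.

```python
import numpy as np
from scipy import sparse

s = sparse.csr_matrix(np.arange(9.).reshape(3, 3))

wrapped = np.array(s)   # 0-d object array that merely holds the sparse matrix
dense = s.toarray()     # a real (3, 3) float array

row = s[1]              # still a 1 x 3 sparse matrix, not a 1-D vector
# Sparse transpose plus a matrix product gives row * row.T as a 1 x 1 matrix.
row_dot = row.dot(row.T).toarray()[0, 0]
```

Here row_dot is the squared norm of the second row, so taking its square root gives the same value as the norm expressions above.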
If you have enough memory for the dense version, you can get the norm of each row with:
n = np.sqrt(np.einsum('ij,ij->i',a,a))
or
n = np.apply_along_axis(np.linalg.norm, 1, a)
To normalize, you can do
an = a / n[:, None]
or, to normalize the original array in place:
a /= n[:, None]
The [:, None] index adds a new axis, turning n into a column vector so that the division broadcasts across each row.
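To make the broadcasting step concrete, here is a small sketch on the same fake data that shows the shapes involved. It also uses the axis argument of np.linalg.norm, which is available in reasonably recent NumPy versions, as a third way to get the row norms.

```python
import numpy as np

a = np.arange(9.).reshape(3, 3)

n = np.sqrt(np.einsum('ij,ij->i', a, a))   # row norms, shape (3,)
n_alt = np.linalg.norm(a, axis=1)          # same values via the axis argument

col = n[:, None]                           # shape (3, 1): one norm per row
an = a / col                               # (3, 3) / (3, 1) broadcasts row-wise
```

Without the extra axis, a / n would divide each column by the corresponding norm instead of each row.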
scipy.sparse is a great package, and it keeps getting better with every release, but a lot of things are still only half cooked, and you can get big performance improvements if you implement some of the algorithms yourself. For instance, a 7x improvement over @askewchan's implementation using scipy functions:
In [18]: a = sps.rand(1000, 1000, format='csr')
In [19]: a
Out[19]:
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
with 10000 stored elements in Compressed Sparse Row format>
In [20]: %timeit a.multiply(a).sum(1)
1000 loops, best of 3: 288 us per loop
In [21]: %timeit np.add.reduceat(a.data * a.data, a.indptr[:-1])
10000 loops, best of 3: 36.8 us per loop
In [24]: np.allclose(a.multiply(a).sum(1).ravel(),
...: np.add.reduceat(a.data * a.data, a.indptr[:-1]))
Out[24]: True
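One caveat with np.add.reduceat: when a row has no stored elements, its start index equals the next row's start, and reduceat then returns a single element of the data instead of 0 (or raises if the empty row is the last one). A TF-IDF matrix can easily have such rows for documents that contain none of the vocabulary. The helper below is a hypothetical sketch (the name csr_row_sq_norms is made up here) that runs reduceat only over the non-empty rows and leaves the others at zero.

```python
import numpy as np
from scipy import sparse

def csr_row_sq_norms(m):
    """Squared L2 norm of each row of a CSR matrix, tolerating empty rows."""
    sq = m.data * m.data
    out = np.zeros(m.shape[0], dtype=sq.dtype)
    nonempty = np.diff(m.indptr) > 0
    if sq.size:
        # Starts of non-empty rows delimit the right segments, because the
        # empty rows in between contribute no entries to m.data.
        out[nonempty] = np.add.reduceat(sq, m.indptr[:-1][nonempty])
    return out

dense = np.array([[1., 2., 0.],
                  [0., 0., 0.],
                  [0., 3., 4.],
                  [0., 0., 0.]])
m = sparse.csr_matrix(dense)
sq_norms = csr_row_sq_norms(m)
```

The extra mask costs a little time, so the plain reduceat call is still the faster choice when every row is known to be non-empty.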
You can similarly normalize the array in place doing the following:
norm_rows = np.sqrt(np.add.reduceat(a.data * a.data, a.indptr[:-1]))
nnz_per_row = np.diff(a.indptr)
a.data /= np.repeat(norm_rows, nnz_per_row)
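To see how those three lines fit together, here is the same recipe on a tiny matrix where the CSR arrays are easy to read. The matrix m is made up for illustration, and the sketch assumes every row has at least one stored element.

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[3., 0., 4.],
                                [0., 5., 0.],
                                [1., 2., 2.]]))

# m.data holds the stored values row by row; m.indptr marks where each
# row starts inside m.data.
norm_rows = np.sqrt(np.add.reduceat(m.data * m.data, m.indptr[:-1]))
nnz_per_row = np.diff(m.indptr)   # number of stored entries in each row

# Repeat each row norm once per stored entry so it lines up with m.data,
# then divide in place; no new sparse matrix is allocated.
m.data /= np.repeat(norm_rows, nnz_per_row)
```

Because only m.data is touched, the sparsity pattern (indices and indptr) is left exactly as it was.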
If you are going to use sparse matrices often, read the Wikipedia page on compressed sparse formats; you will often find better ways to do things than the defaults.