Using an item-similarity csr_matrix to get the items most similar to item X without converting the csr_matrix to a dense matrix

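I have purchase data (df_temp). I managed to replace the pandas DataFrame with a sparse csr_matrix, because I have a lot of products (89,000) for which I need the user-item information (bought or not bought) and then the similarities between products.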

First, I converted the pandas DataFrame to a NumPy array:

    import numpy as np
    from scipy import sparse
    df_user_product = df_temp[['user_id','product_id']].copy()
    ar1 = np.array(df_user_product.to_records(index=False))

Second, I created a coo_matrix, because it is known to be fast for building sparse matrices.

    rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)
    cols, c_pos = np.unique(ar1['user_id'], return_inverse=True)
    s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)))
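A minimal sketch on made-up data may help show what the two np.unique calls return. The toy IDs below are assumptions for illustration, not values from df_temp. With return_inverse=True, the second return value gives, for every purchase, the position of its ID in the sorted unique array, which is exactly the row or column index that coo_matrix needs.

```python
import numpy as np
from scipy import sparse

# Toy purchases (hypothetical IDs, for illustration only).
product_id = np.array([500, 200, 500, 900, 200])
user_id = np.array([11, 11, 42, 42, 7])

# rows/cols hold the sorted unique IDs; r_pos/c_pos hold, for every
# purchase, the position of its ID inside rows/cols.
rows, r_pos = np.unique(product_id, return_inverse=True)
cols, c_pos = np.unique(user_id, return_inverse=True)

# One stored 1 per purchase, placed at (product index, user index).
s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)))

print(rows)      # sorted unique product IDs
print(r_pos)     # matrix row index of each purchase
print(s.shape)   # (number of products, number of users)
```

Note that rows[r_pos] reproduces the original product_id column, so rows doubles as the lookup table from matrix row index back to product ID.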

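Third, for matrix computations it is better to use csr_matrix or csc_matrix, so I used csr_matrix: the product_id values are in the rows, and row slicing is more efficient with csr_matrix than with csc_matrix.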

    sparse_csr_mat = s.tocsr()
    sparse_csr_mat[sparse_csr_mat > 1] = 1
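The clipping step is needed because a user who bought the same product more than once produces duplicate (row, column) entries, and these are summed when the COO matrix is converted to CSR. The sketch below, on hypothetical data, shows this and an alternative way to binarize that only touches the stored values. It assumes every stored value is a positive count, which holds when the matrix is built from ones as above.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: the user in column 0 bought the product in row 0 twice.
r_pos = np.array([0, 0, 1])
c_pos = np.array([0, 0, 1])
s = sparse.coo_matrix((np.ones(r_pos.shape, int), (r_pos, c_pos)))

m = s.tocsr()          # duplicate (row, col) entries are summed here
before = m.toarray()   # tiny matrix, so densifying is fine for display

# Every stored entry is a count >= 1, so overwriting the stored values
# with 1 turns counts into a bought / not-bought flag.
m.data[:] = 1
after = m.toarray()

print(before)
print(after)
```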

Then I computed the cosine similarity between products and stored the result in similarities:

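    import sklearn.preprocessing as pp
    # axis=1 scales each row (one product) to unit L2 norm
    col_normed_mat = pp.normalize(sparse_csr_mat, axis=1)
    # row-normalized matrix times its transpose = cosine similarities
    similarities = col_normed_mat * col_normed_mat.T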
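To make the computation concrete, here is a small sketch that does the row normalization by hand with SciPy only and checks one entry against the textbook cosine formula. The 3x4 matrix is made up, and the manual normalization is a stand-in for pp.normalize(..., axis=1), not the code I actually ran. It assumes no product row is all zeros, which holds here because every product in the matrix has at least one purchase.

```python
import numpy as np
from scipy import sparse

# Hypothetical purchase matrix: 3 products (rows) x 4 users (columns).
m = sparse.csr_matrix(np.array([[1, 1, 0, 0],
                                [1, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=float))

# L2 norm of every row, computed without densifying the matrix.
norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())

# Scale each row to unit length (assumes no all-zero rows).
row_normed = sparse.diags(1.0 / norms) @ m

# Dot products of unit-length rows are cosine similarities; result is sparse.
sim = row_normed @ row_normed.T

# Cross-check one entry against the textbook formula.
a, b = m[0].toarray().ravel(), m[1].toarray().ravel()
expected = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(sim[0, 1], expected)
```

The diagonal of sim is 1 for every product, since each product is perfectly similar to itself. That matters later when picking the top 5.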

The result is:

<89447x89447 sparse matrix of type '<type 'numpy.float64'>'
    with 1332945 stored elements in Compressed Sparse Row format>

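Now I would like to end up with a dictionary that holds, for each product, its 5 most similar products. How can I do that? I don't want to convert the sparse matrix to a dense array because of memory constraints. But I also don't know whether a csr_matrix can be accessed the way an array can, where we check for example index = product_id and get all the rows where index = product_id. That way I would get all the products similar to product_id and could sort them by cosine similarity to get the 5 most similar ones.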

For example, a row in the similarities matrix:

(product_id1, product_id2) 0.45

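How can I keep only the X (in my case, 5) products most similar to product_id1 without converting the matrix to an array?

Looking at #1, I think lil_matrix could be used in this case? If so, how?

Thanks for your help!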

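I finally figured out how to get the 5 most similar items for each product: convert the matrix with .tolil(), then convert each row to a NumPy array and use argsort to get the 5 most similar items. I used the solution @hpaulj suggested in this link.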

    def max_n(row_data, row_indices, n):
        # positions of the n largest values in this row (ascending order)
        i = row_data.argsort()[-n:]
        # i = row_data.argpartition(-n)[-n:]
        top_values = row_data[i]
        top_indices = row_indices[i]  # do the sparse indices matter?

        return top_values, top_indices, i

Then I applied it to a single row to test it:

    # arr_ll is assumed here to be the LIL version of the matrix, e.g. similarities.tolil()
    top_v, top_ind, ind = max_n(np.array(arr_ll.data[0]),np.array(arr_ll.rows[0]),5)
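The same idea can be applied to every row without leaving CSR format, because a csr_matrix already stores each row as a contiguous slice of its indices and data arrays, delimited by indptr. The following is only a sketch of that alternative, not the @hpaulj solution. The helper name top_n_per_row and the toy 4x4 matrix are hypothetical, and dropping the diagonal entry (each product's similarity to itself) is my assumption about what "most similar" should mean here.

```python
import numpy as np
from scipy import sparse

def top_n_per_row(sim_csr, n):
    """Return {row index: [(col index, similarity), ...]} holding the n
    largest off-diagonal stored values of each row, best first."""
    result = {}
    for i in range(sim_csr.shape[0]):
        start, end = sim_csr.indptr[i], sim_csr.indptr[i + 1]
        idx = sim_csr.indices[start:end]   # column indices stored in row i
        val = sim_csr.data[start:end]      # matching similarity values
        keep = idx != i                    # assumption: skip the product itself
        idx, val = idx[keep], val[keep]
        order = np.argsort(val)[::-1][:n]  # positions of the n largest values
        result[i] = list(zip(idx[order].tolist(), val[order].tolist()))
    return result

# Hypothetical 4x4 similarity matrix, for illustration only.
sim = sparse.csr_matrix(np.array([[1.0, 0.5, 0.0, 0.2],
                                  [0.5, 1.0, 0.7, 0.0],
                                  [0.0, 0.7, 1.0, 0.0],
                                  [0.2, 0.0, 0.0, 1.0]]))
top = top_n_per_row(sim, 2)
print(top[0])
```

Only one row's stored entries are handled at a time, so memory use stays proportional to the number of non-zeros rather than to the full 89447x89447 shape. Rows with fewer than n neighbours simply yield a shorter list.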

What I need is top_indices, the indices of the 5 most similar products, but these indices are not the real product_id values. I mapped them when I built the coo_matrix:

    rows, r_pos = np.unique(ar1['product_id'], return_inverse=True)

But how can I get the real product_id back from the indices?

For example, now I have:

top_ind = [2 1 34 9 123]

How do I know which product_id index 2 corresponds to, which one index 1 corresponds to, and so on?
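One way to map back, given how np.unique works: the first array it returns (rows above) is the sorted list of unique product IDs, and the inverse positions used as matrix row indices are positions into that array. Indexing rows with the matrix indices should therefore recover the IDs. A small sketch with made-up IDs and indices (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical product IDs, standing in for ar1['product_id'].
product_id = np.array([5010, 77, 5010, 300, 9000, 77])
rows, r_pos = np.unique(product_id, return_inverse=True)

# rows[k] is the product_id that was assigned matrix row k,
# so indexing rows with matrix indices maps them back to IDs.
top_ind = np.array([2, 0, 3])       # made-up matrix indices
top_product_ids = rows[top_ind]
print(top_product_ids)
```

Because the similarity matrix is product by product, its column indices use the same numbering as its rows, so the same rows array works for the indices returned by max_n.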
