Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Scipy coo_matrix.max() alters data attribute

I am building a recommendation system using an open source library, LightFM. This library requires certain pieces of data to be in a sparse matrix format, specifically the scipy coo_matrix. It is here that I am encountering strange behavior. It seems like a bug, but it's more likely that I am doing something wrong.

Basically, I let LightFM.Dataset build me a sparse matrix, like so:

interactions, weights = dataset.build_interactions(data=_get_interactions_data())

The method, build_interactions, returns "Two COO matrices: the interactions matrix and the corresponding weights matrix" -- LightFM Official Doc.

When I inspect the contents of this sparse matrix (in practice, I use a debugger), like so:

for i in interactions.data:
    print(i, end=', ')

1, 1, 1, 1, 1, ....

It prints a long list of 1s, which indicates that the sparse matrix's nonzero elements are only 1s.

However, when I first check the max of the sparse matrix, it reports that the maximum value is not 1 but 3. Furthermore, printing the matrix's data after that check prints a long list of 1s, 2s, and 3s. This is the code for that:

print(interactions.max())
for i in interactions.data:
    print(i, end=', ')

3
1, 1, 3, 2, 1, 2, ...

Any idea what is going on here? Python is 3.6.8. Scipy is 1.5.4. CentOS7.

Thank you.



1 Answer


A 'raw' coo_matrix can hold duplicate elements (repeats of the same row and col values), and when it is converted to csr format for calculations those duplicates are summed. Finding the max evidently does the same summation, but in place on the coo_matrix itself, which is why the data attribute changes.

In [9]: from scipy import sparse
In [10]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
In [11]: M.data
Out[11]: array([1, 1, 1, 1, 1, 1])
In [12]: M.max()
Out[12]: 3
In [13]: M.data
Out[13]: array([3, 2, 1])
In [14]: M
Out[14]: 
<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in COOrdinate format>
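
If you want to know whether a matrix holds duplicates before anything sums them, one possible check is to compare the number of stored entries with the number of distinct (row, col) pairs. This is only a sketch on the toy matrix from the session above; the names coords, n_unique and has_duplicates are made up for illustration and are not part of scipy or LightFM.

```python
import numpy as np
from scipy import sparse

# Same toy matrix as in the session above: six stored entries, three distinct cells.
M = sparse.coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 2])))

# Stack the row and column indices so each column of `coords` is one (row, col) pair.
coords = np.vstack((M.row, M.col))

# Count the distinct pairs; this only reads M.row and M.col and does not modify M.
n_unique = np.unique(coords, axis=1).shape[1]

# Fewer distinct pairs than stored entries means some cell is stored more than once.
has_duplicates = n_unique < M.nnz
print(M.nnz, n_unique, has_duplicates)
```

For the interactions matrix in the question, the same comparison should show whether some (user, item) pairs appear more than once in the input data.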

Tracing through the max code, I find that it calls sum_duplicates:

In [33]: M = sparse.coo_matrix(([1,1,1,1,1,1],([0,0,0,0,0,0],[0,0,1,0,1,2])))
In [34]: M.data
Out[34]: array([1, 1, 1, 1, 1, 1])
In [35]: M.sum_duplicates?
Signature: M.sum_duplicates()
Docstring:
Eliminate duplicate matrix entries by adding them together

This is an *in place* operation
File:      /usr/local/lib/python3.8/dist-packages/scipy/sparse/coo.py
Type:      method
In [36]: M.sum_duplicates()
In [37]: M.data
Out[37]: array([3, 2, 1])
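
If the original data array needs to stay untouched, a possible workaround is to take the max of a copy, or of a CSR conversion, instead of the COO matrix itself. This is a minimal sketch on the same toy matrix, not something the LightFM docs prescribe; max_via_copy and max_via_csr are illustrative names.

```python
from scipy import sparse

# Toy matrix with duplicate (row, col) entries, as in the sessions above.
M = sparse.coo_matrix(([1, 1, 1, 1, 1, 1], ([0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 1, 2])))

# Option 1: call max() on a copy, so any in-place summing happens on the copy.
max_via_copy = M.copy().max()

# Option 2: convert to CSR first; tocsr() builds a new matrix with duplicates summed.
max_via_csr = M.tocsr().max()

# The original COO matrix should still hold its six separate entries.
print(max_via_copy, max_via_csr, M.data)
```

Both approaches cost an extra copy of the matrix, so for a large interactions matrix it may be simpler to call sum_duplicates() once, explicitly, and work with the summed values from then on.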
