You can use the built-in columnSimilarities()
method on a RowMatrix
, that can both calculate the exact cosine similarities, or estimate it using the DIMSUM method, which will be considerably faster for larger datasets. The difference in usage is that for the latter, you'll have to specify a threshold
.
Here's a small reproducible example:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
# Output
exact.entries.collect()
[MatrixEntry(0, 2, 0.991935352214),
MatrixEntry(1, 2, 0.998441152599),
MatrixEntry(0, 1, 0.997463284056)]
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…