While Anony-Mousse has some good points (clustering is indeed not classifying), I think the ability to assign new points has its usefulness.*
Based on the original DBSCAN paper and robertlayton's ideas on github.com/scikit-learn, I suggest running through the core points and assigning the new point to the cluster of the first core point that is within eps of it.
This guarantees that your point will be at least a border point of the assigned cluster, according to the definitions used for the clustering.
(Be aware that your point might be deemed noise and not assigned to any cluster.)
I've done a quick implementation:
import numpy as np
import scipy as sp
import scipy.spatial.distance  # load the submodule explicitly so sp.spatial.distance is available

def dbscan_predict(dbscan_model, X_new, metric=sp.spatial.distance.cosine):
    # Result is noise by default
    y_new = np.ones(shape=len(X_new), dtype=int) * -1

    # Iterate all input samples for a label
    for j, x_new in enumerate(X_new):
        # Find a core sample closer than EPS
        for i, x_core in enumerate(dbscan_model.components_):
            if metric(x_new, x_core) < dbscan_model.eps:
                # Assign label of x_core to x_new
                y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]
                break

    return y_new
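For completeness, here is a minimal usage sketch. The toy data from sklearn.datasets.make_blobs and the eps/min_samples values are assumptions for illustration only; since DBSCAN clusters with Euclidean distance by default, I pass a matching metric instead of the cosine default above.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy data and parameters, chosen only for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
dbscan_model = DBSCAN(eps=1.0, min_samples=5).fit(X)

# Label previously unseen points; -1 means "not within eps of any core sample"
X_new = np.array([[0.0, 0.0], [100.0, 100.0]])
y_new = dbscan_predict(dbscan_model, X_new,
                       metric=lambda a, b: np.linalg.norm(a - b))
print(y_new)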
The labels obtained by clustering (dbscan_model = DBSCAN(...).fit(X)) and the labels obtained from the same model on the same data (dbscan_predict(dbscan_model, X)) sometimes differ. I'm not quite certain whether this is a bug somewhere or a result of randomness.
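One quick way to quantify that disagreement (just a diagnostic I'd run, assuming the model was fitted with the default Euclidean metric):

# Compare labels assigned during fit() with labels re-derived by dbscan_predict on the same data
y_fit = dbscan_model.labels_
y_pred = dbscan_predict(dbscan_model, X, metric=lambda a, b: np.linalg.norm(a - b))
print(np.sum(y_fit != y_pred), "of", len(X), "points get a different label")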
EDIT: I think the above problem of differing prediction outcomes could stem from the possibility that a border point can be close to multiple clusters. Please update if you test this and find an answer. The ambiguity might be resolved by shuffling the core points every time, or by picking the closest core point instead of the first one (see the sketch below).
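Here is a sketch of that closest-core-point variant, untested against the problem above; dbscan_predict_nearest is my own name, not part of scikit-learn:

def dbscan_predict_nearest(dbscan_model, X_new, metric=sp.spatial.distance.cosine):
    # Same idea as dbscan_predict, but use the closest core sample within eps
    # instead of the first one encountered, which removes the order dependence.
    y_new = np.ones(shape=len(X_new), dtype=int) * -1

    if len(dbscan_model.components_) == 0:
        return y_new  # no core samples at all, so everything stays noise

    for j, x_new in enumerate(X_new):
        # Distance from the new point to every core sample
        dists = np.array([metric(x_new, x_core) for x_core in dbscan_model.components_])
        i = np.argmin(dists)
        if dists[i] < dbscan_model.eps:
            y_new[j] = dbscan_model.labels_[dbscan_model.core_sample_indices_[i]]

    return y_new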
*) Case at hand: I'd like to evaluate whether the clusters obtained from one subset of my data make sense for another subset, or are simply a special case.
If they generalise, that supports the validity of the clusters and of the earlier pre-processing steps.
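For that case, a rough check might look like the following; the 50/50 split and the fraction-assigned criterion are my own assumptions, not an established validity measure:

from sklearn.model_selection import train_test_split

# Cluster one half of the data, then see how the other half falls into those clusters
X_a, X_b = train_test_split(X, test_size=0.5, random_state=0)
model_a = DBSCAN(eps=1.0, min_samples=5).fit(X_a)
y_b = dbscan_predict(model_a, X_b, metric=lambda a, b: np.linalg.norm(a - b))

# Crude proxy for generalisation: how many held-out points land within eps of a core sample
print("fraction assigned to a cluster:", np.mean(y_b != -1))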