
machine learning - K-NN: training MSE with K=1 not equal to 0

In theory, the training MSE for k = 1 should be zero. However, the following script shows otherwise. I first generate some toy data: x represents hours of sleep and y represents happiness. Then I fit the model and predict on the training data. Finally, I compute the MSE on the training data. Can anyone tell me what is going wrong?

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: x = hours slept, y = happiness
x = np.array([7,8,6,7,5.7,6.8,8.6,6.5,7.8,5.7,9.8,7.7,8.8,6.2,7.1,5.7]).reshape(16,1)
y = np.array([5,7,4,5,6,9,7,6.8,8,7.6,9.3,8.2,7,6.2,3.8,6]).reshape(16,1)

model = KNeighborsRegressor(n_neighbors=1)
model.fit(x, y)

for hours_slept in range(1, 11):
    # predict() returns a (1, 1) array; extract the scalar before formatting
    happiness = model.predict([[hours_slept]])[0, 0]
    print("if you sleep %.0f hours, you will be %.1f happy!" % (hours_slept, happiness))


# calculate the training MSE by hand
def model_mse(model, x, y):
    predictions = model.predict(x)               # shape (16, 1), same as y
    return np.mean(np.power(y - predictions, 2))

print(model_mse(model, x, y))
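(As a sanity check, not part of my original script, the same number should come out of scikit-learn's built-in metric, assuming the model and data defined above:)

from sklearn.metrics import mean_squared_error
print(mean_squared_error(y, model.predict(x)))  # should match model_mse above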

The output:

if you sleep 1 hours, you will be 6.0 happy!
if you sleep 2 hours, you will be 6.0 happy!
if you sleep 3 hours, you will be 6.0 happy!
if you sleep 4 hours, you will be 6.0 happy!
if you sleep 5 hours, you will be 6.0 happy!
if you sleep 6 hours, you will be 4.0 happy!
if you sleep 7 hours, you will be 5.0 happy!
if you sleep 8 hours, you will be 7.0 happy!
if you sleep 9 hours, you will be 7.0 happy!
if you sleep 10 hours, you will be 9.3 happy!
0.15999999999999992  # strictly larger than 0!
Question from: https://stackoverflow.com/questions/66064161/k-nn-training-mse-with-k-1-not-equal-to-0


1 Answer


In theory, the training MSE for k = 1 should be zero

An implicit assumption here is that there are no duplicate samples x, or, more precisely, that identical features x always carry identical values y. Is that the case here? Let's check:

pred = model.predict(x)

np.where(pred != y)[0]
# array([9])

So, there is a single value where y and pred are indeed different:

y[9]
# array([7.6])

pred[9]
# array([6.])

where

x[9]
# array([5.7])

How many samples x have a value of 5.7, and what are the corresponding y's?

ind = np.where(x==5.7)[0]
ind
# array([ 4,  9, 15])

y[ind]
# result:
array([[6. ],
       [7.6],
       [6. ]])

pred[ind]
# result:
array([[6.],
       [6.],
       [6.]])

So, what is actually happening here is that, for x = 5.7, the algorithm unsurprisingly cannot decide unambiguously which sample is the single closest neighbor: all three samples at x = 5.7 are at distance zero, with y values of 6, 7.6, and 6. Here it has chosen a neighbor with y = 6, which does not coincide with the true y = 7.6 at index 9, leading to a non-zero MSE.
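The reported value is consistent with exactly one mismatched sample out of sixteen; a quick back-of-the-envelope check:

# a single wrong prediction (7.6 predicted as 6.0), averaged over 16 samples
print((7.6 - 6.0) ** 2 / 16)  # 0.16, modulo floating-point rounding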

I guess that, digging into the k-NN source code, one would be able to work out exactly how such ties are handled internally, but I'm leaving this as an exercise.
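One way to peek without reading the source is the estimator's kneighbors method, which reports the distance to and the training index of the selected neighbor. A minimal sketch, assuming the model and data above; which of the tied indices (4, 9, or 15) is returned may depend on the scikit-learn version:

# which training sample is treated as the nearest neighbor of x[9] (= 5.7)?
dist, ind = model.kneighbors(x[9].reshape(1, -1), n_neighbors=1)
print(dist, ind)  # distance 0.0; the index reveals which tied sample won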

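And as a minimal check of the implicit assumption above: if the duplicated x values are made to carry consistent y values, the k = 1 training MSE does drop to zero. This sketch reuses model_mse from the question:

# make the three samples at x = 5.7 agree on y, then refit
y_fixed = y.copy()
y_fixed[9] = 6.0
model_fixed = KNeighborsRegressor(n_neighbors=1).fit(x, y_fixed)
print(model_mse(model_fixed, x, y_fixed))  # 0.0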
