python - Worse results when training on entire dataset

After finalizing the architecture of my model, I decided to train it on the entire dataset by setting validation_split = 0 in fit(). I thought this would improve the results, based on these sources:

What is validation data used for in a Keras Sequential model?

Your model doesn't "see" your validation set and isn't in any way trained on it

https://machinelearningmastery.com/train-final-machine-learning-model/

What about the cross-validation models or the train-test datasets?

They’ve been discarded. They are no longer needed.
They have served their purpose to help you choose a procedure to finalize.

However, I got worse results without the validation set (compared to validation_split = 0.2), leaving all other parameters the same.

Is there an explanation for this? Or was it just chance that my model happened to perform better on the fixed test data when part of the training data was excluded (and used for validation)?
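For concreteness, here is a minimal sketch of the two runs being compared; the data, architecture, and hyperparameters below are placeholders for my actual setup, not the real model:

    import numpy as np
    from tensorflow import keras

    # Placeholder data standing in for the real dataset.
    X_train = np.random.rand(1000, 20)
    y_train = np.random.randint(0, 2, size=1000)

    def build_model():
        # Stand-in for the finalized architecture.
        model = keras.Sequential([
            keras.Input(shape=(20,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    # Run 1: hold out 20% of the training data for validation.
    model_a = build_model()
    model_a.fit(X_train, y_train, epochs=50, batch_size=32,
                validation_split=0.2, verbose=0)

    # Run 2: identical model, trained on the entire dataset.
    model_b = build_model()
    model_b.fit(X_train, y_train, epochs=50, batch_size=32,
                validation_split=0.0, verbose=0)

Both models are then evaluated on the same fixed test set, and the second run comes out worse.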

Question from: https://stackoverflow.com/questions/66046148/worse-results-when-training-on-entire-dataset


1 Answer


Well, that's a very good question; it touches on several machine-learning concepts, especially the bias-variance tradeoff.

As @CrazyBarzillian hinted in the comments, more data might be leading to over-fitting, and yes, we would need more information about your data to pin down the cause. But in broader terms, I would like to explain a few points that may help you understand why this happened.

EXPLANATION

  • When your data has a large number of features, your model learns a very complex function to fit it; in short, the model is too complicated for the amount of data available. This situation, known as high variance, leads to over-fitting. You know you are facing a high-variance issue when the training error is much lower than the test error (see the diagnostic sketch after this list). High-variance problems can be addressed by reducing the number of features (e.g. by applying PCA or removing outliers) or by increasing the number of data points, that is, adding more data.

  • Sometimes your data has few features, so the model learns a very simple function to fit it. This is known as high bias. In this case, adding more data won't help; less (but better) data will do the job, or adding more features will help.
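Here is a rough, self-contained sketch of the train-vs-test-error diagnostic mentioned above, using scikit-learn; the data and model are synthetic stand-ins, so substitute your own:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Synthetic stand-in data; replace with your own X, y and model.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
    model = Ridge().fit(X_tr, y_tr)

    train_err = mean_squared_error(y_tr, model.predict(X_tr))
    test_err = mean_squared_error(y_te, model.predict(X_te))
    print(f"train MSE={train_err:.3f}  test MSE={test_err:.3f}")

    # Test error much larger than training error -> high variance
    # (over-fitting). Both errors high -> high bias (under-fitting).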

MY ASSUMPTION

I suspect your model is suffering from high bias if it performs worse when given more data. But to check whether the statement "adding more data leads to poorer results" actually holds in your case, you can do the following:

  • play with some hyperparameters
  • try other machine-learning models
  • instead of accuracy scores, look at R² scores or mean absolute error for regression, or F1, precision, and recall for classification (see the sketch after this list)
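A brief sketch of those metrics using scikit-learn; the true and predicted values here are made up purely for illustration:

    from sklearn.metrics import (r2_score, mean_absolute_error,
                                 f1_score, precision_score, recall_score)

    # Regression: illustrative true targets and model predictions.
    y_true_reg = [3.0, 2.5, 4.1, 5.0]
    y_pred_reg = [2.8, 2.7, 3.9, 5.2]
    print("R2 :", r2_score(y_true_reg, y_pred_reg))
    print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))

    # Classification: illustrative true and predicted labels.
    y_true_cls = [0, 1, 1, 0, 1]
    y_pred_cls = [0, 1, 0, 0, 1]
    print("F1       :", f1_score(y_true_cls, y_pred_cls))
    print("Precision:", precision_score(y_true_cls, y_pred_cls))
    print("Recall   :", recall_score(y_true_cls, y_pred_cls))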

If, after trying these things, you still find that more data leads to poorer results, you can be fairly confident it is high bias, and can either increase the number of features or reduce the data.

SOLUTION

By reducing the data, I mean using less data but better data. For example, suppose you are working on a classification problem with three classes (A, B and C); better data would mean the data points are balanced across the three classes. Your data should be balanced. If it is unbalanced, that is, class A has a large number of samples while classes B and C have only 3-4 samples each, you can apply random sampling techniques to overcome it, as sketched below.
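A minimal sketch of random over-sampling with NumPy; the arrays are illustrative only, and libraries such as imbalanced-learn offer the same idea via RandomOverSampler:

    import numpy as np

    # Illustrative imbalanced data: class 0 dominates, 1 and 2 are rare.
    X = np.random.rand(110, 5)
    y = np.array([0] * 100 + [1] * 5 + [2] * 5)

    rng = np.random.default_rng(42)
    max_count = np.bincount(y).max()

    X_parts, y_parts = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        # Over-sample each class (with replacement) up to the majority count.
        resampled = rng.choice(idx, size=max_count, replace=True)
        X_parts.append(X[resampled])
        y_parts.append(y[resampled])

    X_balanced = np.concatenate(X_parts)
    y_balanced = np.concatenate(y_parts)
    print(np.bincount(y_balanced))  # every class now has 100 samples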

How to make BETTER DATA

  • Balance the data
  • Remove outliers
  • Scale (normalize) the data (see the sketch after this list for the last two steps)
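Here is a short sketch of outlier removal and scaling with NumPy and scikit-learn; the data is synthetic, and the 3-standard-deviation cutoff is just one common convention, not a rule:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(loc=50.0, scale=5.0, size=(200, 3))
    X[0] = [50.0, 50.0, 500.0]   # inject an obvious outlier

    # Drop rows more than 3 standard deviations from the column mean.
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    X_clean = X[(z < 3).all(axis=1)]

    # Scale the remaining data to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X_clean)
    print(X.shape, "->", X_clean.shape)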

CONCLUSION

It is a myth that more data always leads to a better model. In fact, quality matters as much as quantity; you need both. Balancing model complexity against the amount and quality of the data available is the essence of the bias-variance tradeoff.

