python - Creating train/test/val split with StratifiedKFold

Question

asked Mar 6, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow.

So, the DataFrame needs to be split and then stay that way.

I'm trying to do it like the following, using .values because I'm passing pandas DataFrames:

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

I read through all of the sklearn docs and ran the example code, but did not gain a better understanding of how to use stratified k-fold splits outside of an sklearn cross-validation scenario.

EDIT:

I also tried it like this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

Which seems to work, although I imagine I'm messing with the stratification by doing so.

Asked by tw0000 (translated from SO)

1 Answer

深蓝 · Answer 1 · 2021-03-06T04:26:02+0000

StratifiedKFold can only be used to split your dataset into two parts per fold.

You are getting an error because the split() method yields only a (train_index, test_index) tuple for each fold, so it cannot be unpacked into three names (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94 ).
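
To see this concretely, here is a minimal sketch on made-up toy data (the `X` and `y` below are illustrative assumptions, not the asker's DataFrame). It collects what split() yields and unpacks each item into exactly two index arrays:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data, purely illustrative: 12 samples, two balanced classes.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3, shuffle=False)

# Materialize the generator so the folds can be inspected.
folds = list(skf.split(X, y))

for fold in folds:
    # Each item is a 2-tuple: (train_index, test_index). There is no third array.
    train_index, test_index = fold
    print("TRAIN:", train_index, "TEST:", test_index)
```

Unpacking into three names, as in the question, is what raises the ValueError.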

For this use case you should first split your data into a validation set and the rest, and then split the rest again into test and train sets, as follows:

X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)
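
Since the question mentions that the DataFrame needs to be split and then stay that way, one option is to run the same two-step split on row indices and record the result in a column. The sketch below is only an illustration under assumptions: the toy frame `df`, the column names `feature`, `label` and `split`, and the fixed `random_state` are all hypothetical choices, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame, purely illustrative: 100 rows, 30% positive class.
df = pd.DataFrame({"feature": np.arange(100), "label": [1] * 30 + [0] * 70})

# Step 1: hold out 20% of the rows as validation, stratified on the label.
rest_idx, val_idx = train_test_split(
    df.index.to_numpy(), test_size=0.2, stratify=df["label"], random_state=0
)

# Step 2: split the remaining 80% into train (75%) and test (25%),
# stratified on the labels of the remaining rows only.
train_idx, test_idx = train_test_split(
    rest_idx, test_size=0.25, stratify=df.loc[rest_idx, "label"], random_state=0
)

# Record the assignment in a (hypothetical) "split" column so it persists.
df["split"] = "train"
df.loc[test_idx, "split"] = "test"
df.loc[val_idx, "split"] = "val"

# Share of the positive class in each subset.
print(df.groupby("split")["label"].mean())
```

Fixing `random_state` makes the assignment reproducible, and the frame (including the `split` column) can be saved and filtered later by any non-sklearn workflow.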
