Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Creating train/test/val split with StratifiedKFold

I'm trying to use StratifiedKFold to create train/test/val splits for use in a non-sklearn machine learning workflow.

So the DataFrame needs to be split once, and the split then has to stay that way.

I'm trying to do it as follows, using .values because I'm passing pandas DataFrames:

skf = StratifiedKFold(n_splits=3, shuffle=False)
skf.get_n_splits(X, y)

for train_index, test_index, valid_index in skf.split(X.values, y.values):
    print("TRAIN:", train_index, "TEST:", test_index,  "VALID:", valid_index)
    X_train, X_test, X_valid = X.values[train_index], X.values[test_index], X.values[valid_index]
    y_train, y_test, y_valid = y.values[train_index], y.values[test_index], y.values[valid_index]

This fails with:

ValueError: not enough values to unpack (expected 3, got 2).

I read through all of the sklearn docs and ran the example code, but I still don't understand how to use stratified k-fold splits outside of an sklearn cross-validation scenario.

EDIT:

I also tried this:

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)

# Create validation split from train split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.05)

This seems to work, although I suspect I'm disturbing the stratification by doing so.

  asked by tw0000 (translated from Stack Overflow)

1 Answer

StratifiedKFold can only be used to split your dataset into two parts per fold.

You are getting the error because the split() method yields only a 2-tuple of (train_index, test_index) for each fold (see https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/model_selection/_split.py#L94 ).
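To illustrate the point about split(), here is a minimal sketch. It assumes small synthetic NumPy arrays in place of the asker's X and y, and the 25% validation fraction and random_state values are arbitrary choices for illustration, not something from the original post. It unpacks each fold into two index arrays and then, as one possible workaround, carves a stratified validation set out of each fold's training indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (an assumption): 30 samples, two classes at a 2:1 ratio.
X = np.arange(60).reshape(30, 2)
y = np.array([0] * 20 + [1] * 10)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

folds = []
for pair in skf.split(X, y):
    # Each item is a 2-tuple of index arrays, so unpacking into three names fails.
    train_index, test_index = pair
    # Carve a stratified validation set out of this fold's training indices.
    train_index, valid_index = train_test_split(
        train_index, test_size=0.25, stratify=y[train_index], random_state=0
    )
    folds.append((train_index, test_index, valid_index))

for train_index, test_index, valid_index in folds:
    print(len(train_index), len(test_index), len(valid_index))
```

This produces three different train/test/val partitions, one per fold. If only a single fixed partition is needed, picking one fold or using train_test_split directly is simpler.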

For this use case, you should first split your data into validation and the rest, and then split the rest again into test and train, like this:

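from sklearn.model_selection import train_test_split
# Hold out 20% for validation, then 25% of the remaining 80% for test: 60/20/20 overall.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, train_size=0.8, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, train_size=0.75, stratify=y_rest)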
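Since the question asks for a split that stays fixed, the following is a rough sketch of one way to persist the two-stage split and check that stratification held. It assumes a synthetic DataFrame with hypothetical column names "feature" and "label", a fixed random_state for reproducibility, and a "split" column added purely for illustration. None of these names come from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 1000 rows, roughly 10% positives.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": (rng.random(1000) < 0.1).astype(int),
})

# Two-stage stratified split on the row labels, so the result can be stored.
rest_idx, val_idx = train_test_split(
    df.index.to_numpy(), test_size=0.2, stratify=df["label"], random_state=42
)
train_idx, test_idx = train_test_split(
    rest_idx, test_size=0.25, stratify=df.loc[rest_idx, "label"], random_state=42
)

# Record the assignment in the DataFrame so it can be saved and reused.
df["split"] = "train"
df.loc[test_idx, "split"] = "test"
df.loc[val_idx, "split"] = "val"

# With stratification, the positive rate in each split should be close to the overall rate.
print(df["split"].value_counts())
print(df.groupby("split")["label"].mean())
```

Passing stratify to both calls is what keeps the class balance in all three parts. In the asker's second attempt, the validation split omits stratify, so its class proportions are only approximately preserved by chance.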
