[ Scikit-learn ] Splitting train and test data (train_test_split, StratifiedShuffleSplit)


This post looks at two tools for splitting a dataset into train data and test data:

sklearn.model_selection.train_test_split and

sklearn.model_selection.StratifiedShuffleSplit 😊

✔ sklearn.model_selection.train_test_split

=> Splits the given arrays or matrices into random train and test subsets

sklearn.model_selection.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

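test_size and train_size control how much of the data goes to each set. A float between 0 and 1 is read as a proportion of the dataset, and an integer as an absolute number of samples.
If both are left as None, test_size defaults to 0.25 (train_size is then the complement, 0.75).
random_state fixes the shuffle so that the split is reproducible. Any integer works; 42 is just a common convention.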
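
To make the size parameters concrete, here is a minimal sketch that compares the default, a float test_size, and an integer test_size. The 100-sample toy arrays and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples with 2 features each
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

# Neither size given: test_size falls back to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
default_sizes = (len(X_tr), len(X_te))

# Float test_size: a proportion of the dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
float_sizes = (len(X_tr), len(X_te))

# Integer test_size: an absolute number of samples
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=10, random_state=42)
int_sizes = (len(X_tr), len(X_te))

print(default_sizes, float_sizes, int_sizes)
# (75, 25) (80, 20) (90, 10)
```

Only the sizes are checked here. Which rows land in each set depends on random_state.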

Example code)

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]

Let's split the X and y arrays into X_train, X_test, y_train, and y_test

>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]

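With test_size=0.33 on 5 samples, the test set size is rounded up to 2 samples, so the remaining 3 samples (roughly 2/3 of the dataset) go to the train data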
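
The signature also lists a stratify parameter, which the example above does not use. As a rough sketch, passing the labels as stratify=y should keep the class ratio the same in both parts. The imbalanced toy labels (80 of class 0, 20 of class 1) are an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data (assumption): 80 samples of class 0, 20 of class 1
X = np.arange(200).reshape((100, 2))
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

train_counts = np.bincount(y_train)  # samples per class in the train part
test_counts = np.bincount(y_test)    # samples per class in the test part
print(train_counts, test_counts)
# [60 15] [20  5]
```

Both parts keep the original 4:1 ratio, which a purely random split does not guarantee on small or imbalanced data.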

✔ sklearn.model_selection.StratifiedShuffleSplit

=> StratifiedShuffleSplit combines StratifiedKFold and ShuffleSplit. Unlike a plain random split, it performs a stratified random split that preserves the proportion of each class (stratum) found in the samples

sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, *, test_size=None, train_size=None, random_state=None)

Because this splitter is meant for cross-validation, split() yields n_splits train/test index pairs, one per cross-validation round

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedShuffleSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 0, 1, 1, 1])

>>> sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
>>> sss.get_n_splits(X, y)
5

>>> for train_index, test_index in sss.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]

TRAIN: [5 2 3] TEST: [4 1 0]
TRAIN: [5 1 4] TEST: [0 2 3]
TRAIN: [5 0 2] TEST: [4 3 1]
TRAIN: [4 1 0] TEST: [2 3 5]
TRAIN: [0 5 1] TEST: [3 4 2]
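
To see the stratification itself rather than just the indices, the sketch below counts the classes in every split. The 20-sample toy labels (16 of class 0, 4 of class 1) are an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumption): 16 samples of class 0 and 4 of class 1
X = np.zeros((20, 2))
y = np.array([0] * 16 + [1] * 4)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

split_counts = []
for train_index, test_index in sss.split(X, y):
    # Count how many samples of each class ended up on each side
    train_counts = np.bincount(y[train_index])
    test_counts = np.bincount(y[test_index])
    split_counts.append((train_counts.tolist(), test_counts.tolist()))
    print("TRAIN classes:", train_counts, "TEST classes:", test_counts)
```

Every split should keep the 4:1 class ratio on both sides (8 and 2 samples here), while the individual indices differ from split to split.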

To use the returned splits with a model, rebuild the train and test sets from the returned indices as below, then call fit and predict


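from sklearn.ensemble import RandomForestClassifier

# 4 stratified random splits, each holding out half of the data for testing
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5, random_state=0)
sss.get_n_splits(X, y)  # number of splits (4)

rf = RandomForestClassifier(n_estimators=40, max_depth=7)
for train_index, test_index in sss.split(X, y):
    # Rebuild the train/test sets from the returned indices
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    rf.fit(X_train, y_train)   # refit on this split's train part
    pred = rf.predict(X_test)  # predict on this split's test part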
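
The loop above overwrites pred on every iteration, so in practice you would usually score each split and then aggregate. Here is a minimal sketch of that idea. The two-blob toy dataset, the use of accuracy_score, and the fixed random_state on the forest are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumption): two Gaussian blobs, 50 samples per class
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.5, random_state=0)

scores = []
for train_index, test_index in sss.split(X, y):
    # A fresh model per split keeps the rounds independent
    rf = RandomForestClassifier(n_estimators=40, max_depth=7, random_state=0)
    rf.fit(X[train_index], y[train_index])
    pred = rf.predict(X[test_index])
    scores.append(accuracy_score(y[test_index], pred))

mean_score = float(np.mean(scores))
print(len(scores), mean_score)
```

If you only need the scores, cross_val_score(rf, X, y, cv=sss) from sklearn.model_selection accepts the splitter directly and runs the same kind of loop for you.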

Source:

https://www.geeksforgeeks.org/sklearn-stratifiedshufflesplit-function-in-python/

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
