How to merge train and test data for preprocessing

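When train and test data are preprocessed separately, scaling and encoding can come out inconsistent: for example, a category that appears in only one file produces a different set of encoded columns.

A common workaround is to concatenate train and test, preprocess them together, and then split them back into the original two sets.

Let's look at two ways to do that.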
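
To make the encoding problem concrete, here is a minimal sketch with made-up toy data (the `color` column and its values are assumptions for illustration only). It one-hot encodes the two frames separately and then together, and compares the resulting columns. The positional split with `iloc` is used here only to keep the demo short.

```python
import pandas as pd

# Hypothetical toy data: "green" appears only in test, "red" only in train
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

# Encoded separately, the two frames end up with different dummy columns
train_sep = pd.get_dummies(train)
test_sep = pd.get_dummies(test)
print(list(train_sep.columns))
print(list(test_sep.columns))

# Encoded after merging, both parts share a single column set
combined = pd.get_dummies(pd.concat([train, test]))
train_enc = combined.iloc[:len(train)]
test_enc = combined.iloc[len(train):]
print(list(train_enc.columns) == list(test_enc.columns))
```

A model fitted on `train_sep` would not accept `test_sep` as-is, because the feature columns do not line up.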

1. pd.concat + assign(indic)

Here indic is short for indicator: each row gets a tag, "test" or "train", that records which dataset it came from.

df = pd.concat([test.assign(indic="test"), train.assign(indic="train")])

test, train = df[df["indic"].eq("test")], df[df["indic"].eq("train")]

DataFrame.eq is the method form of ==: it compares each element with the value in parentheses and returns booleans.

For example, df == 10 and df.eq(10) are equivalent.
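
As a rough end-to-end sketch of this first approach, the snippet below tags, merges, applies one example preprocessing step, and splits again. The toy frames and the standardization step are assumptions for illustration; the tagging and splitting lines follow the pattern above.

```python
import pandas as pd

# Hypothetical toy frames; test has no target column "y"
train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]})
test = pd.DataFrame({"x": [4.0, 5.0]})

df = pd.concat([test.assign(indic="test"), train.assign(indic="train")])

# Example preprocessing step: standardize x with statistics from the merged data
df["x"] = (df["x"] - df["x"].mean()) / df["x"].std()

# .eq("test") builds the same boolean mask as == "test"
is_test = df["indic"].eq("test")
assert is_test.equals(df["indic"] == "test")

# Drop the tag after splitting; also drop "y" from test, where it is all NaN
test_out = df[is_test].drop(columns=["indic", "y"])
train_out = df[~is_test].drop(columns="indic")
```

Because `pd.concat` keeps each frame's original index by default, `train_out` and `test_out` come back with the same row labels they started with. Note that `y` is stored as float after the merge, since the test rows hold NaN in that column.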

2. df['marker_column'] = 'marker_value'

This approach adds a temporary marker column to each DataFrame before merging.

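import pandas as pd

test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

# Tag each row with its origin
test['type'] = "test"
train['type'] = "train"

df = pd.concat([test, train])

# preprocess is a placeholder for your own cleaning function;
# it is assumed here to return the processed DataFrame
df = preprocess(df)

# Split on the marker first, then drop it
# (dropping 'type' before the split would raise a KeyError)
train = df[df['type'] == "train"].drop(columns='type')

test = df[df['type'] == "test"].drop(columns='type')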

Once preprocessing is finished, split the data back into the original train and test sets.
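
The post does not define `preprocess`, so the sketch below fills it in with a hypothetical version (median imputation plus one-hot encoding) and in-memory toy frames instead of CSV files. The column names `age` and `city` are made up for the example.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: impute missing ages, then one-hot encode city."""
    out = df.copy()
    out["age"] = out["age"].fillna(out["age"].median())
    # Encode only the feature column so the 'type' marker is left untouched
    return pd.get_dummies(out, columns=["city"])

# Toy data standing in for train.csv and test.csv
train = pd.DataFrame({"age": [20.0, None, 40.0], "city": ["Seoul", "Busan", "Seoul"]})
test = pd.DataFrame({"age": [30.0, None], "city": ["Daegu", "Seoul"]})

train["type"] = "train"
test["type"] = "test"

df = preprocess(pd.concat([test, train]))

# Split on the marker, then remove it from each part
train_out = df[df["type"] == "train"].drop(columns="type")
test_out = df[df["type"] == "test"].drop(columns="type")

print(train_out)
print(test_out)
```

Both outputs share the same columns, and the missing ages in each part are filled with the median computed over the merged data.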

Source:

https://datascience.stackexchange.com/questions/81617/how-to-combine-and-separate-test-and-train-data-for-data-cleaning

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eq.html
