How to Create a Train and Test Set from a Pandas DataFrame

By Benjamin Anderson, July 19, 2023

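When fitting a machine learning model to a dataset, we often split the dataset into two sets:

1. Training set: Used to train the model (70-80% of the original dataset)

2. Testing set: Used to get an unbiased estimate of model performance (20-30% of the original dataset)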

In Python, there are two common ways to split a pandas DataFrame into a training set and a testing set:

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

The following examples show how to use each method with the following pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the train_test_split() function from sklearn to split the pandas DataFrame into training and testing sets:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

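From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that test_size controls the fraction of observations from the original DataFrame that will belong to the testing set, and the random_state value makes the split reproducible.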
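To make the roles of test_size and random_state concrete, the following is a minimal sketch rather than part of the original example. It rebuilds a DataFrame like the one above, then checks that two calls with the same random_state return identical sets and that a different test_size changes only the proportions. The value 0.25 and the names train_a, train_b, train_c (and their test counterparts) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a DataFrame like the one used in this tutorial (1,000 rows, 3 columns)
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

# Two calls with the same random_state should give identical splits
train_a, test_a = train_test_split(df, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=0)
same_split = train_a.equals(train_b) and test_a.equals(test_b)

# A different test_size (0.25 is an illustrative value) changes the proportions
train_c, test_c = train_test_split(df, test_size=0.25, random_state=0)

print(same_split)                   # True
print(train_c.shape, test_c.shape)  # (750, 3) (250, 3)
```

If random_state is left out, each call shuffles the rows differently, so the rows that land in each set will generally change from run to run.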

Example 2: Use sample() from pandas

The following code shows how to use the sample() function from pandas to split the pandas DataFrame into training and testing sets:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

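From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that frac controls the fraction of observations from the original DataFrame that will belong to the training set, and the random_state value makes the split reproducible.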
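Because the testing set here is built by dropping the sampled row labels, it can be worth checking that the two sets do not overlap and together cover every row. The sketch below is an illustration, not part of the original example. It also shows one caveat: drop() removes rows by index label, so the approach assumes the DataFrame index is unique. The DataFrame dup and the variable names are hypothetical, created only to demonstrate that caveat.

```python
import numpy as np
import pandas as pd

# Rebuild a DataFrame like the one used in this tutorial
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# No row label appears in both sets, and every row lands in exactly one set
no_overlap = train.index.intersection(test.index).empty
full_coverage = len(train) + len(test) == len(df)
print(no_overlap, full_coverage)  # True True

# Caveat: with duplicate index labels, drop() also removes unsampled rows
# that share a label with a sampled row, so rows go missing from the split
dup = pd.concat([df, df])  # hypothetical frame: every label appears twice
dup_train = dup.sample(frac=0.8, random_state=0)
dup_test = dup.drop(dup_train.index)
rows_lost = len(dup_train) + len(dup_test) < len(dup)

# Resetting the index first restores unique labels and a complete split
dup = dup.reset_index(drop=True)
fixed_train = dup.sample(frac=0.8, random_state=0)
fixed_test = dup.drop(fixed_train.index)
fixed_coverage = len(fixed_train) + len(fixed_test) == len(dup)
print(rows_lost, fixed_coverage)
```

With a default integer index, as in this tutorial, the labels are already unique and no reset is needed.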

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python

