How to Create a Training and Test Set from a Pandas DataFrame

By Dr. Benjamin Anderson, July 19, 2023

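When fitting a machine learning model to a dataset, we often split the dataset into two sets:

1. Training set: used to train the model (70-80% of the original dataset)

2. Test set: used to obtain an unbiased estimate of the model's performance (20-30% of the original dataset)

In Python, there are two common ways to split a pandas DataFrame into a training set and a test set: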

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

The following examples show how to use each method with the following pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the train_test_split() function from sklearn to split the pandas DataFrame into a training set and a test set:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

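From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Test set: 200 rows and 3 columns

Note that test_size controls the proportion of observations from the original DataFrame that go into the test set, and the random_state value makes the split reproducible.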
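As a quick check of these two parameters, the sketch below rebuilds a DataFrame shaped like the one in this example and confirms that repeating the split with the same random_state returns the same rows, and that a larger test_size moves more rows into the test set. The variable names (train_a, test_b, test_30 and so on) and the use of np.random.default_rng are illustrative choices, not part of the original example, so the values in this DataFrame differ from the ones shown above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative DataFrame with 1,000 rows (values differ from the article's df)
rng = np.random.default_rng(1)
df = pd.DataFrame({'x1': rng.integers(30, size=1000),
                   'x2': rng.integers(12, size=1000),
                   'y': rng.integers(2, size=1000)})

# Same random_state -> the same rows land in each set every time
train_a, test_a = train_test_split(df, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=0)
same_split = test_a.index.equals(test_b.index)

# A larger test_size moves more rows into the test set
train_30, test_30 = train_test_split(df, test_size=0.3, random_state=0)

print(same_split)                      # True
print(train_30.shape, test_30.shape)   # (700, 3) (300, 3)
```

Omitting random_state would generally give a different split on each run, which is why a fixed value is useful when results need to be repeated.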

Example 2: Use sample() from pandas

The following code shows how to use the sample() function from pandas to split the pandas DataFrame into a training set and a test set:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

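From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Test set: 200 rows and 3 columns

Note that frac controls the fraction of observations from the original DataFrame that go into the training set, and the random_state value makes the split reproducible.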
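One point worth checking with this approach: df.drop(train.index) removes rows by index label, so it relies on the index labels being unique. The sketch below uses an illustrative DataFrame and illustrative variable names (rng, overlap) to verify that the two sets do not overlap and together cover every row. If the index did contain duplicate labels, resetting it first with df.reset_index(drop=True) would be one way to satisfy that assumption.

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame with 1,000 rows (values differ from the article's df)
rng = np.random.default_rng(1)
df = pd.DataFrame({'x1': rng.integers(30, size=1000),
                   'x2': rng.integers(12, size=1000),
                   'y': rng.integers(2, size=1000)})

# drop() works on index labels, so confirm they are unique before splitting
assert df.index.is_unique

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# No row appears in both sets, and every row appears in exactly one of them
overlap = train.index.intersection(test.index)
print(len(overlap))                        # 0
print(len(train) + len(test) == len(df))   # True
```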

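Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python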

