How to Split Data into Training & Test Sets in R (3 Methods)

By Dr. Benjamin Anderson, July 19, 2023

Before fitting a machine learning algorithm to a dataset, we often split the dataset into a training set and a test set.

There are three common ways to split data into training and test sets in R:

Method 1: Use Base R

#make this example reproducible
set.seed(1)

#assign each row to the training set with probability 0.7, otherwise to the test set
sample <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7, 0.3))
train <- df[sample, ]
test <- df[!sample, ]

Method 2: Use the caTools Package

library(caTools)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
#(sample.split preserves the label proportions of the column passed to it)
sample <- sample.split(df$any_column_name, SplitRatio = 0.7)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)

Method 3: Use the dplyr Package

library(dplyr)

#make this example reproducible
set.seed(1)

#create ID column so test rows can be identified with an anti-join
df$id <- 1:nrow(df)

#use 70% of dataset as training set and 30% as test set
train <- df %>% dplyr::sample_frac(0.70)
test <- dplyr::anti_join(df, train, by = 'id')

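The following examples show how to use each method in practice with the built-in iris dataset in R.

Example 1: Split Data into Training and Test Sets Using Base R

The following code shows how to use base R to split the iris dataset into a training set and a test set, using 70% of the rows for training and the remaining 30% for testing.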

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[sample, ]
test <- iris[!sample, ]

#view dimensions of training set
dim(train)

[1] 106 5

#view dimensions of test set
dim(test)

[1] 44 5

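From the output we can see:

The training set is a data frame with 106 rows and 5 columns.
The test set is a data frame with 44 rows and 5 columns.

Since the original dataset had 150 rows in total, the training set contains roughly 106/150 = 70.7% of the original rows. The share is not exactly 70% because each row is assigned to the training set independently with probability 0.7.

If you'd like, you can also view the first few rows of the training set: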

#view first few rows of training set
head(train)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
8          5.0         3.4          1.5         0.2  setosa
9          4.4         2.9          1.4         0.2  setosa
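
The base R approach in Example 1 assigns each row independently, so the training share only approximates 70%. If an exact 70/30 split is needed, one option is to sample row indices without replacement. The following is a minimal sketch of that idea, not part of the original method; the names train_exact, test_exact, and train_idx are chosen here purely for illustration.

```r
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#draw exactly 70% of the row indices without replacement
n <- nrow(iris)
train_idx <- sample(seq_len(n), size = round(0.7 * n))

#rows at the sampled indices form the training set, the rest form the test set
train_exact <- iris[train_idx, ]
test_exact <- iris[-train_idx, ]

#view dimensions of both sets
dim(train_exact)
dim(test_exact)
```

With 150 rows this yields exactly 105 training rows and 45 test rows, whichever seed is used.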

Example 2: Split Data into Training and Test Sets Using caTools

The following code shows how to use the caTools package in R to split the iris dataset into a training set and a test set, using 70% of the rows for training and the remaining 30% for testing.

library(caTools)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample.split(iris$Species, SplitRatio = 0.7)
train <- subset(iris, sample == TRUE)
test <- subset(iris, sample == FALSE)

#view dimensions of training set
dim(train)

[1] 105 5

#view dimensions of test set
dim(test)

[1] 45 5

From the output we can see:

The training set is a data frame with 105 rows and 5 columns.
The test set is a data frame with 45 rows and 5 columns.
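
According to the caTools documentation, sample.split preserves the relative proportions of the labels in the vector passed to it, which is why passing iris$Species gives a split that is balanced across species. To illustrate what that stratification means, here is a rough base R sketch of the same idea. It is an approximation written for this tutorial, not the caTools implementation, and the names train_strat, test_strat, and idx_by_species are hypothetical.

```r
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#group the row indices by species
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)

#within each species, draw 70% of its row indices without replacement
#(assumes every group has more than one row, as is the case for iris)
train_idx <- unlist(
  lapply(idx_by_species, function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
)

train_strat <- iris[train_idx, ]
test_strat <- iris[-train_idx, ]

#count rows per species in each set
table(train_strat$Species)
table(test_strat$Species)
```

Because iris has 50 rows per species, each species contributes 35 rows to the training set and 15 rows to the test set.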

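Example 3: Split Data into Training and Test Sets Using dplyr

The following code shows how to use the dplyr package in R to split the iris dataset into a training set and a test set, using 70% of the rows for training and the remaining 30% for testing.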

library(dplyr)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#create ID column so test rows can be identified with an anti-join
iris$id <- 1:nrow(iris)

#use 70% of dataset as training set and remaining 30% as test set
train <- iris %>% dplyr::sample_frac(0.7)
test <- dplyr::anti_join(iris, train, by = 'id')

#view dimensions of training set
dim(train)

[1] 105 6

#view dimensions of test set
dim(test)

[1] 45 6

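From the output we can see:

The training set is a data frame with 105 rows and 6 columns.
The test set is a data frame with 45 rows and 6 columns.

Note that these training and test sets contain the extra "id" column that we created.

When fitting your machine learning algorithm, make sure you don't use this column as a predictor (or drop it from the data frames entirely).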
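
As a sketch of the clean-up step described above, the helper column can be dropped by assigning NULL to it once the split is done. To keep this snippet free of package dependencies, it swaps base R sampling in for dplyr::sample_frac; the names iris_id, train_noid, and test_noid are hypothetical and used only here.

```r
#load iris dataset and work on a copy with an ID column
data(iris)
iris_id <- iris
iris_id$id <- seq_len(nrow(iris_id))

#make this example reproducible
set.seed(1)

#sample 70% of the IDs for the training set (base R stand-in for sample_frac)
train_ids <- sample(iris_id$id, size = round(0.7 * nrow(iris_id)))
train_noid <- iris_id[iris_id$id %in% train_ids, ]
test_noid <- iris_id[!(iris_id$id %in% train_ids), ]

#drop the helper column so it cannot be used as a predictor
train_noid$id <- NULL
test_noid$id <- NULL

#confirm that only the original five columns remain
names(train_noid)
```

With dplyr loaded, select(train, -id) would serve the same purpose.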

Additional Resources

The following tutorials explain how to perform other common operations in R:

How to Calculate MSE in R
How to Calculate RMSE in R
How to Calculate Adjusted R-Squared in R

