How to Split Data into Training & Test Sets in R (3 Methods)

By Benjamin Anderson, July 19, 2023

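Often, when we fit a machine learning algorithm to a dataset, we first split the dataset into a training set and a test set.

There are three common ways to split data into training and test sets in R: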

Method 1: Use Base R

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7, 0.3))
train <- df[sample, ]
test <- df[!sample, ]

Method 2: Use the caTools Package

library(caTools)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample.split(df$any_column_name, SplitRatio = 0.7)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)

Method 3: Use the dplyr Package

library(dplyr)

#make this example reproducible
set.seed(1)

#create ID column
df$id <- 1:nrow(df)

#use 70% of dataset as training set and 30% as test set
train <- df %>% dplyr::sample_frac(0.70)
test <- dplyr::anti_join(df, train, by = 'id')

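The following examples show how to use each method in practice with the built-in iris dataset in R.

Example 1: Split Data into Training and Test Sets Using Base R

The following code shows how to use base R to split the iris dataset into a training set and a test set, using 70% of the rows as the training set and the remaining 30% as the test set: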

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
train <- iris[sample, ]
test <- iris[!sample, ]

#view dimensions of training set
dim(train)

[1] 106 5

#view dimensions of test set
dim(test)

[1] 44 5

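From the output we can see:

The training set is a data frame with 106 rows and 5 columns.
The test set is a data frame with 44 rows and 5 columns.

Since the original dataset has 150 rows in total, the training set contains 106/150 ≈ 70.7% of the original rows. The split is only approximately 70/30 because each row is assigned to the training set independently with probability 0.7.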
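If an exact 70/30 split is preferred, one option is to sample row indices instead of a logical vector. The following is a minimal sketch using only base R; the names n_train, train_idx, train_exact and test_exact are illustrative and do not appear in the example above.

```r
# load iris dataset
data(iris)

# make this example reproducible
set.seed(1)

# number of rows to place in the training set (70% of the rows)
n_train <- round(0.7 * nrow(iris))

# draw that many distinct row indices (sampling without replacement)
train_idx <- sample(seq_len(nrow(iris)), size = n_train)

# sampled rows form the training set; all remaining rows form the test set
train_exact <- iris[train_idx, ]
test_exact <- iris[-train_idx, ]

# view dimensions of both sets
dim(train_exact)
dim(test_exact)
```

Because the number of training rows is fixed before sampling, this approach gives 105 training rows and 45 test rows for iris regardless of the seed; only which rows land in each set changes.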

If we'd like, we can also view the first few rows of the training set:

#view first few rows of training set
head(train)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
8          5.0         3.4          1.5         0.2  setosa
9          4.4         2.9          1.4         0.2  setosa

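Example 2: Split Data into Training and Test Sets Using caTools

The following code shows how to use the caTools package in R to split the iris dataset into a training set and a test set, using 70% of the rows as the training set and the remaining 30% as the test set: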

library(caTools)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample.split(iris$Species, SplitRatio = 0.7)
train <- subset(iris, sample == TRUE)
test <- subset(iris, sample == FALSE)

#view dimensions of training set
dim(train)

[1] 105 5

#view dimensions of test set
dim(test)

[1] 45 5

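From the output we can see:

The training set is a data frame with 105 rows and 5 columns.
The test set is a data frame with 45 rows and 5 columns.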
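The caTools documentation describes sample.split() as preserving the relative ratios of the labels in the vector it is given, which here is the Species column. A quick way to check this is to tabulate the species in each set. The sketch below assumes caTools is installed and repeats the split so that it runs on its own; the name split_flag is illustrative.

```r
library(caTools)

# load iris dataset
data(iris)

# make this example reproducible
set.seed(1)

# split on Species so each class is divided in the same 70/30 ratio
split_flag <- sample.split(iris$Species, SplitRatio = 0.7)
train <- subset(iris, split_flag == TRUE)
test <- subset(iris, split_flag == FALSE)

# count rows per species in each set
table(train$Species)
table(test$Species)
```

Since iris has 50 rows per species, each species should contribute 35 rows to the training set and 15 rows to the test set.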

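Example 3: Split Data into Training and Test Sets Using dplyr

The following code shows how to use the dplyr package in R to split the iris dataset into a training set and a test set, using 70% of the rows as the training set and the remaining 30% as the test set: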

library(dplyr)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#create ID column
iris$id <- 1:nrow(iris)

#use 70% of dataset as training set and remaining 30% as test set
train <- iris %>% dplyr::sample_frac(0.7)
test <- dplyr::anti_join(iris, train, by = 'id')

#view dimensions of training set
dim(train)

[1] 105 6

#view dimensions of test set
dim(test)

[1] 45 6

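From the output we can see:

The training set is a data frame with 105 rows and 6 columns.
The test set is a data frame with 45 rows and 6 columns.

Note that these training and test sets contain the additional "id" column that we created.

Make sure you don't use this column when fitting your machine learning algorithm (or remove it from the data frames entirely).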
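One way to act on this advice is to drop the helper column as soon as the split is done. The sketch below also uses slice_sample(), which dplyr 1.0.0 and later recommend in place of sample_frac(). It is offered as an alternative rather than the method used above, so the sampled rows may differ from the earlier output. The name iris_id is illustrative and keeps the built-in iris data frame unmodified.

```r
library(dplyr)

# load iris dataset
data(iris)

# make this example reproducible
set.seed(1)

# add a temporary row ID so the test rows can be identified later
iris_id <- iris %>% mutate(id = row_number())

# sample 70% of rows for training, keep the unsampled rows for testing
train <- iris_id %>% slice_sample(prop = 0.7)
test <- anti_join(iris_id, train, by = "id")

# drop the helper column before fitting a model
train <- train %>% select(-id)
test <- test %>% select(-id)

# view dimensions of both sets
dim(train)
dim(test)
```

After the last step both data frames are back to the five original iris columns, so the ID cannot leak into a model as a predictor.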

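Additional Resources

The following tutorials explain how to perform other common operations in R:

How to Calculate MSE in R
How to Calculate RMSE in R
How to Calculate Adjusted R-Squared in R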

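About the Author

Benjamin Anderson

Hi, I'm Benjamin, a retired statistics professor turned dedicated Statorials teacher. With extensive experience and expertise in statistics, I'm eager to share my knowledge and empower students through Statorials.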
