How to Use the createDataPartition() Function in R

By Dr. Benjamin Anderson, July 14, 2023

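You can use the createDataPartition() function from the caret package in R to split a data frame into training and testing sets for model building.

This function uses the following basic syntax:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, ...)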

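where:

y: the vector of outcomes
times: the number of partitions to create
p: the proportion of the data to use in the training set (for example, 0.8 for 80%)
list: whether to return the results as a list (TRUE) or as a matrix (FALSE)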
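To make the times and list arguments more concrete, here is a minimal sketch. The vector y and the seed are made up purely for illustration and are not part of the original example.

```r
library(caret)

# illustrative outcome vector (an assumption, not from the article)
set.seed(1)
y <- runif(100)

# list = TRUE (the default): a list with one vector of row indices per partition
idx_list <- createDataPartition(y, times = 2, p = 0.8)
str(idx_list)

# list = FALSE: a matrix with one column of row indices per partition
idx_mat <- createDataPartition(y, times = 2, p = 0.8, list = FALSE)
dim(idx_mat)
```

The matrix form is convenient when the indices are used directly to subset a data frame, which is why the example in this tutorial sets list=FALSE.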

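The following example shows how to use this function in practice.

Example: Using createDataPartition() in R

Suppose we have a data frame in R with 1,000 rows that contains the number of hours studied by students and their corresponding final exam scores: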

#make this example reproducible
set.seed(0)

#create data frame
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))

#view head of data frame
head(df)

     hours score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139

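Suppose we would like to fit a simple linear regression model that uses hours studied to predict final exam score.

Suppose we would like to train the model on 80% of the rows in the data frame and test it on the remaining 20% of rows.

The following code shows how to use the createDataPartition() function from the caret package to split the data frame into training and testing sets: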

library(caret)

#partition data frame into training and testing sets
train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)

#create training set
df_train <- df[train_indices, ]

#create testing set
df_test <- df[-train_indices, ]

#view number of rows in each set
nrow(df_train)

[1] 800

nrow(df_test)

[1] 200

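We can see that our training dataset contains 800 rows, which represents 80% of the original dataset.

Similarly, we can see that our testing dataset contains 200 rows, which represents 20% of the original dataset.

We can also view the first few rows of each set: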
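As a quick sanity check, the sketch below rebuilds the same split and confirms that the two sets share no rows and together cover the whole data frame. For a numeric y, createDataPartition() samples within percentile-based groups, so the score summaries of the two sets should look similar. Exact values depend on the random number generator, so no output is shown here.

```r
library(caret)

#recreate the data frame and the split from this example
set.seed(0)
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)
df_train <- df[train_indices, ]
df_test <- df[-train_indices, ]

#check that no row appears in both sets
shared_rows <- intersect(rownames(df_train), rownames(df_test))
length(shared_rows)

#check that the two sets together account for every row
nrow(df_train) + nrow(df_test) == nrow(df)

#compare the distribution of the outcome in each set
summary(df_train$score)
summary(df_test$score)
```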

#view head of training set
head(df_train)

     hours score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
7 8.983897 42.34600

#view head of testing set
head(df_test)

      hours score
6 2.016819 47.10139
12 2.059746 96.67170
18 7.176185 92.61150
23 2.121425 89.17611
24 6.516738 50.47970
25 1.255551 90.58483

We can then proceed to train the regression model using the training set and evaluate its performance using the testing set.
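One way this last step might look is sketched below, using base R's lm() and predict() with root mean squared error (RMSE) as the evaluation metric. The choice of RMSE and the names model, preds, and rmse are assumptions made for illustration. Because hours and score are simulated independently in this example, the model should not be expected to predict well.

```r
library(caret)

#recreate the data frame and the split from this example
set.seed(0)
df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))
train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)
df_train <- df[train_indices, ]
df_test <- df[-train_indices, ]

#fit simple linear regression model on the training set only
model <- lm(score ~ hours, data=df_train)

#predict exam scores for the held-out testing set
preds <- predict(model, newdata=df_test)

#compute RMSE on the testing set (illustrative metric)
rmse <- sqrt(mean((df_test$score - preds)^2))
rmse
```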

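Additional Resources

The following tutorials explain how to use other common functions in R:

How to Perform K-Fold Cross Validation in R
How to Perform Multiple Linear Regression in R
How to Perform Logistic Regression in R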

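About the Author

Dr. Benjamin Anderson

Hi, I'm Benjamin, a retired statistics professor turned dedicated Statorials teacher. With extensive experience and expertise in the field of statistics, I am eager to share my knowledge and help students succeed through Statorials.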
