How to Create a Training and Testing Set from a Pandas DataFrame

By Benjamin Anderson · July 19, 2023 · Guides

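When fitting a machine learning model to a dataset, we typically split the dataset into two sets:

1. Training set: used to train the model (70-80% of the original dataset)

2. Testing set: used to get an unbiased estimate of the model's performance (20-30% of the original dataset)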
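To make the roles of the two sets concrete, here is a minimal sketch that fits a model on the training set only and scores it on the held-out testing set. The choice of LogisticRegression and the randomly generated DataFrame are assumptions made for illustration, and because the labels are random, the accuracy itself carries no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# illustrative data: two integer features and a random binary label
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

# hold out 20% of the rows for testing
train, test = train_test_split(df, test_size=0.2, random_state=0)

# fit on the training rows only
model = LogisticRegression()
model.fit(train[['x1', 'x2']], train['y'])

# evaluate on rows the model has never seen
test_accuracy = model.score(test[['x1', 'x2']], test['y'])
print(test_accuracy)
```

Scoring on rows that were excluded from fitting is what keeps the performance estimate from being overly optimistic.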

In Python, there are two common ways to split a pandas DataFrame into a training set and a testing set:

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

The following examples show how to use each method with the following pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the train_test_split() function from sklearn to split the pandas DataFrame into a training set and a testing set:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

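From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that test_size sets the fraction of observations from the original DataFrame that go to the testing set, and random_state makes the split reproducible.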
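As a quick check of how these two arguments behave, the sketch below repeats the split with the same random_state and then with a different test_size. The variable names and the value 0.25 are illustrative choices, and the DataFrame is rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# rebuild the example DataFrame so this block is self-contained
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

# same random_state: both calls select exactly the same rows
train_a, test_a = train_test_split(df, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=0)
same_split = test_a.index.equals(test_b.index)
print(same_split)  # True

# a larger test_size moves more rows into the testing set
train_c, test_c = train_test_split(df, test_size=0.25, random_state=0)
print(train_c.shape, test_c.shape)  # (750, 3) (250, 3)
```

If random_state is omitted, each call shuffles the rows differently, so the two sets change from run to run.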

Example 2: Use sample() from pandas

The following code shows how to use the sample() function from pandas to split the pandas DataFrame into a training set and a testing set:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

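From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that frac sets the fraction of observations from the original DataFrame that go to the training set, and random_state makes the split reproducible.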
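Because drop() removes rows by index label, this approach relies on the DataFrame having unique index labels. The sketch below first confirms that the two sets are disjoint and together cover every row, then shows one possible workaround when labels repeat. The small `dup` DataFrame is a hypothetical case built only for illustration.

```python
import numpy as np
import pandas as pd

# rebuild the example DataFrame so this block is self-contained
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# the sets should share no index labels and together cover every row
no_overlap = train.index.intersection(test.index).empty
covers_all = len(train) + len(test) == len(df)
print(no_overlap, covers_all)

# hypothetical case: 10 rows whose labels 0..4 each appear twice
dup = pd.concat([df.iloc[:5], df.iloc[:5]])
print(dup.index.is_unique)

# drop() would remove every row sharing a sampled label,
# so reset the index to unique labels before splitting
clean = dup.reset_index(drop=True)
clean_train = clean.sample(frac=0.8, random_state=0)
clean_test = clean.drop(clean_train.index)
print(clean_train.shape, clean_test.shape)
```

Checking df.index.is_unique before splitting is a cheap way to catch this situation early.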

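Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python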

