How to Create a Train and Test Set from a Pandas DataFrame

By Dr. Benjamin Anderson, July 19, 2023

When fitting machine learning models to datasets, we often split the dataset into two sets:

1. Training set: Used to train the model (70-80% of the original dataset)

2. Testing set: Used to get an unbiased estimate of the model's performance (20-30% of the original dataset)
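To make the roles of the two sets concrete, here is a minimal sketch of the usual workflow: fit a model on the training rows only, then score it on the held-out test rows. The synthetic data, the variable names (`data`, `train_df`, `test_df`), and the choice of `LogisticRegression` are assumptions made for illustration; they are not prescribed by this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hypothetical data: two numeric features and a binary target
rng = np.random.default_rng(1)
data = pd.DataFrame({'x1': rng.integers(0, 30, size=1000),
                     'x2': rng.integers(0, 12, size=1000),
                     'y': rng.integers(0, 2, size=1000)})

# hold out 20% of the rows for testing
train_df, test_df = train_test_split(data, test_size=0.2, random_state=0)

# fit the model on the training set only
model = LogisticRegression()
model.fit(train_df[['x1', 'x2']], train_df['y'])

# estimate performance on rows the model has never seen
test_accuracy = model.score(test_df[['x1', 'x2']], test_df['y'])

print(train_df.shape, test_df.shape)
print(round(test_accuracy, 3))
```

Because the target here is random noise, the accuracy itself is not meaningful; the point is only that the test rows play no part in fitting.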

There are two common ways to split a pandas DataFrame into a training set and a testing set in Python:

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

The following examples show how to use each method with the following pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the train_test_split() function from sklearn to split the pandas DataFrame into training and testing sets:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that test_size controls the fraction of observations from the original DataFrame that go into the testing set (0.2 means 20%), and the random_state value makes the split reproducible.
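The effect of both arguments can be checked directly. The sketch below uses a small throwaway DataFrame (`demo`, a name assumed here for illustration) to show that repeating the call with the same random_state returns the same test rows, and that changing test_size changes the size of each set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical DataFrame with 1,000 rows
demo = pd.DataFrame({'x': np.arange(1000)})

# the same random_state gives the same split on every call
_, test_a = train_test_split(demo, test_size=0.2, random_state=0)
_, test_b = train_test_split(demo, test_size=0.2, random_state=0)
same_split = test_a.index.equals(test_b.index)

# a larger test_size moves more rows into the testing set
train_c, test_c = train_test_split(demo, test_size=0.25, random_state=0)

print(same_split)                   # True
print(train_c.shape, test_c.shape)  # (750, 1) (250, 1)
```

If random_state is omitted, each call shuffles the rows differently, so the two sets will generally change from run to run.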

Example 2: Use sample() from pandas

The following code shows how to use the sample() function from pandas to split the pandas DataFrame into training and testing sets:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

From the output we can see that two sets have been created:

Training set: 800 rows and 3 columns
Testing set: 200 rows and 3 columns

Note that frac controls the fraction of observations from the original DataFrame that go into the training set (0.8 means 80%), and the random_state value makes the split reproducible.
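As a quick sanity check on this approach, the following sketch confirms that the two sets share no rows and together account for every row of the original. The DataFrame `demo` and the variable names are assumptions made for illustration. One caveat worth stating: drop() removes rows by index label, so this pattern relies on the index being unique; with duplicate labels it would remove every row sharing a sampled label, and resetting the index first with reset_index(drop=True) is one way to avoid that.

```python
import numpy as np
import pandas as pd

# hypothetical DataFrame with a unique default index (0..999)
demo = pd.DataFrame({'x': np.arange(1000)})

# sample 80% of the rows for training, keep the rest for testing
train_rows = demo.sample(frac=0.8, random_state=0)
test_rows = demo.drop(train_rows.index)

# no index label should appear in both sets
overlap = train_rows.index.intersection(test_rows.index)

# together the two sets should cover every original row
covers_all = len(train_rows) + len(test_rows) == len(demo)

print(len(overlap), covers_all)  # 0 True
```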

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python

About the Author

Dr. Benjamin Anderson

Hi, I'm Benjamin, a retired statistics professor turned dedicated statistics tutor. With extensive experience and expertise in the field of statistics, I am eager to share my knowledge to empower students through Statorials.
