Quadratic Discriminant Analysis in Python (Step-by-Step)

By Benjamin Anderson, July 27, 2023

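Quadratic discriminant analysis (QDA) is a method you can use when you have a set of predictor variables and want to classify a response variable into two or more classes.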

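It is considered the non-linear equivalent of linear discriminant analysis: each class gets its own covariance matrix, so the decision boundaries between classes are quadratic rather than linear.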
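For reference, the usual textbook form of the QDA discriminant score for class k is sketched below. The symbols are notation introduced here rather than taken from this tutorial: \mu_k is the class mean, \Sigma_k the class covariance matrix, and \pi_k the class prior.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
              + \log\pi_k
```

An observation x is assigned to the class with the largest \delta_k(x). Because \Sigma_k differs by class, the term that is quadratic in x does not cancel when two classes are compared, which is where the quadratic boundary comes from.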

This tutorial provides a step-by-step example of how to perform quadratic discriminant analysis in Python.

Step 1: Load the Necessary Libraries

First, we'll load the functions and libraries needed for this example.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

Step 2: Load the Data

For this example, we'll use the iris dataset from the sklearn library. The following code shows how to load this dataset and convert it to a pandas DataFrame to make it easier to work with.

#load iris dataset
iris = datasets.load_iris()

#convert dataset to pandas DataFrame
df = pd.DataFrame(data = np.c_[iris['data'], iris['target']],
                  columns = iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.columns = ['s_length', 's_width', 'p_length', 'p_width', 'target', 'species']

#view first five rows of DataFrame
df.head()

   s_length  s_width  p_length  p_width  target species
0       5.1      3.5       1.4      0.2     0.0  setosa
1       4.9      3.0       1.4      0.2     0.0  setosa
2       4.7      3.2       1.3      0.2     0.0  setosa
3       4.6      3.1       1.5      0.2     0.0  setosa
4       5.0      3.6       1.4      0.2     0.0  setosa

#find how many total observations are in dataset
len(df.index)

150

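We can see that the dataset contains 150 total observations.

For this example, we'll build a quadratic discriminant analysis model to classify which species a given flower belongs to.

We'll use the following predictor variables in the model:

Sepal length
Sepal width
Petal length
Petal width

And we'll use them to predict the response variable species, which takes on the following three potential classes:

Setosa
Versicolor
Virginica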
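Before fitting, it can help to confirm how the 150 observations are spread across the three classes. The following is a minimal sketch that reloads the data on its own; the name class_counts is introduced here for illustration and is not part of the original tutorial.

```python
from sklearn import datasets
import pandas as pd

# Load iris and map the integer targets to species names
iris = datasets.load_iris()
species = pd.Categorical.from_codes(iris.target, iris.target_names)

# Count how many observations fall into each class
class_counts = pd.Series(species).value_counts()
print(class_counts)
```

The iris dataset has 50 observations per species, so accuracy is a reasonable metric here. With strongly imbalanced classes, a different scoring choice might be preferable.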

Step 3: Fit the QDA Model

Next, we'll use the QuadraticDiscriminantAnalysis class from sklearn to fit the QDA model to our data.

#define predictor and response variables
X = df[['s_length', 's_width', 'p_length', 'p_width']]
y = df['species']

#fit the QDA model
model = QuadraticDiscriminantAnalysis()
model.fit(X, y)
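Once the model is fitted, its estimated parameters can be inspected directly. The sketch below is a standalone illustration, not part of the original tutorial: it refits on the raw iris arrays under the assumed name qda, and it passes store_covariance=True so that the per-class covariance matrices are kept on the fitted object.

```python
from sklearn import datasets
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Load the four iris measurements and the integer class labels
X, y = datasets.load_iris(return_X_y=True)

# store_covariance=True keeps the per-class covariance matrices on the model
qda = QuadraticDiscriminantAnalysis(store_covariance=True)
qda.fit(X, y)

print(qda.classes_)          # class labels seen during fit
print(qda.priors_)           # estimated class priors
print(qda.means_.shape)      # one mean vector per class
print(len(qda.covariance_))  # one covariance matrix per class
```

Each class has its own mean vector and its own 4 x 4 covariance matrix. That per-class covariance is what separates QDA from linear discriminant analysis, which pools a single covariance matrix across all classes.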

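Step 4: Use the Model to Make Predictions

Once we've fit the model to our data, we can evaluate how well it performs by using repeated stratified k-fold cross-validation.

For this example, we'll use 10 folds and 3 repeats.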

#define method to evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

#evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(np.mean(scores))

0.97333333333334

We can see that the model achieved a mean accuracy of 97.33%.
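The mean summarizes 30 separate accuracy scores (10 folds times 3 repeats), so it can be useful to look at their spread as well. The following is a minimal standalone sketch using the same cross-validation settings; it reloads the data as raw arrays and omits n_jobs, both of which are choices made here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Same scheme as above: 10 folds, 3 repeats, fixed seed
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y,
                         scoring='accuracy', cv=cv)

print(len(scores))                       # number of individual fold scores
print(np.mean(scores), np.std(scores))   # average accuracy and its spread
print(scores.min(), scores.max())        # worst and best fold
```

Each test fold holds only 15 flowers, so a single misclassification moves that fold's accuracy by roughly 6.7 percentage points. Some fold-to-fold variation is therefore expected.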

We can also use the model to predict which class a new flower belongs to, based on input values.

#define new observation
new = [5, 3, 1, .4]

#predict which class the new observation belongs to
model.predict([new])

array(['setosa'], dtype='<U10')

We can see that the model predicts this new observation to belong to the species setosa.
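Beyond the predicted label, the estimated class probabilities show how confident the model is. The sketch below is a standalone illustration that refits on the raw iris arrays under the assumed name qda and calls predict_proba on the same new observation.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

iris = datasets.load_iris()
X = iris.data
y = iris.target_names[iris.target]  # species names instead of integer codes

qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Same new observation as above
new = [5, 3, 1, .4]

# One row per observation, one column per class (ordered as qda.classes_)
proba = qda.predict_proba([new])
for label, p in zip(qda.classes_, proba[0]):
    print(label, round(p, 4))
```

The class with the highest probability is the one that predict returns.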

The complete Python code used in this tutorial can be found here.
