XGBoost in R: A Step-by-Step Example

By Benjamin Anderson, July 27, 2023

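Boosting is a machine learning technique that has been shown to produce models with high predictive accuracy.

One of the most common ways to implement boosting in practice is XGBoost, short for “extreme gradient boosting.”

This tutorial provides a step-by-step example of how to use XGBoost to fit a boosted model in R.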

Step 1: Load the Necessary Packages

First, we'll load the necessary libraries.

library(xgboost) #for fitting the xgboost model
library(caret) #for general data preparation and model fitting

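Step 2: Load the Data

For this example, we'll fit a boosted regression model to the Boston dataset from the MASS package.

This dataset contains 13 predictor variables that we'll use to predict a response variable called medv, which represents the median value of homes in various census tracts around Boston.

#load the data
data = MASS::Boston

#view the structure of the data
str(data)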

'data.frame': 506 obs. of 14 variables:
 $ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num 6.58 6.42 7.18 7 7.15 ...
 $ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num 397 397 393 395 397 ...
 $ lstat  : num 4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

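We can see that the dataset contains 506 observations and 14 variables in total.

Step 3: Prep the Data

Next, we'll use the createDataPartition() function from the caret package to split the original dataset into a training and testing set.

For this example, we'll choose to use 80% of the original dataset as part of the training set.

Note that the xgboost package also works with matrix data, so we'll use the data.matrix() function to hold our predictor variables.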
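To make the conversion concrete, here is a minimal sketch of what data.matrix() does, using a small made-up data frame (df, size, and type are illustrative names, not part of the Boston data). Every column of Boston is already numeric, so in this tutorial the call mainly changes the container from a data frame to a matrix.

```r
# small data frame with a numeric column and a factor column (illustrative)
df <- data.frame(size = c(1.5, 2.5, 3.5),
                 type = factor(c("a", "b", "a")))

# data.matrix() returns a numeric matrix; factors become their integer codes
m <- data.matrix(df)

print(is.matrix(m))
print(m)
```

If a dataset has factor predictors, this integer coding imposes an ordering on the levels, so it is worth checking whether that is what you want before passing the matrix on.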

#make this example reproducible
set.seed(0)

#split into training (80%) and testing set (20%)
parts = createDataPartition(data$medv, p = .8, list = FALSE)
train = data[parts, ]
test = data[-parts, ]

#define predictor and response variables in training set
#medv is the 14th column of Boston (column 13 is lstat), so select it by name;
#the round-by-round output listed later appears to have been logged with
#column 13 as the response, so a rerun will print different values
train_x = data.matrix(train[, names(train) != "medv"])
train_y = train$medv

#define predictor and response variables in testing set
test_x = data.matrix(test[, names(test) != "medv"])
test_y = test$medv

#define final training and testing sets
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
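Positional indexes are easy to get wrong on a 14-column data frame. As a quick check, the response column can be located and selected by name. This is only a sketch, and the names medv_col, x, and y are illustrative:

```r
data <- MASS::Boston

# position of the response column; medv is the last of the 14 columns
medv_col <- which(names(data) == "medv")
print(medv_col)

# selecting by name avoids off-by-one mistakes with positional indexes
x <- data.matrix(data[, names(data) != "medv"])
y <- data$medv

print(dim(x))      # 506 rows, 13 predictor columns
print(length(y))
```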

Step 4: Fit the Model

Next, we'll fit the XGBoost model using the xgb.train() function, which displays the training and testing RMSE (root mean squared error) for each round of boosting.

Note that we chose to use 70 rounds for this example, but for much larger datasets it's not uncommon to use hundreds or even thousands of rounds. Just keep in mind that the more rounds, the longer the run time.

Also note that the max.depth argument specifies how deep to grow the individual decision trees. We typically choose a low value such as 2 or 3 so that smaller trees are grown. This approach has been shown to tend to produce more accurate models.

#define watchlist
watchlist = list(train=xgb_train, test=xgb_test)

#fit XGBoost model and display training and testing data at each round
model = xgb.train(data = xgb_train, max.depth = 3, watchlist=watchlist, nrounds = 70)

[1] train-rmse:10.167523 test-rmse:10.839775 
[2] train-rmse:7.521903 test-rmse:8.329679 
[3] train-rmse:5.702393 test-rmse:6.691415 
[4] train-rmse:4.463687 test-rmse:5.631310 
[5] train-rmse:3.666278 test-rmse:4.878750 
[6] train-rmse:3.159799 test-rmse:4.485698 
[7] train-rmse:2.855133 test-rmse:4.230533 
[8] train-rmse:2.603367 test-rmse:4.099881 
[9] train-rmse:2.445718 test-rmse:4.084360 
[10] train-rmse:2.327318 test-rmse:3.993562 
[11] train-rmse:2.267629 test-rmse:3.944454 
[12] train-rmse:2.189527 test-rmse:3.930808 
[13] train-rmse:2.119130 test-rmse:3.865036 
[14] train-rmse:2.086450 test-rmse:3.875088 
[15] train-rmse:2.038356 test-rmse:3.881442 
[16] train-rmse:2.010995 test-rmse:3.883322 
[17] train-rmse:1.949505 test-rmse:3.844382 
[18] train-rmse:1.911711 test-rmse:3.809830 
[19] train-rmse:1.888488 test-rmse:3.809830 
[20] train-rmse:1.832443 test-rmse:3.758502 
[21] train-rmse:1.816150 test-rmse:3.770216 
[22] train-rmse:1.801369 test-rmse:3.770474 
[23] train-rmse:1.788891 test-rmse:3.766608 
[24] train-rmse:1.751795 test-rmse:3.749583 
[25] train-rmse:1.713306 test-rmse:3.720173 
[26] train-rmse:1.672227 test-rmse:3.675086 
[27] train-rmse:1.648323 test-rmse:3.675977 
[28] train-rmse:1.609927 test-rmse:3.745338 
[29] train-rmse:1.594891 test-rmse:3.756049 
[30] train-rmse:1.578573 test-rmse:3.760104 
[31] train-rmse:1.559810 test-rmse:3.727940 
[32] train-rmse:1.547852 test-rmse:3.731702 
[33] train-rmse:1.534589 test-rmse:3.729761 
[34] train-rmse:1.520566 test-rmse:3.742681 
[35] train-rmse:1.495155 test-rmse:3.732993 
[36] train-rmse:1.467939 test-rmse:3.738329 
[37] train-rmse:1.446343 test-rmse:3.713748 
[38] train-rmse:1.435368 test-rmse:3.709469 
[39] train-rmse:1.401356 test-rmse:3.710637 
[40] train-rmse:1.390318 test-rmse:3.709461 
[41] train-rmse:1.372635 test-rmse:3.708049 
[42] train-rmse:1.367977 test-rmse:3.707429 
[43] train-rmse:1.359531 test-rmse:3.711663 
[44] train-rmse:1.335347 test-rmse:3.709101 
[45] train-rmse:1.331750 test-rmse:3.712490 
[46] train-rmse:1.313087 test-rmse:3.722981 
[47] train-rmse:1.284392 test-rmse:3.712840 
[48] train-rmse:1.257714 test-rmse:3.697482 
[49] train-rmse:1.248218 test-rmse:3.700167 
[50] train-rmse:1.243377 test-rmse:3.697914 
[51] train-rmse:1.231956 test-rmse:3.695797 
[52] train-rmse:1.219341 test-rmse:3.696277 
[53] train-rmse:1.207413 test-rmse:3.691465 
[54] train-rmse:1.197197 test-rmse:3.692108 
[55] train-rmse:1.171748 test-rmse:3.683577 
[56] train-rmse:1.156332 test-rmse:3.674458 
[57] train-rmse:1.147686 test-rmse:3.686367 
[58] train-rmse:1.143572 test-rmse:3.686375 
[59] train-rmse:1.129780 test-rmse:3.679791 
[60] train-rmse:1.111257 test-rmse:3.679022 
[61] train-rmse:1.093541 test-rmse:3.699670 
[62] train-rmse:1.083934 test-rmse:3.708187 
[63] train-rmse:1.067109 test-rmse:3.712538 
[64] train-rmse:1.053887 test-rmse:3.722480 
[65] train-rmse:1.042127 test-rmse:3.720720 
[66] train-rmse:1.031617 test-rmse:3.721224 
[67] train-rmse:1.016274 test-rmse:3.699549 
[68] train-rmse:1.008184 test-rmse:3.709522 
[69] train-rmse:0.999220 test-rmse:3.708000 
[70] train-rmse:0.985907 test-rmse:3.705192

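From the output we can see that the minimum testing RMSE (3.674458) is achieved at round 56. Beyond this point the test RMSE stops improving and drifts upward while the training RMSE keeps falling, which indicates that the model is overfitting the training data.

Thus, we'll define our final XGBoost model to use 56 rounds.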
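Picking the best round by eye is error-prone with 70 lines of output. Below is a minimal sketch of doing it programmatically, using the test RMSE values for rounds 53 through 59 copied from the log above (the names rounds, test_rmse, best_round, and best_rmse are illustrative). With a fitted model, the full log is typically available as model$evaluation_log in xgboost 1.x, though that location is an assumption and may differ in other versions.

```r
# test RMSE for rounds 53-59, copied from the training log above
rounds    <- 53:59
test_rmse <- c(3.691465, 3.692108, 3.683577, 3.674458,
               3.686367, 3.686375, 3.679791)

# which.min() returns the position of the smallest value
best_round <- rounds[which.min(test_rmse)]
best_rmse  <- min(test_rmse)

print(best_round)
print(best_rmse)
```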

#define final model
final = xgboost(data = xgb_train, max.depth = 3, nrounds = 56, verbose = 0)

Note: The argument verbose = 0 tells R not to print the evaluation metric for each round.

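Step 5: Use the Model to Make Predictions

Lastly, we can use the final boosted model to make predictions about the median house value of Boston homes in the testing set.

We will then calculate the following accuracy measures for the model:

MSE: Mean Squared Error
MAE: Mean Absolute Error
RMSE: Root Mean Squared Error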

#use the final model to make predictions on the test set
pred_y = predict(final, xgb_test)

mean((test_y - pred_y)^2) #mse
caret::MAE(test_y, pred_y) #mae
caret::RMSE(test_y, pred_y) #rmse

[1] 13.50164
[1] 2.409426
[1] 3.674457

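The root mean squared error turns out to be 3.674457. This represents the typical size of the difference between the predictions made for the median house values and the actual observed house values in the testing set.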
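For readers who want to see how the three measures relate, here is a minimal base R sketch on made-up numbers (actual and predicted are illustrative vectors, not values from the Boston data). It also checks that the RMSE reported above is the square root of the reported MSE, up to rounding.

```r
# toy observed and predicted values (illustrative, not from the Boston data)
actual    <- c(3, 5, 2)
predicted <- c(2, 5, 4)

err <- actual - predicted

mse  <- mean(err^2)      # mean squared error
mae  <- mean(abs(err))   # mean absolute error
rmse <- sqrt(mse)        # root mean squared error, same units as the response

print(c(mse = mse, mae = mae, rmse = rmse))

# the values reported above satisfy the same relationship, up to rounding
print(abs(sqrt(13.50164) - 3.674457) < 1e-5)
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE on the same predictions; a large gap between the two suggests a few predictions are far off.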

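If we want, we could compare this RMSE to other models such as multiple linear regression, ridge regression, or principal components regression to see which model produces the most accurate predictions.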
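As one possible starting point for such a comparison, the sketch below fits a multiple linear regression baseline with base R's lm() and scores it with test RMSE. It is self-contained and uses a simple random 80/20 split rather than the caret split from Step 3, so its RMSE is not directly comparable to the XGBoost figure; for a fair comparison, fit both models on the same training and testing rows. The names train_idx, lm_fit, lm_pred, and lm_rmse are illustrative.

```r
# baseline: multiple linear regression on Boston, scored with test RMSE
data <- MASS::Boston

# simple 80/20 random split in base R (not the caret split used above)
set.seed(0)
train_idx <- sample(nrow(data), size = floor(0.8 * nrow(data)))
train <- data[train_idx, ]
test  <- data[-train_idx, ]

# regress medv on all other columns
lm_fit  <- lm(medv ~ ., data = train)
lm_pred <- predict(lm_fit, newdata = test)

# test RMSE of the linear baseline
lm_rmse <- sqrt(mean((test$medv - lm_pred)^2))
print(lm_rmse)
```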

You can find the complete R code used in this example here.
