data.table vs. Data Frame in R: Three Key Differences

By Benjamin Anderson, July 16, 2023

In the R programming language, data.frame is part of base R.

You can convert any data.frame to a data.table with the setDT function from the data.table package.
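
As a minimal sketch of that conversion (the object names df and dt_copy are hypothetical, chosen only for illustration): setDT converts the object in place, by reference, whereas as.data.table returns a converted copy and leaves the original untouched. setDF goes in the opposite direction.

```r
library(data.table)

# hypothetical example data
df <- data.frame(id = 1:3, value = c(10, 20, 30))

# as.data.table() returns a new data.table; df itself stays a plain data.frame
dt_copy <- as.data.table(df)
class(df)       # "data.frame"
class(dt_copy)  # "data.table" "data.frame"

# setDT() converts df in place (by reference), without copying the data
setDT(df)
class(df)       # "data.table" "data.frame"

# setDF() goes the other way, back to a plain data.frame
setDF(df)
class(df)       # "data.frame"
```

Because setDT modifies its argument, any other name pointing at the same object becomes a data.table as well; as.data.table is the safer choice when the original data.frame is still needed.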

A data.table offers the following advantages over a data.frame in R:

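1. The fread function from the data.table package reads a file into a data.table much faster than base R functions such as read.csv read a file into a data.frame.

2. Operations such as grouping and aggregating run much faster on a data.table than on a data.frame.

3. When you print a data.frame to the console, R attempts to print every row (up to the max.print limit). By default, a data.table with more than 100 rows prints only its first five and last five rows, which keeps a large dataset from flooding the console or freezing your session.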

The following examples illustrate these differences between data.frames and data.tables in practice.

Difference #1: Faster Import with fread

The following code shows how to import a data frame with 10,000 rows and 100 columns using both the fread function from the data.table package and the read.csv function from base R.

library(microbenchmark)
library(data.table)

#make this example reproducible
set.seed(1)

#create data frame with 10,000 rows and 100 columns
df <- as.data.frame(matrix(runif(10^4 * 100), nrow = 10^4))

#export CSV to current working directory
write.csv(df, "test.csv", quote = FALSE)

#import CSV file using fread and read.csv and time how long it takes
results <- microbenchmark(
  read.csv = read.csv("test.csv", header = TRUE, stringsAsFactors = FALSE),
  fread = fread("test.csv", sep = ",", stringsAsFactors = FALSE),
  times = 10)

#view results
results

Unit: milliseconds
     expr      min       lq      mean   median       uq       max neval cld
 read.csv 817.1867 892.8748 1026.7071 899.5755 926.9120 1964.0540    10   b
    fread 113.5889 116.2735  136.4079 124.3816 136.0534  211.7484    10  a 

From the results we can see that fread imported this CSV file roughly 7 times faster than read.csv (a median of about 124 ms versus about 900 ms).

This gap tends to widen as the dataset gets larger.

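Difference #2: Faster Data Manipulation with data.table

In general, a data.table also performs data manipulation tasks much faster than a data.frame.

For example, the following code shows how to calculate the mean of one variable, grouped by another variable, for both a data.table and a data.frame.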

library(microbenchmark)
library(data.table)

#make this example reproducible
set.seed(1)

#create data frame with 10,000 rows and 2 columns
d_frame <- data.frame(team = rep(c('A', 'B'), each = 5000),
                      points = c(rnorm(10000, mean = 20, sd = 3)))

#create data.table as a copy, so d_frame stays a plain data.frame
#(setDT would convert d_frame itself in place)
d_table <- as.data.table(d_frame)

#calculate mean of points grouped by team in data.frame and data.table
results <- microbenchmark(
  mean_d_frame = aggregate(d_frame$points, list(d_frame$team), FUN = mean),
  mean_d_table = d_table[, list(mean = mean(points)), by = team],
  times = 10)

#view results
results

Unit: milliseconds
         expr    min     lq    mean median     uq    max neval cld
 mean_d_frame 2.9045 3.0077 3.11683 3.1074 3.1654 3.4824    10   b
 mean_d_table 1.0539 1.1140 1.52002 1.2075 1.2786 3.6084    10  a 

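From the results we can see that the data.table was roughly 2.5 times faster than the data.frame (a median of about 1.2 ms versus about 3.1 ms).

This gap tends to widen for larger datasets.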
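
The grouped calculation above uses the general data.table form DT[i, j, by]: j computes the result and by defines the groups. The following is a small sketch on hypothetical data (the names scores and summary_dt are made up for illustration) that computes two summaries per group in a single call.

```r
library(data.table)

# hypothetical example data: two teams, three scores each
scores <- data.table(team   = rep(c("A", "B"), each = 3),
                     points = c(10, 20, 30, 40, 50, 60))

# j computes the summaries, by defines the groups;
# .() is shorthand for list() and .N is the number of rows in each group
summary_dt <- scores[, .(mean_points = mean(points), n = .N), by = team]

# view the summary: team A has a mean of 20, team B a mean of 50, 3 rows each
summary_dt
```

Adding further summaries is a matter of adding more named entries inside .().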

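Difference #3: data.table Prints Fewer Rows

When you print a data.frame to the console, R attempts to print every row (up to the max.print limit).

By default, a data.table with more than 100 rows prints only its first five and last five rows, which keeps a large dataset from flooding the console or freezing your session.

For example, the following code creates a data frame and a data.table that each have 200 rows.

When the data.frame is printed, R attempts to print every row, whereas the data.table shows only the first 5 rows and the last 5 rows.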

library(data.table)

#make this example reproducible
set.seed(1)

#create data frame
d_frame <- data.frame(x = rnorm(200),
                      y = rnorm(200),
                      z = rnorm(200))

#view data frame
d_frame

               x           y             z
1   -0.055303118  1.54858564 -2.065337e-02
2    0.354143920  0.36706204 -3.743962e-01
3   -0.999823809 -1.57842544  4.392027e-01
4    2.586214840  0.17383147 -2.081125e+00
5   -1.917692199 -2.11487401  4.073522e-01
6    0.039614766  2.21644236  1.869164e+00
7   -1.942259548  0.81566443  4.740712e-01
8   -0.424913746  1.01081030  4.996065e-01
9   -1.753210825 -0.98893038 -6.290307e-01
10   0.232382655 -1.25229873 -1.324883e+00
11   0.027278832  0.44209325 -3.221920e-01
...

#create data table
d_table <- setDT(d_frame)

#view data table
d_table

               x           y           z
  1: -0.05530312  1.54858564 -0.02065337
  2:  0.35414392  0.36706204 -0.37439617
  3: -0.99982381 -1.57842544  0.43920275
  4:  2.58621484  0.17383147 -2.08112491
  5: -1.91769220 -2.11487401  0.40735218
 ---                                    
196: -0.06196178  1.08164065  0.58609090
197:  0.34160667 -0.01886703  1.61296255
198: -0.38361957 -0.03890329  0.71377217
199: -0.80719743 -0.89674205 -0.49615702
200: -0.26502679 -0.15887435 -1.73781026

This is an advantage of data.table over data.frame, especially when you work with large datasets that you do not want to print to the console by accident.
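
The truncation is controlled by two data.table options, datatable.print.nrows (the row count above which output is shortened) and datatable.print.topn (how many rows to show from each end). The sketch below, using a hypothetical 200-row table named dt, shows how the same settings can be overridden for a single print call.

```r
library(data.table)

# hypothetical table with 200 rows
dt <- data.table(x = 1:200)

# package defaults: tables longer than 100 rows print only the first and last 5
getOption("datatable.print.nrows")
getOption("datatable.print.topn")

# show only the first and last 3 rows, for this call only
print(dt, topn = 3)

# raise the threshold for this call so that all 200 rows are printed
print(dt, nrows = 200)
```

To change the behavior for a whole session instead, the same values can be set with options(), for example options(datatable.print.topn = 3).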

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Append Rows to a Data Frame in R
How to Keep Certain Columns in R
How to Select Only Numeric Columns in R
