Summarize simulated data with duckdb

duckdb with arrow lets you summarize massive data sets with ease.

duckdb
Author

Kyle Baron

Published

May 30, 2026

1 Introduction

This is a quick blog post showing how you can summarize a very large set of simulations using duckdb. You can learn more about duckdb here. The duckdb R package can take a long time to install, but will be worth it once your simulations grow out of control. You can get mrgsim.ds here.

2 Setup

library(mrgsim.ds)
library(DBI)
library(duckdb)
library(mirai)
library(dplyr)
library(arrow)

Set up a large simulation; we’ll check the number of rows in the output once it finishes.

  • 3000 subjects
  • 2 dose levels
  • 28 days of dosing
  • Sample every hour
mod <- modlib_ds("popex")
Building popex ... done.
mod <- update(
  mod, 
  outvars = "IPRED", 
  end = 4*168, 
  delta = 1, 
  add = seq(144, 168, 0.1)
)

data <- expand.ev(
  ID = 1:3000,
  amt = c(300, 1000), 
  ii = 24, 
  addl = 27
)

data <- mutate(data, dose = amt)

post <- read.csv(postfile)

head(data)
  ID time amt ii addl cmt evid dose
1  1    0 300 24   27   1    1  300
2  2    0 300 24   27   1    1  300
3  3    0 300 24   27   1    1  300
4  4    0 300 24   27   1    1  300
5  5    0 300 24   27   1    1  300
6  6    0 300 24   27   1    1  300
head(post)
       TVCL      TVV      TVKA
1 0.7585869 19.13091 0.4037127
2 1.0554858 23.50184 0.4186288
3 1.2168882 23.44632 0.4805488
4 0.5308605 29.96097 0.6916013
5 1.0858249 15.72057 0.5701769
6 1.1012112 18.77178 0.5970549

3 Simulate

Simulate 1000 replicates.

unlink("temp-post", recursive = TRUE)

daemons(7, seed = 3579)

everywhere(library(mrgsim.ds))

out <- mirai_map(

  1:1000,
  
  \(x, mod, data, post) {
    
    mod <- param(mod, post[x,])
    
    mrgsim_ds(mod, data, tags = list(rep = x), recover = "dose")
  
  },

  .args = list(mod = mod, data = data, post = post)

)[.]

out <- reduce_ds(out)

daemons(0)

save_ds(out, "temp-post/duckdb.rds")

There are 5.3 billion rows in this simulation set across 1000 parquet files totaling 41 Gb of data on disk.

sims <- read_ds("temp-post/duckdb.rds")

sims
Model: popex
Dim  : 5.3B x 5
Files: 1000 [41 Gb]
Owner: yes (no gc)
    ID time     IPRED dose rep
1:   1    0  0.000000  300   1
2:   1    0  0.000000  300   1
3:   1    1  5.823433  300   1
4:   1    2  9.051879  300   1
5:   1    3 10.752694  300   1
6:   1    4 11.557268  300   1
7:   1    5 11.838984  300   1
8:   1    6 11.818536  300   1

Note that all your simulated data is now safely stored on disk. If your R session crashes or restarts for whatever reason, you can easily restart and pick up right where you left off without having to re-run the expensive simulation step.

It is also important to note that the data are stored in efficiently-organized / compressed parquet files; it’s only the simulation output object (providing access to that data) that is stored in .rds format. So the .rds file is very light.

fs::file_size("temp-post/duckdb.rds")
21.4K

3.1 Summarize using arrow / duckdb pipeline

Summarize using duckdb; we’ll take median, 5th and 95th percentiles of the Week 4 trough concentration by dose and replicate. We can do this in about 15 seconds on my macbook pro m1 with omp.

To do this, we work in the Apache Arrow R package and convert the arrow data set to a duckdb object first, without copying any data.

library(arrow)

system.time({ 

summ <- 
  sims %>% 
  as_duckdb_ds() %>% 
  filter(time == 672) %>% 
  group_by(time, rep, dose) %>% 
  summarise(
    n = n(), 
    Lower = quantile(IPRED, 0.025, na.rm = TRUE), 
    Median = median(IPRED, na.rm = TRUE), 
    Upper = quantile(IPRED, 0.975, na.rm = TRUE), 
    .groups = "drop"
  ) %>% arrange(
    time, 
    rep, 
    dose
  ) %>% collect()

})
   user  system elapsed 
 79.623  12.353  14.244 
summ
# A tibble: 2,000 × 7
    time   rep  dose     n  Lower Median Upper
   <dbl> <int> <dbl> <dbl>  <dbl>  <dbl> <dbl>
 1   672     1   300  3000  1.36   10.9   43.5
 2   672     1  1000  3000  4.34   38.9  149. 
 3   672     2   300  3000  0.792   7.37  28.6
 4   672     2  1000  3000  2.41   25.2   99.8
 5   672     3   300  3000  0.510   5.98  25.5
 6   672     3  1000  3000  1.33   19.2   81.8
 7   672     4   300  3000  4.16   19.5   63.8
 8   672     4  1000  3000 13.8    64.1  207. 
 9   672     5   300  3000  0.260   5.44  28.3
10   672     5  1000  3000  0.567  17.3   89.2
# ℹ 1,990 more rows

Another summary: generate a 95% confidence interval around Cmax after the seventh dose.

system.time({ 

summ2 <- 
  sims %>% 
  as_duckdb_ds() %>% 
  filter(time >= 144 & time <= 168) %>%
  summarise(
    Cmax = max(IPRED, na.rm = TRUE), 
    .by = c(ID, rep, dose)
  ) %>% 
  compute() %>% 
  summarise(
    n = n(),
    Median = median(Cmax, na.rm = TRUE), 
    .by = c(rep, dose)
  ) %>% 
  arrange(
    rep, 
    dose
  ) %>% collect()
  
})
   user  system elapsed 
143.898  14.733  24.781 
summ2
# A tibble: 2,000 × 4
     rep  dose     n Median
   <int> <dbl> <dbl>  <dbl>
 1     1   300  3000   20.8
 2     1  1000  3000   71.4
 3     2   300  3000   15.6
 4     2  1000  3000   52.5
 5     3   300  3000   14.6
 6     3  1000  3000   47.8
 7     4   300  3000   24.7
 8     4  1000  3000   82.8
 9     5   300  3000   18.6
10     5  1000  3000   62.8
# ℹ 1,990 more rows

There appears to be substantial uncertainty in the median Cmax.

summ2 %>% 
  summarise(
    n = n(), 
    Lower = quantile(Median, 0.025), 
    Med = median(Median),
    Upper = quantile(Median, 0.975), 
    .by = dose
  )
# A tibble: 2 × 5
   dose     n Lower   Med Upper
  <dbl> <int> <dbl> <dbl> <dbl>
1   300  1000  12.7  16.9  24.7
2  1000  1000  42.6  56.0  81.8

4 Summarize using duckdb

This summary uses straight duckdb, writing the summary specification in SQL code. I’ve found some applications where this is slightly faster than the arrow / duckdb pipeline; in this example, the timing is identical.

con <- dbConnect(duckdb())

ds <- as_arrow_ds(sims)

duckdb_register_arrow(con, "sims", ds)

system.time(
  
  summ3 <- dbGetQuery(con, "
    SELECT
      dose,
      time,
      rep, 
      COUNT(*)                     AS n,
      QUANTILE_CONT(IPRED, 0.025)  AS Lower,
      QUANTILE_CONT(IPRED, 0.5)    AS Median,
      QUANTILE_CONT(IPRED, 0.975)  AS Upper
    FROM sims
    WHERE time = 672
    GROUP BY time, rep, dose
    ORDER BY time, rep, dose
  ")

)

dbDisconnect(con, shutdown = TRUE)

as_tibble(summ3) %>% arrange(time, rep, dose)