library(mrgsim.ds)
library(DBI)
library(duckdb)
library(mirai)
library(dplyr)
library(arrow)1 Introduction
This is a quick blog post showing how you can summarize a very large set of simulations using duckdb. You can learn more about duckdb here. The duckdb R package can take a long time to install, but will be worth it once your simulations grow out of control. You can get mrgsim.ds here.
2 Setup
Set up a large simulation; we’ll check the number of rows in the output once it finishes.
- 3000 subjects
- 2 dose levels
- 28 days of dosing
- Sample every hour
mod <- modlib_ds("popex")Building popex ... done.
mod <- update(
mod,
outvars = "IPRED",
end = 4*168,
delta = 1,
add = seq(144, 168, 0.1)
)
data <- expand.ev(
ID = 1:3000,
amt = c(300, 1000),
ii = 24,
addl = 27
)
data <- mutate(data, dose = amt)
post <- read.csv(postfile)
head(data) ID time amt ii addl cmt evid dose
1 1 0 300 24 27 1 1 300
2 2 0 300 24 27 1 1 300
3 3 0 300 24 27 1 1 300
4 4 0 300 24 27 1 1 300
5 5 0 300 24 27 1 1 300
6 6 0 300 24 27 1 1 300
head(post) TVCL TVV TVKA
1 0.7585869 19.13091 0.4037127
2 1.0554858 23.50184 0.4186288
3 1.2168882 23.44632 0.4805488
4 0.5308605 29.96097 0.6916013
5 1.0858249 15.72057 0.5701769
6 1.1012112 18.77178 0.5970549
3 Simulate
Simulate 1000 replicates.
unlink("temp-post", recursive = TRUE)
daemons(7, seed = 3579)
everywhere(library(mrgsim.ds))
out <- mirai_map(
1:1000,
\(x, mod, data, post) {
mod <- param(mod, post[x,])
mrgsim_ds(mod, data, tags = list(rep = x), recover = "dose")
},
.args = list(mod = mod, data = data, post = post)
)[.]
out <- reduce_ds(out)
daemons(0)
save_ds(out, "temp-post/duckdb.rds")There are 5.3 billion rows in this simulation set across 1000 parquet files totaling 41 Gb of data on disk.
sims <- read_ds("temp-post/duckdb.rds")
simsModel: popex
Dim : 5.3B x 5
Files: 1000 [41 Gb]
Owner: yes (no gc)
ID time IPRED dose rep
1: 1 0 0.000000 300 1
2: 1 0 0.000000 300 1
3: 1 1 5.823433 300 1
4: 1 2 9.051879 300 1
5: 1 3 10.752694 300 1
6: 1 4 11.557268 300 1
7: 1 5 11.838984 300 1
8: 1 6 11.818536 300 1
Note that all your simulated data is now safely stored on disk. If your R session crashes or restarts for whatever reason, you can easily restart and pick up right where you left off without having to re-run the expensive simulation step.
It is also important to note that the data are stored in efficiently-organized / compressed parquet files; it’s only the simulation output object (providing access to that data) that is stored in .rds format. So the .rds file is very light.
fs::file_size("temp-post/duckdb.rds")21.4K
3.1 Summarize using arrow / duckdb pipeline
Summarize using duckdb; we’ll take median, 5th and 95th percentiles of the Week 4 trough concentration by dose and replicate. We can do this in about 15 seconds on my macbook pro m1 with omp.
To do this, we work in the Apache Arrow R package and convert the arrow data set to a duckdb object first, without copying any data.
library(arrow)
system.time({
summ <-
sims %>%
as_duckdb_ds() %>%
filter(time == 672) %>%
group_by(time, rep, dose) %>%
summarise(
n = n(),
Lower = quantile(IPRED, 0.025, na.rm = TRUE),
Median = median(IPRED, na.rm = TRUE),
Upper = quantile(IPRED, 0.975, na.rm = TRUE),
.groups = "drop"
) %>% arrange(
time,
rep,
dose
) %>% collect()
}) user system elapsed
79.623 12.353 14.244
summ# A tibble: 2,000 × 7
time rep dose n Lower Median Upper
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 672 1 300 3000 1.36 10.9 43.5
2 672 1 1000 3000 4.34 38.9 149.
3 672 2 300 3000 0.792 7.37 28.6
4 672 2 1000 3000 2.41 25.2 99.8
5 672 3 300 3000 0.510 5.98 25.5
6 672 3 1000 3000 1.33 19.2 81.8
7 672 4 300 3000 4.16 19.5 63.8
8 672 4 1000 3000 13.8 64.1 207.
9 672 5 300 3000 0.260 5.44 28.3
10 672 5 1000 3000 0.567 17.3 89.2
# ℹ 1,990 more rows
Another summary: generate a 95% confidence interval around Cmax after the seventh dose.
system.time({
summ2 <-
sims %>%
as_duckdb_ds() %>%
filter(time >= 144 & time <= 168) %>%
summarise(
Cmax = max(IPRED, na.rm = TRUE),
.by = c(ID, rep, dose)
) %>%
compute() %>%
summarise(
n = n(),
Median = median(Cmax, na.rm = TRUE),
.by = c(rep, dose)
) %>%
arrange(
rep,
dose
) %>% collect()
}) user system elapsed
143.898 14.733 24.781
summ2# A tibble: 2,000 × 4
rep dose n Median
<int> <dbl> <dbl> <dbl>
1 1 300 3000 20.8
2 1 1000 3000 71.4
3 2 300 3000 15.6
4 2 1000 3000 52.5
5 3 300 3000 14.6
6 3 1000 3000 47.8
7 4 300 3000 24.7
8 4 1000 3000 82.8
9 5 300 3000 18.6
10 5 1000 3000 62.8
# ℹ 1,990 more rows
There appears to be substantial uncertainty in the median Cmax.
summ2 %>%
summarise(
n = n(),
Lower = quantile(Median, 0.025),
Med = median(Median),
Upper = quantile(Median, 0.975),
.by = dose
)# A tibble: 2 × 5
dose n Lower Med Upper
<dbl> <int> <dbl> <dbl> <dbl>
1 300 1000 12.7 16.9 24.7
2 1000 1000 42.6 56.0 81.8
4 Summarize using duckdb
This summary uses straight duckdb, writing the summary specification in SQL code. I’ve found some applications where this is slightly faster than the arrow / duckdb pipeline; in this example, the timing is identical.
con <- dbConnect(duckdb())
ds <- as_arrow_ds(sims)
duckdb_register_arrow(con, "sims", ds)
system.time(
summ3 <- dbGetQuery(con, "
SELECT
dose,
time,
rep,
COUNT(*) AS n,
QUANTILE_CONT(IPRED, 0.025) AS Lower,
QUANTILE_CONT(IPRED, 0.5) AS Median,
QUANTILE_CONT(IPRED, 0.975) AS Upper
FROM sims
WHERE time = 672
GROUP BY time, rep, dose
ORDER BY time, rep, dose
")
)
dbDisconnect(con, shutdown = TRUE)
as_tibble(summ3) %>% arrange(time, rep, dose)