Parallelize 'pbapply' functions
Henrik Bengtsson

The futurize package allows you to easily turn
sequential code into parallel code by piping the sequential code to the
futurize() function. Easy!
Introduction
This vignette demonstrates how to use this approach to parallelize
pbapply
functions such as pblapply(), pbsapply(), and
pbvapply().
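As a quick orientation before parallelizing anything, here is a minimal sequential sketch of the two simplifying variants named above. The input vector is made up for illustration; the calls assume only that pbsapply() and pbvapply() mirror the argument order of sapply() and vapply().

```r
library(pbapply)

xs <- c(1, 4, 9, 16)  # illustrative input, not from the vignette

## pbsapply() simplifies the result, here to a numeric vector
ys_s <- pbsapply(xs, sqrt)

## pbvapply() additionally checks each result against a template value
ys_v <- pbvapply(xs, sqrt, FUN.VALUE = numeric(1))

## Both should agree with their base-R counterparts
stopifnot(
  identical(ys_s, sapply(xs, sqrt)),
  identical(ys_v, vapply(xs, sqrt, FUN.VALUE = numeric(1)))
)
```

The only visible difference from sapply() and vapply() is the progress bar, which pbapply draws in interactive sessions.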
The pbapply
package provides progress-bar versions of the base-R
*apply() family of functions. It supports parallel
processing via the cl argument, which accepts a PSOCK
cluster object or, when used with futurize, the string
"future".
Example: Parallel lapply with progress bar
The pblapply() function works like lapply()
but displays a progress bar. For example:
library(pbapply)
slow_fcn <- function(x) {
  Sys.sleep(0.1) # emulate work
  sqrt(x)
}
## Apply a function to each element with a progress bar
xs <- 1:100
ys <- pblapply(xs, slow_fcn)

Here pblapply() evaluates sequentially, but we can
easily make it evaluate in parallel by piping to
futurize():
library(pbapply)
library(futurize)
plan(multisession) ## parallelize on local machine
xs <- 1:100
ys <- pblapply(xs, slow_fcn) |> futurize()

Comment: Output produced inside the function, e.g. by a call such as
message("x = ", x), is not relayed to the main R session by design.
If it were, it would clutter the progress bar that pbapply renders,
which is the whole purpose of using pbapply in the first place.
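To check that piping to futurize() changes only how the work is scheduled, not what is computed, the two variants can be run side by side. The following is a minimal sketch rather than a benchmark: the worker count and the shorter input are arbitrary choices, and library(future) is attached explicitly only to make the origin of plan() obvious.

```r
library(pbapply)
library(futurize)
library(future)  # attached explicitly for plan(); an assumption for clarity

plan(multisession, workers = 2)  # two local workers, chosen arbitrarily

slow_fcn <- function(x) {
  Sys.sleep(0.1) # emulate work
  sqrt(x)
}
xs <- 1:20

## Sequential run
t0 <- Sys.time()
ys_seq <- pblapply(xs, slow_fcn)
t_seq <- difftime(Sys.time(), t0, units = "secs")

## Same call, piped to futurize()
t0 <- Sys.time()
ys_par <- pblapply(xs, slow_fcn) |> futurize()
t_par <- difftime(Sys.time(), t0, units = "secs")

## The results should not depend on how they were computed
stopifnot(identical(ys_seq, ys_par))

plan(sequential)  # shut down the background workers
```

Comparing t_seq and t_par gives a rough idea of the speedup on a given machine; the first parallel call also includes the time needed to launch the workers.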
The built-in multisession backend parallelizes on your
local computer and works on all operating systems. There are other parallel
backends to choose from, including alternatives to parallelize
locally as well as distributed across remote machines, e.g.
plan(future.mirai::mirai_multisession)
and
plan(future.batchtools::batchtools_slurm)

Supported Functions
The following pbapply functions are supported by
futurize():
Without futurize: Manual PSOCK cluster setup
For comparison, here is what it takes to parallelize
pblapply() using the parallel package
directly, without futurize:
library(pbapply)
library(parallel)
## Set up a PSOCK cluster
ncpus <- 4L
cl <- makeCluster(ncpus)
## Run pblapply in parallel
xs <- 1:100
ys <- pblapply(xs, slow_fcn, cl = cl)
## Tear down the cluster
stopCluster(cl)

This requires you to manually create and manage the cluster
lifecycle. If you forget to call stopCluster(), or if your
code errors out before reaching it, you leak background R processes. You
also have to decide upfront how many CPUs to use and what cluster type
to use. Switching to another parallel backend, e.g. a Slurm cluster,
would require a completely different setup. With
futurize, all of this is handled for you: just pipe to
futurize() and control the backend with
plan().
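If you do need to stay with a manually managed cluster, the leak described above can be guarded against by registering the teardown with on.exit() right after the cluster is created. The sketch below assumes a small wrapper function; the name pblapply_on_cluster() and the default of two workers are hypothetical and not part of pbapply or parallel.

```r
library(pbapply)
library(parallel)

## Hypothetical helper: runs pblapply() on a temporary PSOCK cluster
pblapply_on_cluster <- function(xs, fcn, ncpus = 2L) {
  cl <- makeCluster(ncpus)
  ## Registered immediately, so the workers are stopped even if
  ## pblapply() below signals an error
  on.exit(stopCluster(cl), add = TRUE)
  pblapply(xs, fcn, cl = cl)
}

ys <- pblapply_on_cluster(1:8, sqrt)
```

This removes the leak, but the number of workers and the cluster type are still fixed inside the code, which is the part that plan() lets you change from the outside.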
Progress Reporting via progressr
An alternative to using pbapply for progress
reporting is to use the progressr
package, which is specially designed to work with the Futureverse
ecosystem and provide progress updates from parallelized computations in
a near-live fashion. See
vignette("futurize-11-apply", package = "futurize") for
more details.