FAQ

Important things to be aware of

  • global data can be exported to the worker nodes via the export parameter as a list
  • packages needed on the workers can equally be exported as a character vector via pkgs
  • additional values in the respective clustermq template can be set via the template option as a list, e.g. template = list(memory = 1024, cores = 1)
  • worker log files can be switched on by setting log_worker = TRUE
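Taken together, these options can be passed directly to Q. A minimal sketch of such a call (the helper function square, the variable offset, and the resource values are illustrative assumptions, not part of clustermq itself):

```r
library(clustermq)  # assumes a scheduler is already configured, e.g. via clustermq.scheduler

offset <- 10                # global data the workers need
square <- function(x) x^2   # helper function the workers need

res <- Q(function(x) square(x) + offset, x = 1:5,
         export = list(offset = offset, square = square),  # global data as a list
         pkgs = c("stats"),                                # packages as a character vector
         template = list(memory = 1024, cores = 1),        # template values as a list
         log_worker = TRUE,                                # one log file per worker
         n_jobs = 2)
```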

How does this all fit into the RStudio IDE?

Using clustermq and its Q function opens up possibilities for both the RStudio IDE Open Source and professional editions.

Irrespective of whether you use the Open Source or the professional product, you can always run the Q function from the R console. It will then use the HPC environment your R session is running on and spawn the required jobs.

Additionally, you can use the “Background Jobs” feature in Open Source or “Workbench Jobs” in the professional version to farm out a possibly long-running Q function into a non-interactive job. Please note that the resources required for such a non-interactive job are minimal (1 core), as it only runs the master process; all workers are spawned by this master process as separate jobs.

RStudio Workbench jobs & SLURM

Using the Workbench Jobs feature can be a bit cumbersome if the main clustermq process runs in such a Workbench job. This is because the PATH environment variable is not set to include the location of the SLURM binaries. As a workaround, we recommend adding the following lines to the .Rprofile in your home directory (or asking your IT admin to add them to Rprofile.site within your R installation).

# set the SLURM binaries on PATH so that RSW Launcher jobs work
slurm_bin_path <- "/opt/slurm/bin"

curr_path <- strsplit(Sys.getenv("PATH"), ":")[[1]]

if (!(slurm_bin_path %in% curr_path)) {
  if (length(curr_path) == 0) {
    Sys.setenv(PATH = slurm_bin_path)
  } else {
    Sys.setenv(PATH = paste0(Sys.getenv("PATH"), ":", slurm_bin_path))
  }
}

What happens if I don’t have an HPC cluster available to run my clustermq-based code?

This is no problem at all: simply remove the clustermq.template option and clustermq will fall back to local execution without any further code changes. You can also still parallelize locally by setting clustermq.scheduler to multicore or multiprocess.
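For example, a sketch of running the same Q call with local parallelism instead of an HPC scheduler (the squaring function is just an illustration):

```r
library(clustermq)

# no HPC scheduler available: parallelize on the local machine instead
options(clustermq.scheduler = "multiprocess")  # or "multicore" (forked processes, not on Windows)

res <- Q(function(x) x^2, x = 1:10, n_jobs = 2)
```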

What happens if the HPC cluster I would like to work on has no RStudio installation?

Not all hope is lost in this case either: clustermq also supports an ssh connector, which allows you to run your R code on any remote host you can log in to via ssh. If you set up a passwordless ssh connection to the login node of your HPC cluster, for example, you can set

options(
    clustermq.scheduler = "ssh",
    clustermq.ssh.host = "user@hpclogin", # use your user and login node
    clustermq.ssh.log = "~/cmq_ssh.log" # log for easier debugging
)

Depending on the overall setup of RStudio Server and the HPC cluster (e.g. R versions and installation directories, home-directory location), you may need to tweak the provided default ssh template. Note: clustermq will use the ssh connection and, once on the HPC cluster, will detect and use the appropriate scheduler for submitting jobs.

Measuring code execution time in R

For the purposes of this document, we use the microbenchmark package. It executes selected code chunks a number of times to average over typical OS jitter. An example would be

library(microbenchmark)
func <- function(x) x * x
microbenchmark(func(10))
Unit: nanoseconds
     expr min  lq     mean median   uq     max neval
 func(10) 885 893 16892.86    901 1006 1576467   100

Note: By default the expression is evaluated 100 times. This can be changed by adding the optional parameter times = X, where X is the desired number of evaluations.
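For instance, a sketch reducing the number of evaluations for the same func as above:

```r
library(microbenchmark)

func <- function(x) x * x

# evaluate the call only 10 times instead of the default 100
microbenchmark(func(10), times = 10)
```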