Same Job, Different Seed

Often it is desirable to submit the same job to a computing cluster, changing only a few variables each time. For example, you may wish to run a simulation in which random variables are generated from the same distribution, but each time with a different seed. Here we will generate 100 observations from an exponential distribution under each of 100 different seeds to illustrate how to submit such jobs. All of the files from this post can be downloaded from my GitHub.

First, move to the directory on the cluster that contains the files for the job. You can then submit the jobs with the following script:

#!/bin/bash
for seed in {1..100}
do
  qsub -cwd batch.sh $seed
done

Here we have 100 seeds; in this example I am just using the numbers 1 through 100. We use a for loop that passes a different seed when submitting each job, indicated by the $seed at the end of the qsub command. If you want to pass n variables to your job, simply write $var1 $var2 … $varn at the end of the qsub command. The ‘-cwd’ flag tells the scheduler to run the job, and place its output, in the current working directory. Each job is submitted through the file batch.sh. The batch.sh file reads:

#!/bin/bash
Rscript random_exponential.R $1

Rscript is a scripting front end that runs R non-interactively from the shell. It is followed by the name of the R file we wish to run. The $1 at the end passes on the first argument that was supplied to batch.sh (here, the seed). If you wish to pass n variables to your R job, put $1 $2 … $n after the name of the file.
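To see how more than one variable flows through, here is a sketch of the submission loop with two variables, a seed and a hypothetical sample size n (qsub is replaced by echo so the sketch can be run anywhere; on the cluster you would drop the echo):

```shell
#!/bin/bash
# sketch: one job per (seed, sample size) pair
# batch.sh would then call: Rscript random_exponential.R $1 $2
for seed in {1..3}
do
  for n in 100 1000
  do
    echo qsub -cwd batch.sh $seed $n
  done
done
```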

Now, let's look at the file random_exponential.R:

##read the seed from the command line (arguments arrive as character strings)
seed <- as.numeric(commandArgs(TRUE))
##set the seed
set.seed(seed)
##generate 100 exponential random variables with rate = 1
random.exp <- rexp(100, rate = 1)
##take the mean of the random variables
mean.exp <- mean(random.exp)
##save the mean
save(mean.exp, file = paste0(seed, 'output.Rdata'))

The command commandArgs(TRUE) returns the trailing command-line arguments supplied when the R session was invoked. These arrive as a character vector, so the value must be converted with as.numeric() before it can be passed to set.seed(). Note that if you have passed n variables to your R job, the result will be a vector of length n. The remainder of the R code generates the 100 exponential random variables, calculates their mean, and saves the mean. Notice that in the last line the seed is used in the name of the output file, which will be extremely useful in the next step.
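With more than one variable it is cleaner to index the argument vector explicitly. A minimal sketch, in which the second argument (a sample size n) is hypothetical and the defaults exist only so the snippet also runs interactively:

```r
##read all trailing command-line arguments (a character vector)
args <- commandArgs(TRUE)
##fall back to defaults when there are no arguments (interactive use)
if (length(args) < 2) args <- c('1', '100')
seed <- as.numeric(args[1])   ##first argument: the seed
n    <- as.numeric(args[2])   ##second (hypothetical) argument: sample size
set.seed(seed)
random.exp <- rexp(n, rate = 1)
```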

To combine the means, open an R session on the cluster and run the following code from the combine.R file:

##set the working directory to the location of the output of random_exponential.R
setwd('/path/to/output')
##create a vector to collect the means
means <- c()
##load in the mean from each seed
for (i in 1:100){
  load(paste0(i, 'output.Rdata'))
  means[i] <- mean.exp
}
##plot a histogram of the means
pdf('Hist.pdf')
hist(means, main = 'Histogram of Means')
dev.off()

The file combine.R produces this plot (an illustration of the Central Limit Theorem):

[Hist.pdf: histogram of the 100 means]

Acknowledgments:  Sincerest thanks to Taki Shinohara who taught me how to do this!  And also many thanks to David Lenis and Jean-Philippe Fortin for the useful feedback on this post.


6 thoughts on “Same Job, Different Seed”

  1. jongellar

    An alternative is to submit the job as an “array job”:

    qsub -t 1-100 script.sh

    this will submit script.sh 100 times, each with a different “task ID” from 1 to 100. script.sh will call an R script, and the R script should have a line such as:

    runID <- as.numeric(Sys.getenv("SGE_TASK_ID"))
    set.seed(runID)

    The effect is basically the same thing as what you have above, except that the "work" of submitting the 100 jobs is done by SGE instead of by the shell script. The one advantage of submitting as an array job is that SGE considers it one job, and will assign it a single ID (with different task IDs, which is like a "sub-ID"). This way if you have to, for example, kill the whole thing, you can do it with a single qdel statement using that one ID. When you run qstat, it lists the jobs as jobID:taskID. It is just a little more organized.

    Instead of just setting seeds, I've also used array jobs to set up different parameters for a simulation. For example, you can use -t 1:24, and then build in a bunch of logic at the start of your R script to determine which of the 24 scenarios you are in. This usually involves a bunch of modulus operators (%%) in some arithmetic.
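The modulus arithmetic the commenter describes might look like the following; the 3 × 4 × 2 = 24 grid of parameter names and values is made up purely for illustration:

```r
##task.id would come from the scheduler, e.g.
##  task.id <- as.numeric(Sys.getenv("SGE_TASK_ID"))
task.id <- 17                        ##fixed here for illustration
n.vals    <- c(50, 100, 500)         ##3 sample sizes
rate.vals <- c(0.5, 1, 2, 5)         ##4 rates
block     <- 1:2                     ##2 replicate blocks
idx  <- task.id - 1                  ##0-based index into the grid
n    <- n.vals[idx %% 3 + 1]
rate <- rate.vals[(idx %/% 3) %% 4 + 1]
b    <- block[idx %/% 12 + 1]
```

An alternative that avoids the arithmetic is to build the grid once with expand.grid() and take row task.id.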

  2. Kasper Hansen

    I second the comment on array jobs.

    More importantly, you should be aware that there is NO guarantee that random streams started with different seeds are independent, although many people don’t realize this. So basically your final “mean” vector (used for plotting) cannot be assumed to contain independent samples. The way to solve this is either 1) generate all the random numbers in one job or 2) use a parallel random number generator which ensures that only one random number stream is used and that this stream is properly synchronized across threads. This second option really means you need to have one master R process and then spawn sub jobs.
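The second option described above can be sketched with base R's parallel package, whose L'Ecuyer-CMRG generator provides streams designed to be independent; this sketch is not from the original post:

```r
library(parallel)
##switch to the L'Ecuyer-CMRG generator and seed the master stream
RNGkind("L'Ecuyer-CMRG")
set.seed(1)
##derive one substream per job from the master stream
n.jobs <- 100
streams <- vector("list", n.jobs)
s <- .Random.seed
for (i in 1:n.jobs) {
  streams[[i]] <- s
  s <- nextRNGStream(s)
}
##job i would then restore its own stream before simulating:
assign(".Random.seed", streams[[1]], envir = .GlobalEnv)
x <- rexp(100, rate = 1)
```

In practice the master process would save the list of streams (or pass stream i to job i), so that each job draws from its own stream rather than from an arbitrarily seeded one.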

    • elizabethmargaretsweeney

      Good to know! I am clearly one of the many people who did not realize this. I have always assumed that if you set different seeds the streams of random numbers from each seed would be independent.

      Perhaps this post would be better cast as a skeleton for creating embarrassingly parallel jobs on the cluster, and the choice of example is not the greatest because of what you point out above.
