Bioinformatics Playground


Tracking progress in R

Admit it or not, we human beings become anxious and impatient when it comes to waiting, especially when we are blindfolded; that is, when we have no idea how long the wait will last. As Brad A. Myers, arguably the designer of the progress bar in the 1980s, pointed out, being able to track progress while waiting can significantly improve the user experience (Myers, 1985).

A typical progress bar by Simeon87, via Wikimedia Commons.

As an R programmer in bioinformatics research, my code is often not designed for the general public, but it is still important to keep my users, namely my colleagues and fellow researchers, as happy as possible. However, tracking progress in R can be tricky. In this article, I am going to present some common approaches along with my own solution, pbmcapply.

Print

The easiest way to track progress in R is to periodically print the percentage of completion to the output (the screen by default), or to write it to a log file somewhere on disk. Needless to say, this is probably the least elegant way to solve the problem, but many people still follow this path today.
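For instance, a hand-rolled progress printer might look like this (a minimal sketch; the element count n and the Sys.sleep() call are placeholders for your actual workload):

```r
n <- 100
for (i in seq_len(n)) {
  # ... process element i here; Sys.sleep() stands in for real work ...
  Sys.sleep(0.01)
  # "\r" returns the cursor to the start of the line, so the
  # percentage is overwritten in place instead of flooding the screen
  cat(sprintf("\rCompleted: %3d%%", round(100 * i / n)))
}
cat("\n")
```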

Pbapply

A better (and still easy) solution is to adopt a package named pbapply. According to its development page, the package has been very popular, with about 90k downloads. It is also easy to use: whenever you are about to call an apply function, use the pbapply version of it instead. For example:

# Load the pbapply package
library(pbapply)
# Some numbers we are going to work with
nums <- 1:10
# Call sapply to get the square root of these numbers
roots <- sapply(nums, sqrt)
# Now let's track the progress using the pbapply package
roots <- pbsapply(nums, sqrt)

While the numbers are processed, a progress bar will be printed to the output and refreshed repeatedly.

A progress bar generated by pbapply. The user gets an estimate of the remaining time together with the current progress in the form of a progress bar.

Although pbapply is a great tool that I use frequently, until recently it could not track the progress of the parallel version of lapply, mclapply. In September, the author of pbapply updated his package with support for snow-type clusters and multicore-type forking. However, his approach relies on splitting the elements into chunks and applying mclapply to them sequentially. One caveat of this approach is that if the number of elements is significantly higher than the number of cores, many mclapply calls will be executed. mclapply, which is built on the fork() system call in Unix/Linux, is expensive: forking into many child processes is time-consuming and creates memory overhead.
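To make the caveat concrete, the chunked strategy can be sketched roughly as follows (a simplified illustration with my own function name, not pbapply's actual implementation):

```r
library(parallel)

# Split the input into chunks, run mclapply on each chunk in turn,
# and report progress after every chunk. Note that every iteration
# forks a fresh set of child processes -- the source of the overhead.
chunked_mclapply <- function(X, FUN, n_chunks = 10, mc.cores = 2) {
  chunks <- split(X, cut(seq_along(X), n_chunks, labels = FALSE))
  results <- vector("list", length(chunks))
  for (i in seq_along(chunks)) {
    results[[i]] <- mclapply(chunks[[i]], FUN, mc.cores = mc.cores)
    cat(sprintf("\rProgress: %d%%", round(100 * i / length(chunks))))
  }
  cat("\n")
  unlist(results, recursive = FALSE)
}
```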

Notice that pbapply generates lots of child processes while pbmcapply reuses them as much as possible. Four cores were allocated for pbapply/pbmcapply. The R code used for the benchmark can be downloaded here.

Pbmcapply

Pbmcapply is my own solution to address this problem. Available as a CRAN package, it can be easily incorporated into your code:

# Install pbmcapply
install.packages("pbmcapply")  
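Once installed, the package is used just like the parallel apply functions; pbmclapply is the progress-tracking counterpart of mclapply:

```r
library(pbmcapply)

# Square a list of numbers on two cores, with a progress bar
# (Sys.sleep() simulates a slow per-element computation)
results <- pbmclapply(1:100, function(x) {
  Sys.sleep(0.01)
  x^2
}, mc.cores = 2)
```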

As you might have guessed from its name, pbmcapply was inspired by the pbapply package. Unlike pbapply, however, my solution does not rely on executing multiple mclapply calls. Instead, pbmcapply takes advantage of a package named future.

The flowchart of pbmcapply. FutureCall() is executed in a separate process, which is then forked into the specified number of child processes. The child processes frequently report their progress to the progressMonitor. Upon receiving an update, the progressMonitor prints the progress to the standard output.

In computer science, a future refers to an object that acts as a placeholder for a value that will become available later. It allows a program to launch some code as a future and, without waiting for it to return, proceed to the next step. In pbmcapply, mclapply is wrapped in a future. The future periodically updates the main program with its progress, and the main program maintains a progress bar to display the updates.
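The idea can be illustrated with the future package directly (a minimal sketch, unrelated to pbmcapply's internals; the multisession backend runs the future in a background R session):

```r
library(future)
plan(multisession)  # evaluate futures in a separate background R session

# Launching the future returns immediately; the slow code runs elsewhere
f <- future({
  Sys.sleep(2)
  42
})

# Meanwhile, the main process is free to do other things,
# such as polling the future and updating a progress display
while (!resolved(f)) {
  cat(".")
  Sys.sleep(0.5)
}

# value() blocks until the result is ready, then returns it
value(f)  # 42
```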

Because the overhead of pbmcapply is minimal and does not grow with the number of elements, a dramatic increase in performance is seen when the number of elements to iterate over is significantly larger than the number of CPU cores. The single-threaded and multi-threaded apply functions from base R are used as references. Even with pbmcapply, performance is still affected by the time required to set up the monitor process.

Performance comparison between pbapply and pbmcapply. The R code used for the benchmark can be downloaded from here. The left panel shows the overhead incurred when each package is called; the right panel shows the time elapsed before the call returns.

Everything comes at a price: while enjoying the convenience of interactive progress tracking, keep in mind that it slightly slows down your program.

Conclusion

As always, one size does not fit all. If performance is your top priority (e.g. when running a program on a cluster), plain printing might be the better way to track progress. On the other hand, if letting the program run for an extra second sounds reasonable, you are more than welcome to check out either my solution (pbmcapply) or pbapply for a more user-friendly approach.


Reference
  1. Myers, B. A. (1985). The importance of percent-done progress indicators for computer-human interfaces. In ACM SIGCHI Bulletin (Vol. 16, No. 4, pp. 11–17). ACM.
