Updated 2017-09-28 David Trudgian
CellProfiler from the Broad Institute is a powerful application for processing biological images in a pipeline. It offers an easy-to-use GUI environment to create and run pipelines against any number of images.
The latest version of CellProfiler installed on BioHPC can be run by using the following commands in an interactive GUI session, or on a workstation/thin-client.
module add cellprofiler
cellprofiler
In addition to the base CellProfiler application, CellProfiler Analyst is available. This tool supports interactive exploration and analysis of data. To start CellProfiler Analyst use the following commands in an interactive GUI session, or on a workstation/thin-client:
module add cellprofiler-analyst
CellProfiler-Analyst.py
If you are performing analysis with CellProfiler, the easiest way to run your project is to use the GUI within a webGUI session started from the BioHPC User Portal.
The CellProfiler GUI will automatically use all CPU cores available on the machine when an analysis is run.
A webGUI session is limited to 20 hours. If you sometimes need longer than 20 hours to complete your analysis please email biohpc-help@utsouthwestern.edu to request an extension for your webGUI session. We are happy to grant extensions for valid reasons, as long as sufficient notice is given.
If you find that your analysis takes a long time to complete, you may want to read on to learn about running batched analyses through a cluster job.
CellProfiler has a batch processing feature that can be used to run jobs on a cluster independent of the graphical interface. It is flexible enough to support running large analysis projects across multiple nodes in the Nucleus cluster, to speed up time to completion. However, it is not straightforward to run a batch analysis.
There are 5 basic steps to performing a batch analysis:

1. Convert your project to a batch analysis and generate the batch data file from the GUI.
2. Test a small batch from the command line.
3. Write a SLURM batch script that splits your image sets into batches across nodes.
4. Submit the script to the cluster and monitor the job.
5. Combine the per-batch output once the job completes.
CellProfiler's batch mode depends on a batch file that contains information about what processing is needed for each image set. This file must be generated through the CellProfiler GUI.
You can convert a project to a batch analysis by adding the CreateBatchFiles module to the end of your pipeline.
The CreateBatchFiles step has some default settings that need to be modified, such as where the batch file will be stored, and the local-to-cluster path mappings if paths differ between your workstation and the cluster.
Once the CreateBatchFiles module is configured at the end of your pipeline, create the batch data file by starting an analysis from the GUI (the Analyze Images button).
CellProfiler will now process the list of files and the pipeline to compute the information needed to set up and run batch processing. When complete, a file named Batch_data.h5 will be created in the pipeline output directory.
Before writing a batch script that will process all of your image sets across multiple nodes, it's important to understand a little about how CellProfiler analyzes a batch of images in a non-interactive manner.
In a terminal session, cd to the location of your pipeline file and the Batch_data.h5 file that you created above.
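For example, assuming the project lives under /project/mylab/cellprofiler_expt (a hypothetical path; substitute your own):
cd /project/mylab/cellprofiler_expt
# Confirm the batch file is present
ls -lh Batch_data.h5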
We can then run a test batch analysis, on a batch consisting of only the first 10 input image sets (for speed):
module add cellprofiler/2.2.0-20160712
# -p batch file location
# -c no GUI
# -r run the pipeline on startup
# -b do not build extensions
# -f first image in batch
# -l last image in batch
# -o output directory for the batch
cellprofiler -p Batch_data.h5 -c -r -b -f 1 -l 10 -o batch1.out
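If the test batch completes without errors, its results should appear in the directory given to -o:
# Inspect the output of the test batch
ls batch1.out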
Note here that the -f, -l, and -o options are the most important, and need to be changed for each batch when we run our full analysis:

- -f specifies the index (starting at 1) of the first image set we will process in this batch.
- -l specifies the index of the last image set we will process in this batch.
- -o specifies the output directory for any files created from this batch.

*Note* it's strongly advised to use the -o option to set a different output directory for each batch you run in parallel. When you are running programs across multiple nodes it is easy to have delays or freezes caused by locking when multiple nodes try to create, modify, and delete files in the same directory. Writing output to batch-specific directories prevents any problems with file locking across multiple nodes.
An important thing to note is that when run in batch mode, CellProfiler only uses one CPU core for processing, instead of all available cores as when running a complete analysis from the GUI. To use all CPU cores on multiple nodes we therefore need to run as many batches in parallel as we have CPU cores.
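A quick way to confirm the core counts on a node is to ask the OS (a minimal check; with hyper-threading enabled there are two logical cores per physical core):
# Logical cores reported by the OS, e.g. 56 on a 256GBv1 node
nproc
# Approximate number of physical cores with hyper-threading enabled
echo $(( $(nproc) / 2 ))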
To process our complete project across multiple compute nodes we will need to write a batch script that can be submitted to the SLURM job scheduler. This script will need to:

- Request a number of nodes and a time limit from the scheduler.
- Load the cellprofiler module.
- Loop over the image sets, computing the first and last image set index for each batch.
- Launch each batch with srun as a background task, writing to its own output directory.
- Wait for all batches to finish before the job exits.
Decide in advance on the number of nodes you wish to use. You can use up to 16 nodes concurrently, but we recommend using a lower number that's sufficient to keep total run times in the range of 4-12 hours. Requesting a lot of nodes to complete your analysis very quickly may result in long waits on the cluster queue, and the higher number of batched outputs may be more difficult to combine.
The following script performs analysis of a project with 1000 image sets across 4 nodes, in batches of 100 image sets.
We use 4 nodes from the 256GBv1 partition. Each node has 56 logical cores, or 28 physical cores. CellProfiler is best run using a physical core for each process, so we have 4 * 28 = 112 cores available across 4 nodes.
It's easiest to use equal size batches for processing, so we will run 100 batches of 10 image sets each, to complete our 1000 image set project. This wastes 12 of the 112 cores available, but that is a small percentage and dealing with uneven batch sizes to use all cores is difficult.
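A minimal sketch of this arithmetic, using the numbers above (substitute your own project and partition figures):
# 4 nodes * 28 physical cores = 112 concurrent batches at most
NODES=4
CORES_PER_NODE=28
MAX_BATCHES=$((NODES * CORES_PER_NODE))   # 112
# 1000 image sets in batches of 10 = 100 batches, all able to run at once
IMAGE_SETS=1000
BATCH_SIZE=10
N_BATCHES=$((IMAGE_SETS / BATCH_SIZE))    # 100
echo "${N_BATCHES} batches of ${BATCH_SIZE} image sets on up to ${MAX_BATCHES} cores"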
Note: In the future we will update our BioHPC Parameter Runner tool to make submitting CellProfiler batch processing easier. At present it doesn't support computing start and end indexes for a range, so it cannot be used for this task.
Use the script below as a template to set up your batch processing experiment. Create the script in the same location as the Batch_data.h5 file that was generated previously. Read the comments in the script, as you will need to change some settings related to batch sizes, etc.
#!/bin/bash
# Use the 256GBv1 partition
# (56 logical, 28 physical cores per node)
# You can use other partitions, see the portal for details about cores per node.
# Don't use super as it gives mixed nodes with different configurations.
#SBATCH -p 256GBv1
# Use 4 nodes total
#SBATCH -N 4
# Time Limit of 1 day
#SBATCH -t 1-00:00:00
# CellProfiler creates lots of threads, raise our process limit
# or batches may fail
ulimit -u 4096
# Load the cellprofiler module
module add cellprofiler/2.2.0-20160712
# We have 4 nodes * 56 cores
# BUT we want to use only physical cores here for best speed
# So we have 4 nodes * 28 cores = 112 cores available
# As a test we'll process 1000 image sets
# Closest factor of 1000 to 112 threads is 100
# So we will run 100 batches, so each will deal with 10 image sets
# srun is used to distribute our cellprofiler batches across the allocated nodes
# we run each batch as a background task (& at end of command)
# otherwise the script will wait for batches to complete 1 by 1
# Loop over i, which will be the first image set in the batch
# With 1000 image sets, and 100 batches, this is a sequence from
# 1 to 991, in increments of 10
# You need to change the values in the seq command below for different
# project and batch sizes
for i in $(seq 1 10 991); do
# First image of the batch is our $i
BATCH_START=$i
# Last image is $i + 9 (10 images inclusive)
# e.g. if first in batch is 21, last is 30 (10 images)
# Change the 9 to match your real batch size
BATCH_END=$((i+9))
# Run CellProfiler for the batch via srun
# Allocate 2 cpus per task because we want a physical core per process
# and there are 2 logical cores per physical core
# Specify the JVM heap size, as otherwise batches fail by running out of RAM
# with large numbers of batches per node
# -p specifies our batch file
# -c disables the GUI
# -r runs the batch immediately
# -b do not build extensions
# -f first image set for this batch
# -l last image set for this batch
# -o output folder for this batch
srun --exclusive -N1 -n1 --cpus-per-task=2 cellprofiler --jvm-heap-size=4G -p Batch_data.h5 -c -r -b -f ${BATCH_START} -l ${BATCH_END} -o batch_${i}_out &
done
# When we get here we've started all the batches running
# Now wait for all the batches we started to finish
wait
Once the batch job script is written and ready, submit it with the `sbatch` command, e.g.:
sbatch cellprofiler_4_node.sh
If your script is valid, the output of the `sbatch` command is a job id; if there is a mistake, an error message is printed instead.
You can check the status of your job with the `squeue -u $USER` command, which lists jobs on the cluster belonging to you.
As the job runs it will write log messages into a file called `slurm-<jobid>.out`, where `<jobid>` is the job id reported by `sbatch` and shown by `squeue`.
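For example, to monitor progress (the job id 12345 here is hypothetical; use the id sbatch reported):
# List your jobs on the cluster
squeue -u $USER
# Follow the log output of the running job
tail -f slurm-12345.out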
If you follow the script above, the output of each batch will be in a separate folder named e.g. batch_41_out, where 41 is the index of the first image set in the batch.
You may wish to collect all the output into a single directory. If your output is uniquely named files per input (i.e. for each input you create a derivative image named similarly to the input file) you can use a command such as:
mkdir combined_output
cp -r batch_*_out/* combined_output/
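You can sanity check the copy by comparing file counts. The counts will differ if batches wrote identically named files, which is exactly the CSV situation described below:
# Number of files produced across all batch directories
find batch_*_out -type f | wc -l
# Number of files collected into the combined directory
find combined_output -type f | wc -l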
However, if you generate e.g. summary CSV files then they will be named the same in each batch-specific output directory, and you will need to merge the data inside the CSV files across batches. If you just copy the files as above then only the CSV file from the final batch will remain in your combined_output directory.
It's possible to merge CSV files that have identical structure using Linux commands. If you generate a file called MyExpt_Nuclei.csv from each batch run then you can create a merged file as follows:
# Take the header (first line) from the first batch
head -n1 batch_1_out/MyExpt_Nuclei.csv > Combined_Nuclei.csv
# Append all other lines except header from all files
tail -q -n+2 batch_*_out/MyExpt_Nuclei.csv >> Combined_Nuclei.csv
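As a quick check that nothing was lost, the line counts should add up: assuming each batch file has exactly one header line, the merged file should contain all batch data lines plus the single header kept:
# Total lines across all batch CSVs (includes one header per file)
cat batch_*_out/MyExpt_Nuclei.csv | wc -l
# Lines in the merged file: the total above, minus one header
# per batch file, plus the single header kept
wc -l Combined_Nuclei.csv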