The Bioconductor project is a software ecosystem for the statistical analysis of large biological datasets in the R programming language. The project was initiated in 2001, and has since grown tremendously in size and usage. Bioconductor provides software for accessing databases, data structures for analyzing high-throughput data, and novel statistical methods for analyzing these data.
Accessing Bioconductor software on Nucleus is straightforward. All that is needed is a working R installation. On Nucleus, there are three main ways that users can run R: using the OnDemand RStudio app, through the module system, or via containers.
The RStudio OnDemand tool allows users to reserve a dedicated node on Nucleus, and serves the RStudio integrated development environment as a web app. This is the most straightforward way to access R/Bioconductor on Nucleus.
To access this, navigate to the BioHPC portal, and select "BioHPC OnDemand > OnDemand RStudio". The next page will allow you to select a partition to run on and launch a job right away, which you can connect to in your browser.
This approach is fast and easy, and provides a graphical environment to work from. However, jobs are capped at 20 hours, and users have limited ability to modify their environment in this setup. If you need shell access or need to run a particularly long job, read on.
Users can also access R/Bioconductor by loading an R module, and then installing and loading software from Bioconductor. This can be done from a terminal in a WebVis session launched from the portal, or via an SSH connection. Ensure that you're not on a login node (Nucleus004 or Nucleus005) before launching a resource-intensive job!
In a terminal, load the R or RStudio module corresponding to the version of R you wish to use, then launch R. For example, to launch version 4.4 of R, you could use the following commands in a terminal on a compute node on Nucleus:
module load R/4.4-img
R
This method gives users greater control over their environment, and is amenable to long-running and/or batch jobs. However, users can only access the versions of R that are available as modules. Additionally, some packages that require compilation may fail to install if their dependencies cannot be met. This can be a particular challenge with some Bioconductor packages.
Another way to access R/Bioconductor software on BioHPC is through containers. In this method, the user downloads and runs a pre-built container that includes R and, potentially, some packages and their dependencies.
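Because an R session started from a terminal runs on whichever node that terminal is on, it can be worth checking from inside R that the session is not on a login node before starting heavy work. The following is a minimal sketch, not an official BioHPC utility: is_login_node is a hypothetical helper, and it assumes the hostname reported by the operating system matches the login node names given above (Nucleus004, Nucleus005), ignoring case and any domain suffix.

```r
# Hypothetical helper: TRUE if `node` looks like one of the Nucleus login nodes.
# Assumption: hostnames match the names listed in this guide, ignoring case
# and any trailing domain suffix (e.g. ".some.domain").
is_login_node <- function(node = Sys.info()[["nodename"]],
                          login_nodes = c("nucleus004", "nucleus005")) {
  short_name <- tolower(sub("\\..*$", "", node))
  short_name %in% login_nodes
}

# Warn (rather than stop) so that the check never blocks an interactive session.
if (is_login_node()) {
  warning("This R session appears to be on a login node; ",
          "move to a compute node before running heavy jobs.")
}
```

If the hostnames on your nodes follow a different pattern, pass your own values through the login_nodes argument.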
To do this, we can load the Singularity module to run containers, then obtain and run a container that includes the software that we want. Given that this guide is focused on Bioconductor, we can pull a container specifically designed to work with Bioconductor:
module load singularity/3.9.9
singularity run docker://bioconductor/bioconductor:RELEASE_3_21-R-4.5.0
This will launch an RStudio Server instance which you can connect to using your browser.
Once you've successfully launched R, you can access Bioconductor by installing the BiocManager package:
install.packages("BiocManager")
Now, to install any package from Bioconductor, we can use BiocManager::install() instead of the typical install.packages() command. For example, to install the package DESeq2 from Bioconductor, we could use the following command:
BiocManager::install("DESeq2")
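When several packages are needed, it may be convenient to install only those that are not already present. The sketch below is one possible approach rather than a recommended procedure: install_missing is a hypothetical helper name, and the update = FALSE and ask = FALSE arguments are used on the assumption that you want to skip the interactive prompt about updating other packages.

```r
# Hypothetical helper: install only the packages in `pkgs` that are not yet
# present in the current R library, and return their names invisibly.
install_missing <- function(pkgs) {
  to_install <- setdiff(pkgs, rownames(installed.packages()))
  if (length(to_install) > 0) {
    # BiocManager itself comes from CRAN, so install it the usual way if needed.
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
    }
    # update = FALSE and ask = FALSE avoid the prompt to update other packages.
    BiocManager::install(to_install, update = FALSE, ask = FALSE)
  }
  invisible(to_install)
}

# Example call (commented out so that nothing is installed by accident):
# install_missing(c("DESeq2", "SummarizedExperiment"))
```

Calling the helper with packages that are already installed does nothing, so it can be placed at the top of an analysis script.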
Here we provide some examples of software available on Bioconductor for analysis of particular types of data. Any of these can be installed by running BiocManager::install() at an R prompt as described above.
Schematic of a SummarizedExperiment object. From SummarizedExperiment documentation.
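As a rough illustration of the structure shown in the schematic, the following sketch builds a small SummarizedExperiment and subsets it by sample annotation. It assumes the SummarizedExperiment package is already installed, and does nothing otherwise. All gene names, sample names, and counts are made up for the example.

```r
# Toy example: every name and value below is invented for illustration.
if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  library(SummarizedExperiment)

  # Assay: a features-by-samples matrix (4 genes x 3 samples).
  counts <- matrix(1:12, nrow = 4,
                   dimnames = list(paste0("gene", 1:4),
                                   paste0("sample", 1:3)))

  # Column data: one row of annotation per sample.
  sample_info <- DataFrame(condition = c("control", "treated", "treated"),
                           row.names = colnames(counts))

  se <- SummarizedExperiment(assays = list(counts = counts),
                             colData = sample_info)

  # Subsetting by column keeps the assay and the sample annotation in sync.
  treated <- se[, se$condition == "treated"]
  print(dim(treated))
  print(colData(treated))
}
```

The point of the container is that the assay matrix and the per-sample annotation travel together, so a single subsetting operation applies to both.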
Summary of scRNA analysis in Bioconductor. From Orchestrating Single Cell Analysis in Bioconductor.
SingleCellExperiment: an extension of the SummarizedExperiment class for single cell transcriptomic studies.
Example of aggregative relationship between assays in a QFeatures object. From R for Mass Spectrometry book.
Interaction between Bioconductor and other ecosystems is generally done by saving a file to disk in a format that both systems can read. Domain- or task-specific data types often have widely agreed-upon file formats. For example, FASTQ, BAM, and VCF files are de facto standards for short-read sequencing data, and can be used to move data from one system to another.
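The file-based exchange described above can be sketched with base R alone. The example uses a plain CSV file purely for illustration, since formats such as BAM or VCF need dedicated packages; the matrix contents and names are made up, and a temporary file stands in for a real shared location.

```r
# Toy count matrix; gene names, sample names, and values are invented.
counts <- matrix(c(10L, 0L, 5L, 3L, 8L, 1L), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))

# Write the matrix to disk in a format most other tools can read.
out_file <- tempfile(fileext = ".csv")
write.csv(counts, out_file)

# Read it back, as another tool (or a later R session) might.
restored <- as.matrix(read.csv(out_file, row.names = 1, check.names = FALSE))

# Confirm that the values and the row and column names survived the round trip.
stopifnot(all(restored == counts))
stopifnot(identical(dimnames(restored), dimnames(counts)))
```

Checking the round trip in this way is a cheap guard against silent changes such as altered column names or dropped row labels.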
Here are a few specific examples and tips for interaction:
SingleCellExperiment objects can be converted to Seurat objects, and Seurat objects to SingleCellExperiment objects, using Seurat's as.Seurat() and as.SingleCellExperiment() functions, respectively. Data can be saved as an h5ad file for use with Scanpy by using the zellkonverter, sceasy, or anndataR packages.
The system() function within R allows users to run a shell command. Check the documentation for this function, as some operations may have unexpected results.
The Bioconductor project provides numerous high-quality bioinformatics software packages for several different domains. Many of these packages build on common data structures, which makes them easier to use together. Using most Bioconductor packages on Nucleus is straightforward, but if you encounter any issues, please submit a ticket to biohpc-help@utsouthwestern.edu.