RNAseq analysis with Bioconductor

The Bioconductor project is a software ecosystem for the analysis of biological datasets in the R programming language. Bulk RNAseq data is one of the most frequently analyzed data types using Bioconductor software. Here, we will walk through a standard analysis procedure using this software running on Nucleus, the BioHPC compute cluster.

The steps we will follow are outlined below:

Quantify transcripts with Salmon
Import quantification results into R with tximeta
Differential expression analysis with DESeq2

This tutorial is closely modeled after this Bioconductor tutorial. If you'd like a more in-depth coverage of these tools, please visit the above document or the documentation for the tools discussed below.

Set up environment

This tutorial will provide instructions on how to run a standard bulk RNAseq analysis on Nucleus. In general these steps are not computationally demanding and can be run on a modern desktop or laptop computer. However, here we will assume that the raw data (FASTQ files) is already saved to BioHPC systems, and that the analysis will be taking place on Nucleus.

First, we must reserve a compute node on the cluster. Nucleus004 and Nucleus005 are login nodes, and intensive processes such as our analysis should be run on a compute node instead of a login node. Steps 1 and 2 are run outside of R/Bioconductor. For an easy way to reserve a compute node and get a graphical interface, you can reserve a WebVis session from the portal. This workflow requires less than 32GB of memory to run, so you can request a node from any available partition. However, these tools do not require a GPU, so you should not request a GPU session in order to reserve those resources for users who do require a GPU. Once you have a WebVis session started, you can open a terminal to run commands:

Step 3 is handled within R/Bioconductor. The easiest way to launch an interactive R session on a dedicated compute node is to use BioHPC's OnDemand RStudio service. Once you've loaded into a session, make sure that you have BiocManager, tximeta and DESeq2 installed. The following R code will install BiocManager if it is missing, then use it to install tximeta and DESeq2:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("tximeta", "DESeq2", "org.Hs.eg.db"))

Step 1: Download data

We need to download both the experimental data as well as a reference index to use for quantification. Begin by first creating a project directory to hold the data and analysis. Since these analyses can grow large in size, it is recommended that you do not use your home directory for this. Instead, consider creating a directory on /project. For example, you might choose to store your data for this tutorial in this location, where angle brackets and the text within them is replaced with values appropriate to you: /project/<department>/<group_name>/<s######>/rnaseq_tutorial.

For example, say the user s123456 who works in the Baj lab in the department of Bioinformatics would like to create the above directory and enter it. They could use the mkdir command to create the directory, and the cd command to change directory to that location:

mkdir /project/Bioinformatics/Baj_lab/s123456/rnaseq_tutorial
cd /project/Bioinformatics/Baj_lab/s123456/rnaseq_tutorial

Alternatively, in a WebVis, users can also use the graphical file explorer to navigate and create a directory, then right click the empty directory and select "Open in Terminal".

Experimental data

Note: if you want to analyze your own data, this step isn't necessary. You should already have FASTQ files available and can begin by obtaining a reference index below.

For this tutorial, we will use the classic airway dataset (Himes et al, 2014). This dataset was generated from an experiment where four primary human airway human smooth muscle cell lines were treated with 1 µM dexamethasone for 18 hours. For each line, there exists a treated and untreated sample, for eight total samples. These data are publically available from the NCBI Sequence Read Archive (SRA) and can be downloaded and unpacked with the tool fasterq-dump.

First, create a directory in the project directory to store the raw data (FASTQ files) and enter it using the same commands we used above:

mkdir fastq
cd fastq

Now, we'll use fasterq-dump to pull and convert the raw data from SRA with a single command. BioHPC has the SRA Toolkit, which contains fasterq-dump, installed as a module, so we don't need to install this software. Note that this process will take 20-30 minutes to complete, and requires about 100 GB of storage space.

# List runs to download
DATASETS="SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521"

# Load SRA Toolkit module
module load sra_toolkit/3.0.2

# Download and convert data
fasterq-dump ${DATASETS}

When this completes, you should have 20 FASTQ files in the fastq directory.

Reference index

Next, we need to obtain a reference index. Salmon will use this as a reference to assign sequencing reads in the FASTQ files to transcripts. Index files are pre-computed and available for download for common species from refgenie. If you're working with an uncommon organism, you may need to generate your own index. You can find instructions for that here.

Since we're using data generated from human cells, we can use the pre-computed index from refgenie. Note that there are several "flavors" of index available; here we use one that is balanced between mapping accuracy and size.

# Download hg38 index
wget -O salmon_partial_sa_index.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/salmon_partial_sa_index?tag=default

# Decompress the index
tar -xzvf salmon_partial_sa_index.tgz

# Rename the index directory
mv default salmon_partial_sa_index_hg38

Now we're ready to quantify!

Step 2: Quantification

The first step in the analysis is to assign sequencing reads to features in the transcriptome. Modern software to handle this task is designed to be "transcript-aware", which means that it will attempt to assign reads to transcripts instead of genes. The benefit of this is that it allows for differential isoform expression analysis, while still permitting users to collapse the quantification into gene-level results if they're not interested in these comparisons.

Here, we'll use a tool named Salmon to perform quantification. This tool uses quasi-mapping to quantify count estimates from RNAseq data. Because this tool doesn't actually perform alignment, it can operate with very low memory requirements and is incredibly fast. However, it will not generate typical outputs from classical alignment (eg, BAM files) and may not be suitable for analyses where identifying chromosomal cooordinates of reads is essential. If your analysis requires these aspects, consider classical alignment with a tool like STAR instead.

Quantifying with Salmon

Salmon exists as a module pre-installed on Nucleus. However, the version installed on Nucleus is quite old as of the time of writing. Instead, we'll use a Singularity container to run a newer version of Salmon to take advantage of modern improvements.

# Load Singularity module
module load singularity/3.9.9

# Create output directory
mkdir quants

# Run Salmon for each pair of FASTQ files
for s in ${DATASETS}; do
    singularity run docker://combinelab/salmon:1.10.3 \
        salmon quant \
        -i salmon_partial_sa_index_hg38 \
        -l A \
        -p 16 \
        --gcBias \
        -o quants/${s} \
        -1 fastq/${s}_1.fastq \
        -2 fastq/${s}_2.fastq
done

Let's take a minute to review the call to Salmon here:

We're running Salmon in a container using Singularity, so we use singularity run {...} to specify and launch the container
We launch the Salmon software and select the quantification process with salmon quant
-i specifies where the index is located
-l A specifies that we want Salmon to automatically detect our library type
-p 16 specifies that we want Salmon to use 16 cores for parallel processing to speed up the results
--gcBias specifies that we want Salmon to attempt to correct for any GC bias present (this is common in RNAseq data)
-o {...} specifies where want the output saved. In this case, the output will saved in sample-specific directories under the quants directory that we created above
-1 {...} specifies where to find the first FASTQ file per sample
-2 {...} specifies where to find the second FASTQ file per sample

This entire process is wrapped in a for loop to run this command on every sample specified in DATASETS above. This entire loop should take about 15 minutes to complete, with a total output size of about 70 MB.

Note: if you're analyzing your own data, you'll need to modify the above to make sure that you're passing in the correct file names. You can update the DATASETS variable to match your file names, and/or modify the arguments passed to the -1 and -2 parameters to match your file names.

Quantification output

Salmon outputs several files, but the quantification results are stored in the quant.sf file in each directory. Here's a few rows of an example quant.sf file:

Name	Length	EffectiveLength	TPM	NumReads
ENST00000631435.1	12	12.000	0.000000	0.000000
ENST00000415118.1	8	7.000	0.000000	0.000000
ENST00000448914.1	13	13.000	0.000000	0.000000
ENST00000645792.1	930	797.977	0.396113	4.000
ENST00000560634.1	4910	4867.902	2.540374	156.491

Some notes to consider here:

Name: This provides ENSEMBL transcript identifiers (as denoted by the ENST prefix) instead of gene identifiers (which would have the ENSG prefix). This enables a transcript-level comparison downstream. To perform a gene-level comparison instead, all given transcripts of a gene will need to be collapsed into their corresponding genes (which we will demonstrate in Step 3).
Length: This is the length of the transcript in base pairs.
EffectiveLength: This is the effective length of the transcript after taking into account all features being modeled, such as GC bias.
TPM: An estimate of relative abundance of a transcript expressed as Transcripts Per Million (TPM).
NumReads: Salmon's estimate of the number of reads that mapped to the transcript.

There are also other files output by Salmon. We won't cover them here, but more detailed description of the other files is present in the documentation here. It is important to note that our downstream tools will anticipate these files to be present with their original names, so once these files are generated, do not move or rename them.

Step 3: Differential expression analysis

Loading data

Now that estimates of abundance have been calculated, these can be used for differential expression analysis. This allows us to ask the question which genes are expressed at different levels in one group compared to another? To do this, we'll use tools in the R/Bioconductor project for statistical modeling and analysis. Before we can do that, we first need to load the data into R.

Launch an OnDemand RStudio session from the BioHPC Portal, and set the working directory to your project directory using the file explorer in the bottom right pane:

Click the ... button on the right side (red arrow in the image below).
Enter the path to your project directory. For example, /project/<department>/<group_name>/<s######>/rnaseq_tutorial, where you've replaced <department>, <group_name>, and <s######> with the appropriate values for your project.
Click "More" -> "Set as working directory" (yellow arrow in the image below).

You should now see the contents of your project directory in the file explorer. Now that we've navigated to our project directory, we need to load the quantification results from Salmon. We can use the tximeta package to do so.

Begin by loading the packages that we need. Type the following into the console:

library(tximeta)
library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db)

This will load the import package tximeta, as well as the analysis package DESeq2, which we'll use later. If you receive an error similar to this: there is no package called 'tximeta', then you need to install these packages. You can do so by following the environment setup instructions at the beginning of this document.

We'll also need a sample sheet that provides additional information about the experiment. You can download a CSV copy of the sample sheet for this dataset from Bioconductor here. Save the file to your project directory- for this tutorial we'll name the file sample_table.csv. Then, we can load it into R using the read.csv() function. We also set the stringsAsFactors parameter to TRUE to make sure that we convert grouping columns to factors, which will make downstream analysis easier.

coldata <- read.csv("sample_table.csv", stringsAsFactors = TRUE)
coldata

  SampleName    cell   dex albut        Run avgLength Experiment    Sample    BioSample
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568 SAMN02422669
2 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346 SRS508567 SAMN02422675
3 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349 SRS508571 SAMN02422678
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572 SAMN02422670
5 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353 SRS508575 SAMN02422682
6 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354 SRS508576 SAMN02422673

Note that this file contains one row per sample, with sample level metadata provided, including cell line, dexamethasone treatment, and several SRA identifiers. tximeta can read use this file to understand what needs to be imported, but it requires that the following columns exist:

names: unique sample identifiers
files: locations of quantification files

These columns don't yet exist in this table, but we can easily add them. You could do this manually in a text editor or spreadsheet software like Excel, but this is also relatively simple to do in R. Here we'll use the Run column to create unique sample names, and create a files column to point towards the individual quant.sf files.

coldata$names <- coldata$Run
coldata$files <- file.path("quants", coldata$names, "quant.sf")
coldata

  SampleName    cell   dex albut        Run avgLength Experiment    Sample    BioSample      names                      files
1 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345 SRS508568 SAMN02422669 SRR1039508 quants/SRR1039508/quant.sf
2 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346 SRS508567 SAMN02422675 SRR1039509 quants/SRR1039509/quant.sf
3 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349 SRS508571 SAMN02422678 SRR1039512 quants/SRR1039512/quant.sf
4 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350 SRS508572 SAMN02422670 SRR1039513 quants/SRR1039513/quant.sf
5 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353 SRS508575 SAMN02422682 SRR1039516 quants/SRR1039516/quant.sf
6 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354 SRS508576 SAMN02422673 SRR1039517 quants/SRR1039517/quant.sf
7 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357 SRS508579 SAMN02422683 SRR1039520 quants/SRR1039520/quant.sf
8 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358 SRS508580 SAMN02422677 SRR1039521 quants/SRR1039521/quant.sf

We can now load the quantification data using this table and tximeta():

se <- tximeta(coldata)

This command will load the data and attempt to identify and download the relevant transcriptome annotations. It may ask you whether you'd like to create a cache directory. You can answer "Yes (use default)" if you expect to be doing this type of analysis often, and would like to skip the download step each time. Alternatively, you may answer "No (use temp)" if not- this will let you download the annotations fresh each time you need them, and save a bit of storage space in your home directory.

Let's check the structure of the object that we've just created:

se

class: RangedSummarizedExperiment 
dim: 177456 8 
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(177456): ENST00000631435 ENST00000415118 ... ENST00000646356 ENST00000645792
rowData names(8): tx_id tx_biotype ... tx_id_version tx_name
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names

tximeta() outputs a SummarizedExperiment object by default. You can learn more about what this means here, but for now, we just need to know that this object contains both the quantification data and the metadata for all 8 samples. Additionally, we can confirm that the rownames are transcripts (ENST...) instead of genes (ENSG...), as we saw above when examining the Salmon quant files. Also note that the colData names reflect the columns that were present in the sample sheet. In fact, the full sample sheet (excluding the file names) is still stored in the SummarizedExperiment object. We can confirm that with the colData() function:

colData(se)

DataFrame with 8 rows and 10 columns
           SampleName     cell      dex    albut        Run avgLength Experiment    Sample    BioSample      names
             <factor> <factor> <factor> <factor>   <factor> <integer>   <factor>  <factor>     <factor>   <factor>
SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126  SRX384345 SRS508568 SAMN02422669 SRR1039508
SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126  SRX384346 SRS508567 SAMN02422675 SRR1039509
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126  SRX384349 SRS508571 SAMN02422678 SRR1039512
SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87  SRX384350 SRS508572 SAMN02422670 SRR1039513
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120  SRX384353 SRS508575 SAMN02422682 SRR1039516
SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126  SRX384354 SRS508576 SAMN02422673 SRR1039517
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101  SRX384357 SRS508579 SAMN02422683 SRR1039520
SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98  SRX384358 SRS508580 SAMN02422677 SRR1039521

For this analysis, we want to generate a list of differentially expressed genes instead of transcripts. let's collapse the transcripts down to the gene level to enable that.

gse <- summarizeToGene(se)
gse

class: RangedSummarizedExperiment 
dim: 37951 8 
metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
assays(3): counts abundance length
rownames(37951): ENSG00000000003 ENSG00000000005 ... ENSG00000288031 ENSG00000288053
rowData names(9): gene_id gene_name ... entrezid tx_ids
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names

Now we can see that we have many fewer entries (37951 genes vs 177456 transcripts), and that the rownames are all prefixed with ENSG instead of ENST. Also note that the metadata is preserved.

We now have what we need to perform differential expression analysis.

Differential expression analysis

Bioconductor contains many packages that allows for differential expression analysis (DEA). One of the most popular ones is DESeq2, which we'll use here.

Creating a DESeqDataSet

The first step is deciding on our design. In this context, the design refers to which variables we want to include in our analysis, and how we want to consider them. Recall that this dataset is composed of cell lines that were treated with dexamethasone or vehicle. The treatment status is outlined in the dex column of our metadata, "untrt" for untreated and "trt" for treated. Our analysis depends on the "control" value being the first level; since we want "untrt" to be considered a control, we need to make sure that it comes first in the ordering.

levels(gse$dex)

[1] "trt"   "untrt"

Since our control group "untrt" isn't the first level, we need to change this.

gse$dex <- relevel(gse$dex, "untrt")
levels(gse$dex)

[1] "untrt" "trt"

Now we have our control group set as the first level.

Looking at the metadata, the cell line used is listed in the cell column. This example has 4 different cell lines used. Since comparing different cell lines may confound the analysis of the effects of dexamethasone, we will want to control for this variable in our design.

With our SummarizedExperiment object and design in hand, we can now create a DESeqDataSet object.

dds <- DESeqDataSet(gse, design = ~ cell + dex)

Filtering the data

First we will filter genes which are not or very infrequently detected. This will help to speed up the analysis by preventing comparisons for genes which are barely detected anyway. A general recommendation from the DESeq2 authors is to keep genes which have at least 10 counts for a minimum number of samples (defined as the smallest group size). Here, we have two groups with 4 sample each, so our minimum number of samples is 4.

nrow_prefilter <- nrow(dds)
keep <- rowSums(counts(dds) >= 10) >= 4
dds <- dds[keep,]
cat("Number of genes before filtering:", nrow_prefilter,
    "\nNumber of genes after filtering:", nrow(dds))

Number of genes before filtering:  37951
Number of genes after filtering:  14962

This filter removed about 23,000 genes.

Data exploration

There are many ways to explore the data prior to running statistical test. Here, we'll use principal components analysis (PCA) to qualitatively explore clustering of samples. There are many other ways to explore the data- you can see an example of other methods on the tutorial that inspired this one here.

Before we do any exploration, we need to address the data heteroscedasticity. This means that the variance of the data is not constant relative to the values of the data itself. Most exploratory data tools operate on the assumption that the variance is stable, but that is not generally true of count data, which includes RNAseq. DESeq2 provides transformation tools to address this. There are two options, variance-stabilizing transform (VST) and regularized-logarithm transformation (rlog). The DESeq2 authors give the following recommendations for choosing between them: - VST is faster than rlog, so may be preferred when analyzing large datasets (30+ samples) - VST is less sensitive to high count outliers than rlog - rlog may be preferred when there is a large difference in library size across samples (eg, an order of magnitude or more)

We can check the library size of each sample after filtering:

colSums(counts(dds))

SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521 
  21108721   19351430   26062350   15685250   25142569   31799563   19656881   21815946

Since we do not have a large difference in library size across samples, we will use VST here. Because the statistical tests that we will run later in this tutorial require raw (not transformed) counts, we will be careful not to overwrite the raw counts.

vsd <- vst(dds, blind = FALSE)
vsd

class: DESeqTransform 
dim: 14962 8 
metadata(8): tximetaInfo quantInfo ... assignRanges version
assays(1): ''
rownames(14962): ENSG00000000003 ENSG00000000419 ... ENSG00000287977 ENSG00000288031
rowData names(13): gene_id gene_name ... allZero dispFit
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names

Here, we set blind = FALSE to allow the algorithm to use the design that we established earlier to estimate group variances instead of an overall variance. Now that the data have been transformed, we can create a PCA plot to examine the data, grouped by dex:

plotPCA(vsd, intgroup = "dex")

Now do the same, but group by cell:

plotPCA(vsd, intgroup = "cell")

Note how the dex grouping seems to separate the data in PCA space well, while the cell grouping does not. This is a good sign that most of the transcriptional variance that was captured is due to treatment, and not cell line.

Differential expression analysis

Now, we will use DESeq2 to perform differential expression analysis. These statistical tests will tell us which genes, if any, are more highly expressed in one group relative to another.

We need to use raw, not transformed, counts for this step. Therefore, we'll go back to our dds object instead of vsd. To run the analysis, we use the DESeq() function.

dds <- DESeq(dds)
dds

class: DESeqDataSet 
dim: 14962 8 
metadata(8): tximetaInfo quantInfo ... assignRanges version
assays(7): counts abundance ... H cooks
rownames(14962): ENSG00000000003 ENSG00000000419 ... ENSG00000287977 ENSG00000288031
rowData names(43): gene_id gene_name ... deviance maxCooks
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names

This command updates our dds object to add the results of the differential expression tests. We can summarize the results of those tests by using the results() function:

res <- results(dds)
res

log2 fold change (MLE): dex trt vs untrt 
Wald test p-value: dex trt vs untrt 
DataFrame with 14962 rows and 6 columns
                 baseMean log2FoldChange     lfcSE         stat      pvalue        padj
                <numeric>      <numeric> <numeric>    <numeric>   <numeric>   <numeric>
ENSG00000000003  745.4556    -0.36829304 0.1011509 -3.641024972 2.71555e-04 1.99498e-03
ENSG00000000419  514.0339     0.19272943 0.1256458  1.533911213 1.25051e-01 2.95860e-01
ENSG00000000457  316.1027    -0.00012527 0.1563233 -0.000801355 9.99361e-01 9.99774e-01
ENSG00000000460   93.5087    -0.19217432 0.2820222 -0.681415452 4.95609e-01 7.05313e-01
ENSG00000000971 5816.2774     0.42807726 0.0859304  4.981675143 6.30362e-07 8.65921e-06
...                   ...            ...       ...          ...         ...         ...
ENSG00000287725   24.8133     -0.5503712  0.665001    -0.827625    0.407883    0.629768
ENSG00000287856   37.8949     -0.5059185  0.412521    -1.226408    0.220045    0.431877
ENSG00000287908   18.2239      0.0930476  0.643105     0.144685    0.884960    0.948305
ENSG00000287977   11.6876      0.5146171  0.673482     0.764115    0.444799    0.662099
ENSG00000288031   13.1838      0.1969163  0.631894     0.311629    0.755323    0.880450

The output of results() is a DataFrame with several entries: - baseMean: the normalized average count value of all samples present for this gene - log2FoldChange: the log2-transformed effect size estimate - lfcSE: the standard error estimate of the log2FoldChange - stat: the calculated Wald statistic - pvalue: the Wald test p-value - padj: the BH-adjusted p-value

It is important to note that by default, results() will return the log2 fold change and p-values for the last entry in the design formula. Recall that when we first created our DESeqDataSet object, our design was ~ cell + dex. Therefore, the changes that we see here are specifically for changes in dex. We also set the first factor level of dex to be "untrt", so that it's considered as the reference value. We can confirm all of this based on the first two lines of this output, which reveals that our comparison here is "dex trt vs untrt". For a more complex analysis, you can provide the comparison more specifically as a contrast. See the DESeq2 documentation for more information on contrasts.

You can also generate a short summary of the results using the summary() function:

summary(res)

out of 14962 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 2258, 15%
LFC < 0 (down)     : 2012, 13%
outliers [1]       : 0, 0%
low counts [2]     : 291, 1.9%
(mean count < 11)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

We can see that with an adjusted p-value cutoff of 0.1 and log2 fold change cutoff of 0, 15% of the tested genes were found to be upregulated in the dexamethasone-treated group, while 13% were found to be downregulated. We may choose to be more (or less) stringent with our cutoffs, which would impact the number of genes found to be differentially regulated.

Adding gene symbols

The above results returns genes as ENSEMBL IDs, but it is often more informative to have gene symbols present. We can map these IDs to their matching gene symbols using the AnnotationDbi package.

ens.str <- substr(rownames(res), 1, 15)
res$symbol <- mapIds(org.Hs.eg.db,
                     keys=ens.str,
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
res

log2 fold change (MLE): dex trt vs untrt 
Wald test p-value: dex trt vs untrt 
DataFrame with 14962 rows and 7 columns
                 baseMean log2FoldChange     lfcSE         stat      pvalue        padj      symbol
                <numeric>      <numeric> <numeric>    <numeric>   <numeric>   <numeric> <character>
ENSG00000000003  745.4556    -0.36829304 0.1011509 -3.641024972 2.71555e-04 1.99498e-03      TSPAN6
ENSG00000000419  514.0339     0.19272943 0.1256458  1.533911213 1.25051e-01 2.95860e-01        DPM1
ENSG00000000457  316.1027    -0.00012527 0.1563233 -0.000801355 9.99361e-01 9.99774e-01       SCYL3
ENSG00000000460   93.5087    -0.19217432 0.2820222 -0.681415452 4.95609e-01 7.05313e-01       FIRRM
ENSG00000000971 5816.2774     0.42807726 0.0859304  4.981675143 6.30362e-07 8.65921e-06         CFH
...                   ...            ...       ...          ...         ...         ...         ...
ENSG00000287725   24.8133     -0.5503712  0.665001    -0.827625    0.407883    0.629768          NA
ENSG00000287856   37.8949     -0.5059185  0.412521    -1.226408    0.220045    0.431877          NA
ENSG00000287908   18.2239      0.0930476  0.643105     0.144685    0.884960    0.948305          NA
ENSG00000287977   11.6876      0.5146171  0.673482     0.764115    0.444799    0.662099          NA
ENSG00000288031   13.1838      0.1969163  0.631894     0.311629    0.755323    0.880450          NA

Now, res has a column named symbol that provides the gene symbol for each ENSEMBL gene ID when available. If no symbol is available, this column will instead have NA.

Writing results

Finally, we can write the results to a CSV file for export:

write.csv(res, "DEA_results.csv)

Summary

In this tutorial, we have demonstrated how to acquire raw sequencing data from SRA, how to map RNAseq to the human transcriptome using Salmon, how to import those quantifications into R, and how to use the DESeq2 package in Bioconductor to perform an exploratory data analysis and differential expression analysis. This was a brief introduction - there exist many other software packages and analysis options for this type of analysis, but this should provide a solid start.

RNAseq analysis with Bioconductor

Set up environment

Step 1: Download data

Experimental data

Reference index

Step 2: Quantification

Quantifying with Salmon

Quantification output

Step 3: Differential expression analysis

Loading data

Differential expression analysis

Creating a DESeqDataSet

Filtering the data

Data exploration

Differential expression analysis

Adding gene symbols

Writing results

Summary

Further reading