The Bioconductor project is a software ecosystem for the analysis of biological datasets in the R programming language. Bulk RNAseq data is one of the most frequently analyzed data types using Bioconductor software. Here, we will walk through a standard analysis procedure using this software running on Nucleus, the BioHPC compute cluster.
The steps we will follow are outlined below:
This tutorial is closely modeled after this Bioconductor tutorial. If you'd like a more in-depth coverage of these tools, please visit the above document or the documentation for the tools discussed below.
This tutorial will provide instructions on how to run a standard bulk RNAseq analysis on Nucleus. In general these steps are not computationally demanding and can be run on a modern desktop or laptop computer. However, here we will assume that the raw data (FASTQ files) is already saved to BioHPC systems, and that the analysis will be taking place on Nucleus.
First, we must reserve a compute node on the cluster. Nucleus004 and Nucleus005 are login nodes, and intensive processes such as our analysis should be run on a compute node instead of a login node. Steps 1 and 2 are run outside of R/Bioconductor. For an easy way to reserve a compute node and get a graphical interface, you can reserve a WebVis session from the portal. This workflow requires less than 32GB of memory to run, so you can request a node from any available partition. However, these tools do not require a GPU, so you should not request a GPU session in order to reserve those resources for users who do require a GPU. Once you have a WebVis session started, you can open a terminal to run commands:
Step 3 is handled within R/Bioconductor. The easiest way to launch an interactive R session on a dedicated compute node is to use BioHPC's OnDemand RStudio service. Once you've loaded into a session, make sure that you have BiocManager
, tximeta
and DESeq2
installed. The following R code will install BiocManager
if it is missing, then use it to install tximeta
and DESeq2
:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("tximeta", "DESeq2", "org.Hs.eg.db"))
We need to download both the experimental data as well as a reference index to use for quantification. Begin by first creating a project directory to hold the data and analysis. Since these analyses can grow large in size, it is recommended that you do not use your home directory for this. Instead, consider creating a directory on /project
. For example, you might choose to store your data for this tutorial in this location, where angle brackets and the text within them is replaced with values appropriate to you: /project/<department>/<group_name>/<s######>/rnaseq_tutorial
.
For example, say the user s123456 who works in the Baj lab in the department of Bioinformatics would like to create the above directory and enter it. They could use the mkdir
command to create the directory, and the cd
command to change directory to that location:
mkdir /project/Bioinformatics/Baj_lab/s123456/rnaseq_tutorial
cd /project/Bioinformatics/Baj_lab/s123456/rnaseq_tutorial
Alternatively, in a WebVis, users can also use the graphical file explorer to navigate and create a directory, then right click the empty directory and select "Open in Terminal".
Note: if you want to analyze your own data, this step isn't necessary. You should already have FASTQ files available and can begin by obtaining a reference index below.
For this tutorial, we will use the classic airway dataset (Himes et al, 2014). This dataset was generated from an experiment where four primary human airway human smooth muscle cell lines were treated with 1 µM dexamethasone for 18 hours. For each line, there exists a treated and untreated sample, for eight total samples. These data are publically available from the NCBI Sequence Read Archive (SRA) and can be downloaded and unpacked with the tool fasterq-dump
.
First, create a directory in the project directory to store the raw data (FASTQ files) and enter it using the same commands we used above:
mkdir fastq
cd fastq
Now, we'll use fasterq-dump
to pull and convert the raw data from SRA with a single command. BioHPC has the SRA Toolkit, which contains fasterq-dump
, installed as a module, so we don't need to install this software. Note that this process will take 20-30 minutes to complete, and requires about 100 GB of storage space.
# List runs to download
DATASETS="SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521"
# Load SRA Toolkit module
module load sra_toolkit/3.0.2
# Download and convert data
fasterq-dump ${DATASETS}
When this completes, you should have 20 FASTQ files in the fastq
directory.
Next, we need to obtain a reference index. Salmon will use this as a reference to assign sequencing reads in the FASTQ files to transcripts. Index files are pre-computed and available for download for common species from refgenie. If you're working with an uncommon organism, you may need to generate your own index. You can find instructions for that here.
Since we're using data generated from human cells, we can use the pre-computed index from refgenie. Note that there are several "flavors" of index available; here we use one that is balanced between mapping accuracy and size.
# Download hg38 index
wget -O salmon_partial_sa_index.tgz http://refgenomes.databio.org/v3/assets/archive/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/salmon_partial_sa_index?tag=default
# Decompress the index
tar -xzvf salmon_partial_sa_index.tgz
# Rename the index directory
mv default salmon_partial_sa_index_hg38
Now we're ready to quantify!
The first step in the analysis is to assign sequencing reads to features in the transcriptome. Modern software to handle this task is designed to be "transcript-aware", which means that it will attempt to assign reads to transcripts instead of genes. The benefit of this is that it allows for differential isoform expression analysis, while still permitting users to collapse the quantification into gene-level results if they're not interested in these comparisons.
Here, we'll use a tool named Salmon to perform quantification. This tool uses quasi-mapping to quantify count estimates from RNAseq data. Because this tool doesn't actually perform alignment, it can operate with very low memory requirements and is incredibly fast. However, it will not generate typical outputs from classical alignment (eg, BAM files) and may not be suitable for analyses where identifying chromosomal cooordinates of reads is essential. If your analysis requires these aspects, consider classical alignment with a tool like STAR instead.
Salmon exists as a module pre-installed on Nucleus. However, the version installed on Nucleus is quite old as of the time of writing. Instead, we'll use a Singularity container to run a newer version of Salmon to take advantage of modern improvements.
# Load Singularity module
module load singularity/3.9.9
# Create output directory
mkdir quants
# Run Salmon for each pair of FASTQ files
for s in ${DATASETS}; do
singularity run docker://combinelab/salmon:1.10.3 \
salmon quant \
-i salmon_partial_sa_index_hg38 \
-l A \
-p 16 \
--gcBias \
-o quants/${s} \
-1 fastq/${s}_1.fastq \
-2 fastq/${s}_2.fastq
done
Let's take a minute to review the call to Salmon here: - We're running Salmon in a container using Singularity, so we use singularity run {...}
to specify and launch the container - We launch the Salmon software and select the quantification process with salmon quant
- -i
specifies where the index is located - -l A
specifies that we want Salmon to automatically detect our library type - -p 16
specifies that we want Salmon to use 16 cores for parallel processing to speed up the results - --gcBias
specifies that we want Salmon to attempt to correct for any GC bias present (this is common in RNAseq data) - -o {...}
specifies where want the output saved. In this case, the output will saved in sample-specific directories under the quants
directory that we created above - -1 {...}
specifies where to find the first FASTQ file per sample - -2 {...}
specifies where to find the second FASTQ file per sample
This entire process is wrapped in a for loop to run this command on every sample specified in DATASETS above. This entire loop should take about 15 minutes to complete, with a total output size of about 70 MB.
Note: if you're analyzing your own data, you'll need to modify the above to make sure that you're passing in the correct file names. You can update the
DATASETS
variable to match your file names, and/or modify the arguments passed to the-1
and-2
parameters to match your file names.
Salmon outputs several files, but the quantification results are stored in the quant.sf
file in each directory. Here's a few rows of an example quant.sf
file:
Name | Length | EffectiveLength | TPM | NumReads |
---|---|---|---|---|
ENST00000631435.1 | 12 | 12.000 | 0.000000 | 0.000000 |
ENST00000415118.1 | 8 | 7.000 | 0.000000 | 0.000000 |
ENST00000448914.1 | 13 | 13.000 | 0.000000 | 0.000000 |
ENST00000645792.1 | 930 | 797.977 | 0.396113 | 4.000 |
ENST00000560634.1 | 4910 | 4867.902 | 2.540374 | 156.491 |
Some notes to consider here:
Name
: This provides ENSEMBL transcript identifiers (as denoted by the ENST prefix) instead of gene identifiers (which would have the ENSG prefix). This enables a transcript-level comparison downstream. To perform a gene-level comparison instead, all given transcripts of a gene will need to be collapsed into their corresponding genes (which we will demonstrate in Step 3).Length
: This is the length of the transcript in base pairs.EffectiveLength
: This is the effective length of the transcript after taking into account all features being modeled, such as GC bias.TPM
: An estimate of relative abundance of a transcript expressed as Transcripts Per Million (TPM).NumReads
: Salmon's estimate of the number of reads that mapped to the transcript.There are also other files output by Salmon. We won't cover them here, but more detailed description of the other files is present in the documentation here. It is important to note that our downstream tools will anticipate these files to be present with their original names, so once these files are generated, do not move or rename them.
Now that estimates of abundance have been calculated, these can be used for differential expression analysis. This allows us to ask the question which genes are expressed at different levels in one group compared to another? To do this, we'll use tools in the R/Bioconductor project for statistical modeling and analysis. Before we can do that, we first need to load the data into R.
Launch an OnDemand RStudio session from the BioHPC Portal, and set the working directory to your project directory using the file explorer in the bottom right pane:
...
button on the right side (red arrow in the image below)./project/<department>/<group_name>/<s######>/rnaseq_tutorial
, where you've replaced <department>
, <group_name>
, and <s######>
with the appropriate values for your project.
You should now see the contents of your project directory in the file explorer. Now that we've navigated to our project directory, we need to load the quantification results from Salmon. We can use the tximeta
package to do so.
Begin by loading the packages that we need. Type the following into the console:
library(tximeta)
library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db)
This will load the import package tximeta
, as well as the analysis package DESeq2
, which we'll use later. If you receive an error similar to this: there is no package called 'tximeta'
, then you need to install these packages. You can do so by following the environment setup instructions at the beginning of this document.
We'll also need a sample sheet that provides additional information about the experiment. You can download a CSV copy of the sample sheet for this dataset from Bioconductor here. Save the file to your project directory- for this tutorial we'll name the file sample_table.csv
. Then, we can load it into R using the read.csv()
function. We also set the stringsAsFactors
parameter to TRUE
to make sure that we convert grouping columns to factors, which will make downstream analysis easier.
coldata <- read.csv("sample_table.csv", stringsAsFactors = TRUE)
coldata
SampleName cell dex albut Run avgLength Experiment Sample BioSample
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673
Note that this file contains one row per sample, with sample level metadata provided, including cell line, dexamethasone treatment, and several SRA identifiers. tximeta
can read use this file to understand what needs to be imported, but it requires that the following columns exist:
names
: unique sample identifiersfiles
: locations of quantification filesThese columns don't yet exist in this table, but we can easily add them. You could do this manually in a text editor or spreadsheet software like Excel, but this is also relatively simple to do in R. Here we'll use the Run
column to create unique sample names, and create a files
column to point towards the individual quant.sf
files.
coldata$names <- coldata$Run
coldata$files <- file.path("quants", coldata$names, "quant.sf")
coldata
SampleName cell dex albut Run avgLength Experiment Sample BioSample names files
1 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669 SRR1039508 quants/SRR1039508/quant.sf
2 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675 SRR1039509 quants/SRR1039509/quant.sf
3 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678 SRR1039512 quants/SRR1039512/quant.sf
4 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670 SRR1039513 quants/SRR1039513/quant.sf
5 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682 SRR1039516 quants/SRR1039516/quant.sf
6 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673 SRR1039517 quants/SRR1039517/quant.sf
7 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579 SAMN02422683 SRR1039520 quants/SRR1039520/quant.sf
8 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580 SAMN02422677 SRR1039521 quants/SRR1039521/quant.sf
We can now load the quantification data using this table and tximeta()
:
se <- tximeta(coldata)
This command will load the data and attempt to identify and download the relevant transcriptome annotations. It may ask you whether you'd like to create a cache directory. You can answer "Yes (use default)" if you expect to be doing this type of analysis often, and would like to skip the download step each time. Alternatively, you may answer "No (use temp)" if not- this will let you download the annotations fresh each time you need them, and save a bit of storage space in your home directory.
Let's check the structure of the object that we've just created:
se
class: RangedSummarizedExperiment
dim: 177456 8
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(177456): ENST00000631435 ENST00000415118 ... ENST00000646356 ENST00000645792
rowData names(8): tx_id tx_biotype ... tx_id_version tx_name
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names
tximeta()
outputs a SummarizedExperiment object by default. You can learn more about what this means here, but for now, we just need to know that this object contains both the quantification data and the metadata for all 8 samples. Additionally, we can confirm that the rownames are transcripts (ENST...) instead of genes (ENSG...), as we saw above when examining the Salmon quant files. Also note that the colData names
reflect the columns that were present in the sample sheet. In fact, the full sample sheet (excluding the file names) is still stored in the SummarizedExperiment object. We can confirm that with the colData()
function:
colData(se)
DataFrame with 8 rows and 10 columns
SampleName cell dex albut Run avgLength Experiment Sample BioSample names
<factor> <factor> <factor> <factor> <factor> <integer> <factor> <factor> <factor> <factor>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669 SRR1039508
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675 SRR1039509
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678 SRR1039512
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670 SRR1039513
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682 SRR1039516
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673 SRR1039517
SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579 SAMN02422683 SRR1039520
SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580 SAMN02422677 SRR1039521
For this analysis, we want to generate a list of differentially expressed genes instead of transcripts. let's collapse the transcripts down to the gene level to enable that.
gse <- summarizeToGene(se)
gse
class: RangedSummarizedExperiment
dim: 37951 8
metadata(7): tximetaInfo quantInfo ... txdbInfo assignRanges
assays(3): counts abundance length
rownames(37951): ENSG00000000003 ENSG00000000005 ... ENSG00000288031 ENSG00000288053
rowData names(9): gene_id gene_name ... entrezid tx_ids
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names
Now we can see that we have many fewer entries (37951 genes vs 177456 transcripts), and that the rownames are all prefixed with ENSG instead of ENST. Also note that the metadata is preserved.
We now have what we need to perform differential expression analysis.
Bioconductor contains many packages that allows for differential expression analysis (DEA). One of the most popular ones is DESeq2, which we'll use here.
The first step is deciding on our design. In this context, the design refers to which variables we want to include in our analysis, and how we want to consider them. Recall that this dataset is composed of cell lines that were treated with dexamethasone or vehicle. The treatment status is outlined in the dex
column of our metadata, "untrt" for untreated and "trt" for treated. Our analysis depends on the "control" value being the first level; since we want "untrt" to be considered a control, we need to make sure that it comes first in the ordering.
levels(gse$dex)
[1] "trt" "untrt"
Since our control group "untrt" isn't the first level, we need to change this.
gse$dex <- relevel(gse$dex, "untrt")
levels(gse$dex)
[1] "untrt" "trt"
Now we have our control group set as the first level.
Looking at the metadata, the cell line used is listed in the cell
column. This example has 4 different cell lines used. Since comparing different cell lines may confound the analysis of the effects of dexamethasone, we will want to control for this variable in our design.
With our SummarizedExperiment object and design in hand, we can now create a DESeqDataSet object.
dds <- DESeqDataSet(gse, design = ~ cell + dex)
First we will filter genes which are not or very infrequently detected. This will help to speed up the analysis by preventing comparisons for genes which are barely detected anyway. A general recommendation from the DESeq2 authors is to keep genes which have at least 10 counts for a minimum number of samples (defined as the smallest group size). Here, we have two groups with 4 sample each, so our minimum number of samples is 4.
nrow_prefilter <- nrow(dds)
keep <- rowSums(counts(dds) >= 10) >= 4
dds <- dds[keep,]
cat("Number of genes before filtering:", nrow_prefilter,
"\nNumber of genes after filtering:", nrow(dds))
Number of genes before filtering: 37951
Number of genes after filtering: 14962
This filter removed about 23,000 genes.
There are many ways to explore the data prior to running statistical test. Here, we'll use principal components analysis (PCA) to qualitatively explore clustering of samples. There are many other ways to explore the data- you can see an example of other methods on the tutorial that inspired this one here.
Before we do any exploration, we need to address the data heteroscedasticity. This means that the variance of the data is not constant relative to the values of the data itself. Most exploratory data tools operate on the assumption that the variance is stable, but that is not generally true of count data, which includes RNAseq. DESeq2 provides transformation tools to address this. There are two options, variance-stabilizing transform (VST) and regularized-logarithm transformation (rlog). The DESeq2 authors give the following recommendations for choosing between them: - VST is faster than rlog, so may be preferred when analyzing large datasets (30+ samples) - VST is less sensitive to high count outliers than rlog - rlog may be preferred when there is a large difference in library size across samples (eg, an order of magnitude or more)
We can check the library size of each sample after filtering:
colSums(counts(dds))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
21108721 19351430 26062350 15685250 25142569 31799563 19656881 21815946
Since we do not have a large difference in library size across samples, we will use VST here. Because the statistical tests that we will run later in this tutorial require raw (not transformed) counts, we will be careful not to overwrite the raw counts.
vsd <- vst(dds, blind = FALSE)
vsd
class: DESeqTransform
dim: 14962 8
metadata(8): tximetaInfo quantInfo ... assignRanges version
assays(1): ''
rownames(14962): ENSG00000000003 ENSG00000000419 ... ENSG00000287977 ENSG00000288031
rowData names(13): gene_id gene_name ... allZero dispFit
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names
Here, we set blind = FALSE
to allow the algorithm to use the design that we established earlier to estimate group variances instead of an overall variance. Now that the data have been transformed, we can create a PCA plot to examine the data, grouped by dex
:
plotPCA(vsd, intgroup = "dex")
Now do the same, but group by cell
:
plotPCA(vsd, intgroup = "cell")
Note how the dex
grouping seems to separate the data in PCA space well, while the cell
grouping does not. This is a good sign that most of the transcriptional variance that was captured is due to treatment, and not cell line.
Now, we will use DESeq2 to perform differential expression analysis. These statistical tests will tell us which genes, if any, are more highly expressed in one group relative to another.
We need to use raw, not transformed, counts for this step. Therefore, we'll go back to our dds object instead of vsd. To run the analysis, we use the DESeq()
function.
dds <- DESeq(dds)
dds
class: DESeqDataSet
dim: 14962 8
metadata(8): tximetaInfo quantInfo ... assignRanges version
assays(7): counts abundance ... H cooks
rownames(14962): ENSG00000000003 ENSG00000000419 ... ENSG00000287977 ENSG00000288031
rowData names(43): gene_id gene_name ... deviance maxCooks
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(10): SampleName cell ... BioSample names
This command updates our dds object to add the results of the differential expression tests. We can summarize the results of those tests by using the results()
function:
res <- results(dds)
res
log2 fold change (MLE): dex trt vs untrt
Wald test p-value: dex trt vs untrt
DataFrame with 14962 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003 745.4556 -0.36829304 0.1011509 -3.641024972 2.71555e-04 1.99498e-03
ENSG00000000419 514.0339 0.19272943 0.1256458 1.533911213 1.25051e-01 2.95860e-01
ENSG00000000457 316.1027 -0.00012527 0.1563233 -0.000801355 9.99361e-01 9.99774e-01
ENSG00000000460 93.5087 -0.19217432 0.2820222 -0.681415452 4.95609e-01 7.05313e-01
ENSG00000000971 5816.2774 0.42807726 0.0859304 4.981675143 6.30362e-07 8.65921e-06
... ... ... ... ... ... ...
ENSG00000287725 24.8133 -0.5503712 0.665001 -0.827625 0.407883 0.629768
ENSG00000287856 37.8949 -0.5059185 0.412521 -1.226408 0.220045 0.431877
ENSG00000287908 18.2239 0.0930476 0.643105 0.144685 0.884960 0.948305
ENSG00000287977 11.6876 0.5146171 0.673482 0.764115 0.444799 0.662099
ENSG00000288031 13.1838 0.1969163 0.631894 0.311629 0.755323 0.880450
The output of results()
is a DataFrame with several entries: - baseMean
: the normalized average count value of all samples present for this gene - log2FoldChange
: the log2-transformed effect size estimate - lfcSE
: the standard error estimate of the log2FoldChange - stat
: the calculated Wald statistic - pvalue
: the Wald test p-value - padj
: the BH-adjusted p-value
It is important to note that by default, results()
will return the log2 fold change and p-values for the last entry in the design formula. Recall that when we first created our DESeqDataSet object, our design was ~ cell + dex
. Therefore, the changes that we see here are specifically for changes in dex
. We also set the first factor level of dex
to be "untrt", so that it's considered as the reference value. We can confirm all of this based on the first two lines of this output, which reveals that our comparison here is "dex trt vs untrt". For a more complex analysis, you can provide the comparison more specifically as a contrast. See the DESeq2 documentation for more information on contrasts.
You can also generate a short summary of the results using the summary()
function:
summary(res)
out of 14962 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 2258, 15%
LFC < 0 (down) : 2012, 13%
outliers [1] : 0, 0%
low counts [2] : 291, 1.9%
(mean count < 11)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
We can see that with an adjusted p-value cutoff of 0.1 and log2 fold change cutoff of 0, 15% of the tested genes were found to be upregulated in the dexamethasone-treated group, while 13% were found to be downregulated. We may choose to be more (or less) stringent with our cutoffs, which would impact the number of genes found to be differentially regulated.
The above results returns genes as ENSEMBL IDs, but it is often more informative to have gene symbols present. We can map these IDs to their matching gene symbols using the AnnotationDbi
package.
ens.str <- substr(rownames(res), 1, 15)
res$symbol <- mapIds(org.Hs.eg.db,
keys=ens.str,
column="SYMBOL",
keytype="ENSEMBL",
multiVals="first")
res
log2 fold change (MLE): dex trt vs untrt
Wald test p-value: dex trt vs untrt
DataFrame with 14962 rows and 7 columns
baseMean log2FoldChange lfcSE stat pvalue padj symbol
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character>
ENSG00000000003 745.4556 -0.36829304 0.1011509 -3.641024972 2.71555e-04 1.99498e-03 TSPAN6
ENSG00000000419 514.0339 0.19272943 0.1256458 1.533911213 1.25051e-01 2.95860e-01 DPM1
ENSG00000000457 316.1027 -0.00012527 0.1563233 -0.000801355 9.99361e-01 9.99774e-01 SCYL3
ENSG00000000460 93.5087 -0.19217432 0.2820222 -0.681415452 4.95609e-01 7.05313e-01 FIRRM
ENSG00000000971 5816.2774 0.42807726 0.0859304 4.981675143 6.30362e-07 8.65921e-06 CFH
... ... ... ... ... ... ... ...
ENSG00000287725 24.8133 -0.5503712 0.665001 -0.827625 0.407883 0.629768 NA
ENSG00000287856 37.8949 -0.5059185 0.412521 -1.226408 0.220045 0.431877 NA
ENSG00000287908 18.2239 0.0930476 0.643105 0.144685 0.884960 0.948305 NA
ENSG00000287977 11.6876 0.5146171 0.673482 0.764115 0.444799 0.662099 NA
ENSG00000288031 13.1838 0.1969163 0.631894 0.311629 0.755323 0.880450 NA
Now, res
has a column named symbol
that provides the gene symbol for each ENSEMBL gene ID when available. If no symbol is available, this column will instead have NA
.
Finally, we can write the results to a CSV file for export:
write.csv(res, "DEA_results.csv)
In this tutorial, we have demonstrated how to acquire raw sequencing data from SRA, how to map RNAseq to the human transcriptome using Salmon, how to import those quantifications into R, and how to use the DESeq2 package in Bioconductor to perform an exploratory data analysis and differential expression analysis. This was a brief introduction - there exist many other software packages and analysis options for this type of analysis, but this should provide a solid start.
Here are some links which might be useful: