In this exercise we will go through an RNA-seq pipeline, including reference-free transcriptome assembly and differential gene expression analysis.
We will use the Trinity software package for transcriptome assembly (https://github.com/trinityrnaseq/trinityrnaseq/wiki).
See the following tutorial for additional guidance, particularly if you are interested in using the Trinity scripts available for automating steps in the pipeline: Haas et al (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols. https://www.nature.com/articles/nprot.2013.084
1.1. Connect to the server containing the data:
%%bash
ssh yourname@serveraddress
1.2. Change into the directory containing the materials for this exercise:
%%bash
cd ~/TodosSantosTrinity2022
1.3. Examine the contents of the directory:
%%bash
ls
We will analyze paired-end data from human brain and control tissues, each with 3 biological replicates:
ref_rep1_1.fastq.gz
ref_rep1_2.fastq.gz
ref_rep2_1.fastq.gz
ref_rep2_2.fastq.gz
ref_rep3_1.fastq.gz
ref_rep3_2.fastq.gz
brain_rep1_1.fastq.gz
brain_rep1_2.fastq.gz
brain_rep2_1.fastq.gz
brain_rep2_2.fastq.gz
brain_rep3_1.fastq.gz
brain_rep3_2.fastq.gz
1.4. Create a conda environment to install and run Trinity
in:
%%bash
conda create --name Trinity
conda activate Trinity
conda install -c bioconda trinity=2.8.5
Here, we will use fastp
to remove adapter sequence and low quality bases from reads and filter out low quality reads and reads shorter than 36 nt.
See the fastp
manual for more details - https://github.com/OpenGene/fastp
The paper describing fastp
is here - https://academic.oup.com/bioinformatics/article/34/17/i884/5093234
2.1. Trim off adapters and filter low quality reads from each of the samples. There are two samples, ref
, and brain
. This was a paired-end sequencing experiment so there are 2 libraries, left end (_1
)and right end (_2
), for each library.
%%bash
fastp \
-i library_name_1.fastq -I library_name_2.fastq \
-o trimmed_library_name_1.fastq -O trimmed_library_name_2.fastq \
--length_required 36 --html fastp.html
This is the actual code with the fastq file names (the code can be combined into a bash script or run consecutively by pasting into the terminal):
%%bash
fastp \
-i ref_rep1_1.fastq.gz -I ref_rep1_2.fastq.gz \
-o trimmed_ref_rep1_1.fastq.gz -O trimmed_ref_rep1_2.fastq.gz \
--length_required 36 --html ref_rep1_fastp.html
fastp \
-i ref_rep2_1.fastq.gz -I ref_rep2_2.fastq.gz \
-o trimmed_ref_rep2_1.fastq.gz -O trimmed_ref_rep2_2.fastq.gz \
--length_required 36 --html ref_rep2_fastp.html
fastp \
-i ref_rep3_1.fastq.gz -I ref_rep3_2.fastq.gz \
-o trimmed_ref_rep3_1.fastq.gz -O trimmed_ref_rep3_2.fastq.gz \
--length_required 36 --html ref_rep3_fastp.html
fastp \
-i brain_rep1_1.fastq.gz -I brain_rep1_2.fastq.gz \
-o trimmed_brain_rep1_1.fastq.gz -O trimmed_brain_rep1_2.fastq.gz \
--length_required 36 --html brain_rep1_fastp.html
fastp \
-i brain_rep2_1.fastq.gz -I brain_rep2_2.fastq.gz \
-o trimmed_brain_rep2_1.fastq.gz -O trimmed_brain_rep2_2.fastq.gz \
--length_required 36 --html brain_rep2_fastp.html
fastp \
-i brain_rep3_1.fastq.gz -I brain_rep3_2.fastq.gz \
-o trimmed_brain_rep3_1.fastq.gz -O trimmed_brain_rep3_2.fastq.gz \
--length_required 36 --html brain_rep3_fastp.html
2.2. Examine the fastp.html
reports (we'll go over this together).
Trinity will be used to assemble transcripts.
Want to know how it works? See https://www.broadinstitute.org/videos/trinity-how-it-works
3.1. Combine the right RNA-seq reads (_1
) and the left RNA-seq reads (_2
) from all samples into single files (you will have two files in the end):
%%bash
cat trimmed*_1*gz >reads.ALL.left.fastq.gz
cat trimmed*_2*gz >reads.ALL.right.fastq.gz
cat
is a unix command for concatenating (combining) fies.
*
is a wildcard character that can match any character or no character at all.
3.2. Assemble the reads into a reference transcriptome (Note: this data is not strand-specific. If it was we would need to specify which strand is the reference strand with --SS_lib_type
):
%%bash
Trinity \
--seqType fq \
--left reads.ALL.left.fastq.gz \
--right reads.ALL.right.fastq.gz \
--CPU 4 \
--max_memory 20G
--seqType # fastq (fq)
--left # left reads fastq file
--right # right reads fastq file
--CPU # controls the number of threads used
--max_memory # controls the amount of RAM used
After the run is complete, the transcripts will be in the output file trinity_out_dir.Trinity.fasta
, a name automatically generated by Trinity
.
3.3. Compute the number of transcipts and assembled bases:
%%bash
grep ">" trinity_out_dir/Trinity.fasta | wc -l
grep -v ">" trinity_out_dir/Trinity.fasta | wc -m
grep
is a utility for searching for patterns in file or file names. It will print lines to the terminal than contain a match to the pattern of interest, such as ">"
. Note that here we pipe the output to the word count command, wc
.
The -v
option modifies grep to print lines that do not contain the pattern.
For more information on grep
, see https://www.gnu.org/software/grep/manual/grep.html.
3.4. Calculate the average transcript length using the values from above (you can access a basic calculator from the terminal with the command bc
- type quit
or ctrl+d
to exit the calculator):
total bases/number of transcripts
The Trinity
suite has a script (TrinityStats.pl
) that can provide a more detailed analysis but we will not be using it.
Here we will count the numbers of reads aligning to each trancript in each of our libraries using Bowtie2
and RSEM
. We could take a similar approach if we were working with an annotated genome and not relying on Trinity. However, we would likely use star
and a genome sequence rather than Bowtie2
and transcript sequences for alignments because star
works better for mapping reads across splice junctions.
We don't need to directly call bowtie2
because RSEM
will do it for us.
A tutorial on RSEM can be found here - https://github.com/bli25broad/RSEM_tutorial.
A paper describing RSEM is here - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323
A paper describing Bowtie2
is here - https://www.nature.com/articles/nmeth.1923
This can be automated with Trinity
using the script align_and_estimate_abundance.pl
.
4.1. Create a bowtie
index from the Trinity
transcripts that will be used for mapping our reads to with Bowtie2
(replace 'prefix' with a more descriptive name for the prefix, such as hsa
and transcripts.fasta
with the Trinity
output file). For this we'll use rsem-prepare-reference
as follows:
%%bash
rsem-prepare-reference --num-threads 4 --bowtie2 'transcripts.fasta' 'prefix'
This is what your code should look like:
%%bash
rsem-prepare-reference --num-threads 4 --bowtie2 trinity_out_dir/Trinity.fasta hsa
4.2. Align reads and count reads per transcript for each library using RSEM
. This step has to be repeated for each library). You'll need to edit the names in ''
in the following code:
%%bash
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
'left_reads.fastq' 'right_reads.fastq' 'prefix' 'name'
We can use a bash script to accomplish this step. Write your own or copy the contents of the cell below to a text editor file named align.sh
.
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_ref_rep1_1.fastq.gz trimmed_ref_rep1_2.fastq.gz hsa ref1
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_ref_rep2_1.fastq.gz trimmed_ref_rep2_2.fastq.gz hsa ref2
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_ref_rep3_1.fastq.gz trimmed_ref_rep3_2.fastq.gz hsa ref3
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_brain_rep1_1.fastq.gz trimmed_brain_rep1_2.fastq.gz hsa brain1
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_brain_rep2_1.fastq.gz trimmed_brain_rep2_2.fastq.gz hsa brain2
rsem-calculate-expression \
--num-threads 4 --bowtie2 --paired-end --append-names \
trimmed_brain_rep3_1.fastq.gz trimmed_brain_rep3_2.fastq.gz hsa brain3
Copy the script into the TodosSantosTrinity2022
directory on the server and run it from the command line as follows:
%%bash
bash align.sh
4.3. Examine the RSEM
counts summary file for each library (sample.stat/smaple.cnt
) using the unix less
command:
%%bash
less ref1.stat/ref1.cnt
The file will have the following format:
line1: N0 N1 N2 N_tot
# N0, unalignable reads; N1, alignable reads; N2, discared due to too many alignments; N_tot = N0 + N1 + N2
line2: nUnique nMulti nUncertain
# nUnique: number of reads aligned uniquely to a gene
# nMulti: number of reads aligned to multiple genes
# nUncertain: number of reads aligned to multiple locations in the given reference sequences
line3:
nHits read_type
# nHits: number of total alignments.
# read_type: 0, single-end read, no quality score; 1, single-end read, with quality score; 2, paired-end read, no quality score; 3, paired-end read, with quality score
line4 etc:
0 Reads withh no alignments
1 Reads with 1 aligngment
...
Inf Filtered because of too many alignments
4.4. RSEM
will generate gene level and isoform level counts files for the aligned reads. Examine one of the RSEM
gene level counts files (sample.genes.results
, we'll go over this together):
%%bash
less ref1.genes.results
We'll use DESeq2
to identify differentially expressed genes. DESeq2
is an R
program and we will be working within the R
environment in the terminal.
For information about DESeq2
, see its bioconductor page - https://www.bioconductor.org/packages//2.10/bioc/html/DESeq.html
For the paper describing DESeq2
, see - https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106
Differential expression analysis can be automated in Trinity
with the scripts abundance_estimates_to_matrix.pl
, run_DE_analysis.pl
, and analyze_diff_expr.pl
but we'll do it manually here.
5.1. This exercise requires two R
packages - DESeq2
and tximport
. To install them, we use the following commands:
%%bash
R
Typing R
in the terminal launches the R
console.
%%R
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("tximport")
BiocManager::install("DESeq2")
BiocManager
is package manager for the Bioconductor project which hosts a lot of genomics tools. The if
statement split across the first 2 lines will install BiocManager
if it's not already installed).
To load the packages, use the following commands:
%%R
library("tximport")
library("DESeq2")
5.2. To start the differential expression analysis, we'll create a vector of the names of the files containing the transcript abundance measurements we generated using RSEM
:
%%R
# create a vector with file names
# the c() function combines elements into a vector
gene_counts = c("ref1.genes.results", "ref2.genes.results", "ref3.genes.results", "brain1.genes.results", "brain2.genes.results", "brain3.genes.results")
# create vector with sample names
sample_names = c("ref1", "ref2", "ref3", "brain1", "brain2", "brain3")
# link the sample names and file names by adding names to the gene_counts vector members
# this uses two R functions, names() and paste()
names(gene_counts) = sample_names
5.3. Next, we'll use an R
package called tximport
to import the abundance measurement from RSEM
into DESeq2
(this is a multi-step process):
%%R
# import counts data
txi_rsem = tximport(gene_counts, type = "rsem",)
# change genes with an effective length of 0 to have an effective length of 1
# the $ operator is used to extract a subset of a data object
txi_rsem$length[txi_rsem$length == 0] = 1
# create a sample table data frame (data frames are a common type of table in R)
sample_table = data.frame(condition = c("ref", "ref", "ref", "brain", "brain", "brain"))
# see what the data frame looks like
sample_table
# add sample names to sample_table using the rownames() and colnames() functions
rownames(sample_table) = colnames(txi_rsem$counts)
# see what the modified data frame looks like
sample_table
# create a DESeq dataset (commonly given the name dds)
dds = DESeqDataSetFromTximport(txi_rsem, colData=sample_table, design=~condition)
5.4. Now we'll use DESeq2
to compare the levels of each gene between the two sample groups, brain
and ref
:
%%R
# run DESeq on dds dataset
dds = DESeq(dds)
# create a results table comparing brain to ref using the DESeq results() function
res = results(dds, contrast=c("condition", "brain", "ref"))
# examine the first 6 rows of the results table
head(res)
5.5. Let's write the results to output files:
%%R
# write DGE results to a csv file using the write.csv(data_to_write, file_name) function
write.csv(res, 'Brain_vs_Ref.csv')
# write normalized counts to a csv file
# we can get the counts using the DESeq counts() function and we'll specify that the data should be normalized
write.csv(counts(dds, normalized=TRUE), 'Counts_Table.csv')
5.6. Calculate the mean of each set of biological replicates:
%%R
# create a data frame with the counts data
cnts = counts(dds, normalized=TRUE)
# create a new data frame, cnts_avg, and poplulate with the counts data as a placeholder
cnts_avg = cnts[,1:2]
# replace column 1 with mean counts for ref
cnts_avg[,1] = rowMeans(cnts[,1:3])
# replace column 2 with mean counts for brain
cnts_avg[,2] = rowMeans(cnts[,4:6])
# optional: rename columns for accuracy
colnames(cnts_avg) = c("ref_mean","brain_mean")
5.7. Subset the genes that are significantly different between brain
and ref
(p < 0.05):
%%R
# subset the results table to contain only significantly differentially expressed genes
# the as.data.frame() converts an object to a data frame
# the subset() function returns a subset of data that meets a certain condition
# recall that res is our DESeq results table generated in step 5.4
res_sig = as.data.frame(subset(res, padj < 0.05))
# get only the signicantly differentially expressed genes from the counts_avg table
# rownames(res_sig) will return only the data matching the row names (genes) in res_sig
# square brackets, [], can be used to extract certain elements from a data object
# data_object[rows, columns], if either side of the the ',' is empty, all rows or columns will be returned
cnts_avg_sig = cnts_avg[rownames(res_sig),]
5.8. Plot the mean counts for brain
against the mean counts for ref
:
%%R
# intialize the plot output file with the pdf() function
pdf("plot.pdf")
# plot all data points using the generic plot() function
plot(log2(cnts_avg[,1]), log2(cnts_avg[,2]), main="brain vs ref", xlab="ref", ylab="brain", pch=16, col="grey80")
# overlay data points for which p < 0.05
points(log2(cnts_avg_sig[,1]), log2(cnts_avg_sig[,2]), pch=16, col="cyan2")
# add y = x line
abline(0,1, col="grey60")
# add y = 2x line
abline(1,1, col="grey60")
# add y = 0.5x line
abline(-1,1, col="grey60")
# close the plot output file
dev.off()
There are many tools that automate differentiall gene expression analysis and plotting. Feel free on your own to try one of our tools - https://github.com/MontgomeryLab/DESeq2App.
If we have time, we'll blast
a couple of the differentially expressed transcripts against human databases to see what they are.
From our analysis, we now have some crude transcripts, as well as counts tables and a table and plot of differentially expressed transcripts. But we don't know what the transcripts are. So next we might want to look for homology between our transcripts and characterized transcripts and perhaps search for fuctional domains and other features that would help us to characterize our transcripts. This can be automated to some extent using Trinotate
available from https://github.com/Trinotate/Trinotate.github.io/wiki.