RNA-seq: Transcriptome Assembly and Differential Expression Analysis

In this exercise we will go through an RNA-seq pipeline, including reference-free transcriptome assembly and differential gene expression analysis.

We will use the Trinity software package for transcriptome assembly (https://github.com/trinityrnaseq/trinityrnaseq/wiki).

See the following tutorial for additional guidance, particularly if you are interested in using the Trinity scripts available for automating steps in the pipeline: Haas et al (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols. https://www.nature.com/articles/nprot.2013.084

OUTLINE

  1. Getting started.
  2. Trim adapters and filter low quality reads.
  3. De novo assemble transcriptome.
  4. Estimate the number of reads aligning to each transcript in each library.
  5. Identify differentially expressed genes.

Part 1. Getting started

1.1. Connect to the server containing the data:

1.2. Change into the directory containing the materials for this exercise:

1.3. Examine the contents of the directory:

We will analyze paired-end data from human brain and control tissues, each with 3 biological replicates:

ref_rep1_1.fastq.gz
ref_rep1_2.fastq.gz
ref_rep2_1.fastq.gz
ref_rep2_2.fastq.gz
ref_rep3_1.fastq.gz
ref_rep3_2.fastq.gz
brain_rep1_1.fastq.gz
brain_rep1_2.fastq.gz
brain_rep2_1.fastq.gz
brain_rep2_2.fastq.gz
brain_rep3_1.fastq.gz
brain_rep3_2.fastq.gz

1.4. Create a conda environment to install and run Trinity in:


Part 2. Trim adapters and filter low quality reads

Here, we will use fastp to remove adapter sequence and low quality bases from reads and filter out low quality reads and reads shorter than 36 nt.

See the fastp manual for more details - https://github.com/OpenGene/fastp

The paper describing fastp is here - https://academic.oup.com/bioinformatics/article/34/17/i884/5093234

2.1. Trim off adapters and filter low quality reads from each of the samples. There are two samples, ref, and brain. This was a paired-end sequencing experiment so there are 2 libraries, left end (_1)and right end (_2), for each library.

This is the actual code with the fastq file names (the code can be combined into a bash script or run consecutively by pasting into the terminal):

2.2. Examine the fastp.html reports (we'll go over this together).


Part 3. De novo transcriptome assembly

Trinity will be used to assemble transcripts.

Want to know how it works? See https://www.broadinstitute.org/videos/trinity-how-it-works

3.1. Combine the right RNA-seq reads (_1) and the left RNA-seq reads (_2) from all samples into single files (you will have two files in the end):

cat is a unix command for concatenating (combining) fies.

* is a wildcard character that can match any character or no character at all.

3.2. Assemble the reads into a reference transcriptome (Note: this data is not strand-specific. If it was we would need to specify which strand is the reference strand with --SS_lib_type):

--seqType      # fastq (fq)
--left         # left reads fastq file
--right        # right reads fastq file
--CPU          # controls the number of threads used
--max_memory   # controls the amount of RAM used

After the run is complete, the transcripts will be in the output file trinity_out_dir.Trinity.fasta, a name automatically generated by Trinity.

3.3. Compute the number of transcipts and assembled bases:

grep is a utility for searching for patterns in file or file names. It will print lines to the terminal than contain a match to the pattern of interest, such as ">". Note that here we pipe the output to the word count command, wc.

The -v option modifies grep to print lines that do not contain the pattern.

For more information on grep, see https://www.gnu.org/software/grep/manual/grep.html.

3.4. Calculate the average transcript length using the values from above (you can access a basic calculator from the terminal with the command bc - type quit or ctrl+d to exit the calculator):

The Trinity suite has a script (TrinityStats.pl) that can provide a more detailed analysis but we will not be using it.


Part 4. Estimate the number of reads aligning to each transcript in each library

Here we will count the numbers of reads aligning to each trancript in each of our libraries using Bowtie2 and RSEM. We could take a similar approach if we were working with an annotated genome and not relying on Trinity. However, we would likely use star and a genome sequence rather than Bowtie2 and transcript sequences for alignments because star works better for mapping reads across splice junctions.

We don't need to directly call bowtie2 because RSEM will do it for us.

A tutorial on RSEM can be found here - https://github.com/bli25broad/RSEM_tutorial.

A paper describing RSEM is here - https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-323

A paper describing Bowtie2 is here - https://www.nature.com/articles/nmeth.1923

This can be automated with Trinity using the script align_and_estimate_abundance.pl.

4.1. Create a bowtie index from the Trinity transcripts that will be used for mapping our reads to with Bowtie2 (replace 'prefix' with a more descriptive name for the prefix, such as hsa and transcripts.fasta with the Trinity output file). For this we'll use rsem-prepare-reference as follows:

This is what your code should look like:

4.2. Align reads and count reads per transcript for each library using RSEM. This step has to be repeated for each library). You'll need to edit the names in '' in the following code:

We can use a bash script to accomplish this step. Write your own or copy the contents of the cell below to a text editor file named align.sh.

Copy the script into the TodosSantosTrinity2022 directory on the server and run it from the command line as follows:

4.3. Examine the RSEM counts summary file for each library (sample.stat/smaple.cnt) using the unix less command:

The file will have the following format:

line1: N0 N1 N2 N_tot
#  N0, unalignable reads; N1, alignable reads; N2, discared due to too many alignments; N_tot = N0 + N1 + N2

line2: nUnique nMulti nUncertain 
# nUnique: number of reads aligned uniquely to a gene
# nMulti: number of reads aligned to multiple genes 
# nUncertain: number of reads aligned to multiple locations in the given reference sequences

line3:
nHits read_type
# nHits: number of total alignments. 
# read_type: 0, single-end read, no quality score; 1, single-end read, with quality score; 2, paired-end read, no quality score; 3, paired-end read, with quality score

line4 etc:
0    Reads withh no alignments
1    Reads with 1 aligngment
...
Inf  Filtered because of too many alignments

4.4. RSEM will generate gene level and isoform level counts files for the aligned reads. Examine one of the RSEM gene level counts files (sample.genes.results, we'll go over this together):


Part 5. Identify differentially expressed genes

We'll use DESeq2 to identify differentially expressed genes. DESeq2 is an R program and we will be working within the R environment in the terminal.

For information about DESeq2, see its bioconductor page - https://www.bioconductor.org/packages//2.10/bioc/html/DESeq.html

For the paper describing DESeq2, see - https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-10-r106

Differential expression analysis can be automated in Trinity with the scripts abundance_estimates_to_matrix.pl, run_DE_analysis.pl, and analyze_diff_expr.pl but we'll do it manually here.

5.1. This exercise requires two R packages - DESeq2 and tximport. To install them, we use the following commands:

Typing R in the terminal launches the R console.

BiocManager is package manager for the Bioconductor project which hosts a lot of genomics tools. The if statement split across the first 2 lines will install BiocManager if it's not already installed).

To load the packages, use the following commands:

5.2. To start the differential expression analysis, we'll create a vector of the names of the files containing the transcript abundance measurements we generated using RSEM:

5.3. Next, we'll use an R package called tximport to import the abundance measurement from RSEM into DESeq2 (this is a multi-step process):

5.4. Now we'll use DESeq2 to compare the levels of each gene between the two sample groups, brain and ref:

5.5. Let's write the results to output files:

5.6. Calculate the mean of each set of biological replicates:

5.7. Subset the genes that are significantly different between brain and ref (p < 0.05):

5.8. Plot the mean counts for brain against the mean counts for ref:

There are many tools that automate differentiall gene expression analysis and plotting. Feel free on your own to try one of our tools - https://github.com/MontgomeryLab/DESeq2App.

If we have time, we'll blast a couple of the differentially expressed transcripts against human databases to see what they are.

From our analysis, we now have some crude transcripts, as well as counts tables and a table and plot of differentially expressed transcripts. But we don't know what the transcripts are. So next we might want to look for homology between our transcripts and characterized transcripts and perhaps search for fuctional domains and other features that would help us to characterize our transcripts. This can be automated to some extent using Trinotate available from https://github.com/Trinotate/Trinotate.github.io/wiki.