Snakemake is a workflow management system. It is used to make computational worfklows that are well documented and reproducible. It is especially useful if you have complicated workflows with multiple tools and paremeters. Workflows are resuable and customizable and also very portable.
See the Snakemake documentation page for details: https://snakemake.readthedocs.io/en/stable/index.html
The easiest way to install Snakemake is with conda
as follows:
Step 1. Install Miniconda
(the installation script is in your home directory but could also be downloaded from here - https://docs.conda.io/en/latest/miniconda.html#linux-installers):
%%bash
bash Miniconda3-py39_4.11.0-Linux-x86_64.sh
Exit the terminal with the command exit
. Then log back into your account.
Step 2: Create a conda
environment called Snake
:
%%bash
conda create --name Snake python=3.6
conda activate Snake
Step 3: Install Snakemake
and graphvis
(optional) within the Snake
environment:
%%bash
conda install -c bioconda snakemake
conda install -c anaconda graphviz # only necessary for generating a workflow diagram
This is an optional step. The template includes the minimal GitHub-ready files for a Snakemake workflow:
%%bash
git clone https://github.com/taimontgomery/SnakemakeTutorial.git
If you clone the repository, change into the SnakemakeTutorial
directory and explore the contents using the ls
command:
%%bash
cd SnakemakeTutorial/
%%bash
ls
Let's start with a very simple workflow that combines two files into one using the cat
command.
1.1. Each step in a snakemake workflow is called a rule. This workflow will only have 1 rule but below we'll see a more complex example. Rules are written in a Snakefile. Create a file called Snakefile_cat
on your own computer using a text editor or use Snakefile
from the SnakemakeTutorial
repository.
Rules have the following general format although they may contain more or fewer elements:
'''
Anything embedded within ''' ''' is a comment.
Each rule starts with the keyword "rule" followed by a name, which is usually something descriptive.
After the rule comes a series of directives, indented by 4 spaces or a tab.
Within each directive are paremeters for that directive.
Comments can be included after a "#".
'''
rule cat:
'''
Each input and output file is specified as follows:
'''
input:
fin1="seq1.txt", # commas are used to separate items
fin2="seq2.txt"
output:
fout="seq_cat.txt"
'''
Within the shell directive, specify the command line code for the task.
'''
shell:
"cat {input.fin1} {input.fin2} >{output.fout}" # variables are called with dot notation referencing
# the directive and enclosed in "{}"
In the simple example above, we are using the Linux cat command to concatenate the contents of 2 files. It looks a bit complicated with all the comments but when we remove those below it is much simpler:
rule cat:
input:
fin1="seq1.txt",
fin2="seq2.txt"
output:
fout="seq_cat.txt"
shell:
"cat {input.fin1} {input.fin2} >{output.fout}"
1.2. Copy the code from above into the Snakefile
.
1.3. Create seq1.txt
and seq2.txt
files in your directory on the server using the following Linux commands:
%%bash
echo ATCG >seq1.txt
echo GGCG >seq2.txt
1.4. Now that we have the required input files on the server, run snakemake using one of the following commands depending on the name of your file:
snakemake -s Snakefile_cat
or
snakemake -s Snakefile
The -s
flag is only necessary if you're using a custom named snakefile (snakemake will look for a file called Snakefile
if the -s
flag is absent).
1.5. Confirm that the output file was created successfully and contains the sequences from seq1.txt
and seq2.txt
:
%%bash
less seq_cat.txt
Now let's write a more complex workflow to map reads to a reference genome and then prepare the data for visualization in the genome viewing software IGV
.
2.1. Our first rule will accomplish step 1 of our pipeline: aligning small RNA-seq reads to a reference genome. For this we will use bowtie2
. Copy the code below into the Snakemake
file in the repository (after deleting any other contents in the file) or create a new file.
rule map_reads:
input:
sample="samples/cond1_rep1.fastq"
output:
bam="results/cond1_rep1.bam"
conda:
"envs/mapping.yml"
shell:
"bowtie2 -x resources/ram1 {input.sample} | samtools view -bS - -o {output.bam}"
There's an extra directive in this code called conda
. This is used to specify which version of software to use for this rule. The paremeter is a YAML file called mapping.yml
within the envs
directory that contains all of the software needed for this workflow. Snakemake
will check to see if the software is installed and install it if needed. How would you modify mapping.yml
if you wanted to use different software?
Our code so far is not very efficient because we would need to run it separately for each library. There are a couple ways around this. I like to include a separate file, called a configuration file, that contains paremeters for the workflow, including file names. That allows us to change the file names without having to edit Snakemake file each time I run it.
2.2. Open config.yml
within the repository, or creater your own, and copy the following text to it:
samples:
- cond1_rep1
- cond1_rep2
- cond1_rep3
- cond2_rep1
- cond2_rep2
- cond2_rep3
In this configuration file, samples
is a list of file prefixes with each file within the list being indented with 2 spaces, a syntax feature of YAML. Now when we run snakemake from the command line, we'll use the flag --configfile
to direct the workflow to config.yml
, which we'll see at the end of this tutorial.
2.3. Now we can retrieve the file prefixes from config.yml
. We'll need to add another rule, all
, to the beginning of our snakefile to accomplish this:
rule all:
input:
expand("results/{sample}.bam", sample=config["samples"])
all
is a special rule for handling this type of situation. The expand()
function will generate every possible combination of file names with the wildcard, in this case {sample}
, where the value of the wildcard is retrieved from the configuration file with the second argument in the function, sample=config["samples"]
. config
will retrieve each value under the key samples
from the configuration file, which will be sequentially inserted in place of the wildcard {sample}
in the first argument. Snakemake
will then iterate over the other rules specified until it runs out of values.
2.4. We next need to modify our mapping rule so that it works with multiple files by replacing file names with wildcards:
rule map_reads:
input:
sample="samples/{sample}.fastq"
output:
bam="results/{sample}.bam"
conda:
"envs/mapping.yml"
shell:
"bowtie2 -x resources/ram1 {input.sample} | samtools view -bS - -o {output.bam}"
Now the all
rule will essentially iterate over the map_reads
rule until it runs out of values for {sample}
.
At this point, your entire Snakemake file should look like this:
rule all:
input:
expand("results/{sample}.bam", sample=config["samples"])
rule map_reads:
input:
sample="samples/{sample}.fastq"
output:
bam="results/{sample}.bam"
conda:
"envs/mapping.yml"
shell:
"bowtie2 -x resources/ram1 {input.sample} | samtools view -bS - -o {output.bam}"
2.5. Next we'll add another rule in which we'll sort the reads in the bam file based on their genomic coordinates, which is often required for downstream software, such as genome browsers. For this we'll use samtools sort
:
rule sort_reads:
input:
bam_fin="results/{sample}.bam"
output:
bam_sorted="results/{sample}.sorted.bam"
conda:
"envs/mapping.yml"
shell:
"samtools sort {input.bam_fin} -o {output.bam_sorted}"
2.6. When we add this rule to the end of our Snakemake file, the final output files change from .bam
files to .sorted.bam
file. This needs to be reflected in the all
rule because the expand()
function takes as its first argument the final output file names. Modify the all
rule within your Snakemake file as follows:
rule all:
input:
expand("results/{sample}.sorted.bam", sample=config["samples"])
At this point, your Snakemake should look like this:
rule all:
input:
expand("results/{sample}.sorted.bam", sample=config["samples"])
rule map_reads:
input:
sample="samples/{sample}.fastq"
output:
bam="results/{sample}.bam"
conda:
"envs/mapping.yml"
shell:
"bowtie2 -x resources/ram1 {input.sample} | samtools view -bS - -o {output.bam}"
rule sort_reads:
input:
bam_fin="results/{sample}.bam"
output:
bam_sorted="results/{sample}.sorted.bam"
conda:
"envs/mapping.yml"
shell:
"samtools sort {input.bam_fin} -o {output.bam_sorted}"
2.7. For our last rule, we'll create a bam index file (bai
), which is basically a table of contents for the sorted.bam
file that is required by many downstream programs:
rule bam_index:
input:
sorted_fin="results/{sample}.sorted.bam"
output:
bai="results/{sample}.sorted.bam.bai"
conda:
"envs/mapping.yml"
shell:
"samtools index {input.sorted_fin}"
Again because our final output file is now a bai
file, we need to edit the all
rule to reflect this:
rule all:
input:
expand("results/{sample}.sorted.bam.bai", sample=config["samples"])
We now have a workflow that we could use to map reads and prep them for plotting in a genome browser. We can edit our configuration file to include different samples wihout having to change the Snakefile from run to run.
2.8. Now let's edit the configuration file so that the reference genome, which in our example is ram1
, can be referenced there instead of within the Snakemake, which will make our workflow even more general. Add the following line of code to the end of config.yml
:
genome: "resources/ram1"
Your config.yml
file should now look like this:
samples:
- cond1_rep1
- cond1_rep2
- cond1_rep3
- cond2_rep1
- cond2_rep2
- cond2_rep3
genome: "resources/ram1"
2.9. We'll need to add another directive within the map_reads
rule called params
where we specify additional paremeters from config.yml
. Add the following code to the map_reads
rule, treating it like any of the other directives:
params:
bowtie_index=config["genome"]
2.10. We also need to modify the shell
directive in the map_reads
rule to reference the new parameter:
shell:
"bowtie2 -x {params.bowtie_index} {input.sample} | samtools view -bS - -o {output.bam}"
You're Snakemake should now look like this:
rule all:
input:
expand("results/{sample}.sorted.bam.bai", sample=config["samples"])
rule map_reads:
input:
sample="samples/{sample}.fastq"
output:
bam="results/{sample}.bam"
conda:
"envs/mapping.yml"
params:
bowtie_index=config["genome"]
shell:
"bowtie2 -x {params.bowtie_index} {input.sample} | samtools view -bS - -o {output.bam}"
rule sort_reads:
input:
bam_fin="results/{sample}.bam"
output:
bam_sorted="results/{sample}.sorted.bam"
conda:
"envs/mapping.yml"
shell:
"samtools sort {input.bam_fin} -o {output.bam_sorted}"
rule bam_index:
input:
sorted_fin="results/{sample}.sorted.bam"
output:
bai="results/{sample}.sorted.bam.bai"
conda:
"envs/mapping.yml"
shell:
"samtools index {input.sorted_fin}"
2.11. You'll notice when you run the workflow (we'll get to that soon) that a lot of information appears in the terminal. This is often valuable information that you will want to save. With Snakemake you can capture this information by including another directive called log
within a rule you want to capture terminal output from. When you run bowtie2
it displays a summary of the alignment results in the terminal. Let's capture this information in a log file) so that it's not lost when we close the terminal.
First, we'll need to add the log
directive to the map_reads
rule, treating it like other directives:
log:
"results/{sample}.log"
Notice that we'll be generating a separate log file for each sample and following the standard approach we'll use .log
as the file extension and store these files in the results
directory.
Next we're going to add some cryptic code (2>
) for redirecting standard error to the shell
directive to direct the terminal output to the .log
file:
shell:
"bowtie2 -x {params.bowtie_index} {input.sample} 2> {log} | samtools view -bS - -o {output.bam}"
You're final Snakemake file should look like this:
rule all:
input:
expand("results/{sample}.sorted.bam.bai", sample=config["samples"])
rule map_reads:
input:
sample="samples/{sample}.fastq"
output:
bam="results/{sample}.bam"
conda:
"envs/mapping.yml"
params:
bowtie_index=config["genome"]
log:
"results/{sample}.log"
shell:
"bowtie2 -x {params.bowtie_index} {input.sample} 2> {log} | samtools view -bS - -o {output.bam}"
rule sort_reads:
input:
bam_fin="results/{sample}.bam"
output:
bam_sorted="results/{sample}.sorted.bam"
conda:
"envs/mapping.yml"
shell:
"samtools sort {input.bam_fin} -o {output.bam_sorted}"
rule bam_index:
input:
sorted_fin="results/{sample}.sorted.bam"
output:
bai="results/{sample}.sorted.bam.bai"
conda:
"envs/mapping.yml"
shell:
"samtools index {input.sorted_fin}"
Done!!! Let's visualize our workflow before we run it. Run the following code in your terminal:
snakemake --configfile config.yml --dag | dot -Tpdf > dag.pdf
This command calls snakemake, specifies the configuration files, and pipes the DAG (Directed Acyclic Graph) of the workflow generated by snakemake to the dot tool withing the graphviz program to generate a PDF that you can view in any PDF viewer.
We'll do a quick dry run to check for errors as follows, where the -n
flag specifies that this is a dry run (output files won't be generated):
snakemake -n --use-conda --configfile config.yml
And finally, let's run it for real, assuming there are no error messages. Copy the code below to your terminal:
snakemake --use-conda --configfile config.yml
The --use-conda
flag specifies that the code should be run in a conda
environment, which is necessary here since we're relying on the bowtie2
and samtools
software to be installed within a conda
environment.
As mentioned above, the --config
flag allows us to specify a configuration file.