Snakemake

Snakemake is a workflow management system. It is used to make computational worfklows that are well documented and reproducible. It is especially useful if you have complicated workflows with multiple tools and paremeters. Workflows are resuable and customizable and also very portable.


Getting Started in Snakemake

See the Snakemake documentation page for details: https://snakemake.readthedocs.io/en/stable/index.html

Install Snakemake

The easiest way to install Snakemake is with conda as follows:

Step 1. Install Miniconda (the installation script is in your home directory but could also be downloaded from here - https://docs.conda.io/en/latest/miniconda.html#linux-installers):

Exit the terminal with the command exit. Then log back into your account.

Step 2: Create a conda environment called Snake:

Step 3: Install Snakemake and graphvis (optional) within the Snake environment:

Clone a Snakemake Template from GitHub

This is an optional step. The template includes the minimal GitHub-ready files for a Snakemake workflow:

If you clone the repository, change into the SnakemakeTutorial directory and explore the contents using the ls command:


Writing Snakemake Workflows

Part 1. A basic introduction using cat

Let's start with a very simple workflow that combines two files into one using the cat command.

Workflow Rules

1.1. Each step in a snakemake workflow is called a rule. This workflow will only have 1 rule but below we'll see a more complex example. Rules are written in a Snakefile. Create a file called Snakefile_cat on your own computer using a text editor or use Snakefile from the SnakemakeTutorial repository.

Rules have the following general format although they may contain more or fewer elements:

In the simple example above, we are using the Linux cat command to concatenate the contents of 2 files. It looks a bit complicated with all the comments but when we remove those below it is much simpler:

1.2. Copy the code from above into the Snakefile.

1.3. Create seq1.txt and seq2.txt files in your directory on the server using the following Linux commands:

1.4. Now that we have the required input files on the server, run snakemake using one of the following commands depending on the name of your file:

The -s flag is only necessary if you're using a custom named snakefile (snakemake will look for a file called Snakefile if the -s flag is absent).

1.5. Confirm that the output file was created successfully and contains the sequences from seq1.txt and seq2.txt:

Part 2. A read-mapping workflow.

Now let's write a more complex workflow to map reads to a reference genome and then prepare the data for visualization in the genome viewing software IGV.

Our pipeline will have 3 computational steps that we will write a workflow to accomplish:

2.1. Our first rule will accomplish step 1 of our pipeline: aligning small RNA-seq reads to a reference genome. For this we will use bowtie2. Copy the code below into the Snakemake file in the repository (after deleting any other contents in the file) or create a new file.

There's an extra directive in this code called conda. This is used to specify which version of software to use for this rule. The paremeter is a YAML file called mapping.yml within the envs directory that contains all of the software needed for this workflow. Snakemake will check to see if the software is installed and install it if needed. How would you modify mapping.yml if you wanted to use different software?

Our code so far is not very efficient because we would need to run it separately for each library. There are a couple ways around this. I like to include a separate file, called a configuration file, that contains paremeters for the workflow, including file names. That allows us to change the file names without having to edit Snakemake file each time I run it.

2.2. Open config.yml within the repository, or creater your own, and copy the following text to it:

In this configuration file, samples is a list of file prefixes with each file within the list being indented with 2 spaces, a syntax feature of YAML. Now when we run snakemake from the command line, we'll use the flag --configfile to direct the workflow to config.yml, which we'll see at the end of this tutorial.

2.3. Now we can retrieve the file prefixes from config.yml. We'll need to add another rule, all, to the beginning of our snakefile to accomplish this:

all is a special rule for handling this type of situation. The expand() function will generate every possible combination of file names with the wildcard, in this case {sample}, where the value of the wildcard is retrieved from the configuration file with the second argument in the function, sample=config["samples"]. config will retrieve each value under the key samples from the configuration file, which will be sequentially inserted in place of the wildcard {sample} in the first argument. Snakemake will then iterate over the other rules specified until it runs out of values.

2.4. We next need to modify our mapping rule so that it works with multiple files by replacing file names with wildcards:

Now the all rule will essentially iterate over the map_reads rule until it runs out of values for {sample}.

At this point, your entire Snakemake file should look like this:

2.5. Next we'll add another rule in which we'll sort the reads in the bam file based on their genomic coordinates, which is often required for downstream software, such as genome browsers. For this we'll use samtools sort:

2.6. When we add this rule to the end of our Snakemake file, the final output files change from .bam files to .sorted.bam file. This needs to be reflected in the all rule because the expand() function takes as its first argument the final output file names. Modify the all rule within your Snakemake file as follows:

At this point, your Snakemake should look like this:

2.7. For our last rule, we'll create a bam index file (bai), which is basically a table of contents for the sorted.bam file that is required by many downstream programs:

Again because our final output file is now a bai file, we need to edit the all rule to reflect this:

We now have a workflow that we could use to map reads and prep them for plotting in a genome browser. We can edit our configuration file to include different samples wihout having to change the Snakefile from run to run.

2.8. Now let's edit the configuration file so that the reference genome, which in our example is ram1, can be referenced there instead of within the Snakemake, which will make our workflow even more general. Add the following line of code to the end of config.yml:

Your config.yml file should now look like this:

2.9. We'll need to add another directive within the map_reads rule called params where we specify additional paremeters from config.yml. Add the following code to the map_reads rule, treating it like any of the other directives:

2.10. We also need to modify the shell directive in the map_reads rule to reference the new parameter:

You're Snakemake should now look like this:

2.11. You'll notice when you run the workflow (we'll get to that soon) that a lot of information appears in the terminal. This is often valuable information that you will want to save. With Snakemake you can capture this information by including another directive called log within a rule you want to capture terminal output from. When you run bowtie2 it displays a summary of the alignment results in the terminal. Let's capture this information in a log file) so that it's not lost when we close the terminal.

First, we'll need to add the log directive to the map_reads rule, treating it like other directives:

Notice that we'll be generating a separate log file for each sample and following the standard approach we'll use .log as the file extension and store these files in the results directory.

Next we're going to add some cryptic code (2>) for redirecting standard error to the shell directive to direct the terminal output to the .log file:

You're final Snakemake file should look like this:

Done!!! Let's visualize our workflow before we run it. Run the following code in your terminal:

This command calls snakemake, specifies the configuration files, and pipes the DAG (Directed Acyclic Graph) of the workflow generated by snakemake to the dot tool withing the graphviz program to generate a PDF that you can view in any PDF viewer.

We'll do a quick dry run to check for errors as follows, where the -n flag specifies that this is a dry run (output files won't be generated):

And finally, let's run it for real, assuming there are no error messages. Copy the code below to your terminal:

The --use-conda flag specifies that the code should be run in a conda environment, which is necessary here since we're relying on the bowtie2 and samtools software to be installed within a conda environment.

As mentioned above, the --config flag allows us to specify a configuration file.