The purpose of this exercise is to introduce bash scripting for creating command line workflows.
Script: A series of commands or instructions to automate a task. The commands are written in a text file that is then executed by a program without being first compiled (converted into the binary machine code).
Scripting language: A computer programming language that supports scripts. The scripts are typically interpreted by the program and do not have to be compiled.
In this exercise, we’ll discuss variable assignment and manipulation. The goal is to demonstrate the need to automate repeated tasks, which will be done in the next exercise in which we introduce scripts.
1.1. Open a shell or terminal window.
1.2. Assign a DNA sequence to a variable (sequence
):
%%bash
sequence=ACTGTACGGTACAC
Variable assignment in bash does not allow spaces before or after =
. The variable will be stored until you terminate your terminal session.
To reference a stored variable, we need to prepend a $
to it. So to get the stored value of sequence
, we would use $sequence
. See below.
1.3. Complement the sequence using tr
:
%%bash
echo $sequence | tr [ACTGactg] [TGACtgac]
tr
generally makes single character substitutions, in this case, substituting each character in the left set of brackets with a character in the same position in the right set of brackets. The number of characters must be the same in each set of brackets. The brackets themselves are optional.
1.4. Reverse the sequence using rev
:
%%bash
echo $sequence | rev
The rev
command reverses a string of characters.
1.5. Reverse complement the sequence using tr
and rev
:
%%bash
echo $sequence | rev | tr [ACTGactg] [TGACtgac]
Notice that we can redirect the output from one command into another using a pipe (|
).
1.6. Calculate the length of the sequence using wc
:
%%bash
echo -n $sequence | wc
The word count command, wc
, returns the number of lines, words, and characters in that order.
In this exercise, we will write a bash script to identify the length, the complement, the reverse, and the reverse complement of a DNA sequence.
A good text editor is essential for writing scripts. Microsoft Word and Mac TextEdit should not be used for writing scripts.
Common Free Text Editors:
2.1. Open a shell or terminal window.
2.2. Create a new directory named bash_scripts
using mkdir
.
2.3. Change into the bash_scripts
directory using cd
.
2.4. Open a text editor, such as BBEDIT
or Notepad++
, and use it to create a new file called revcomp.sh
within the bash_scripts
directory. In the next steps, we will write a bash script by adding commands to the file.
2.5. Confirm that you are in the bash_scripts
directory using pwd
and that the revcomp.sh
script is in the directory using ls
.
2.6. Within revcomp.sh
prompt the user for a sequence and store as a variable (seq
) using the commandread -p
:
read -p "Enter a sequence: " seq
The -p
flag allows you to print a message to the terminal at the prompt, in this case "Enter a sequence: "
. Whatever the user enters at the prompt will be saved in the variable seq
.
2.7. Complement the sequence using tr
and store it as a new variable (comp
):
comp=`echo $seq | tr [ACTGactg] [TGACtgac]`
Notice that to capture the output of a command, we need to enclose the command in backticks ( ` ).
2.8. Reverse the sequence using rev
and store it as a new variable (reverse
):
reverse=`echo $seq | rev`
2.9. Reverse complement a sequence using tr
and rev
and store it as a new variable (revcomp
):
revcomp=echo $seq | tr [ACTGactg] [TGACtgac] | rev
2.10. Compute the length of the sequence using wc -m
(the -m
flag modifies wc
so that it only returns the number of characters) and store as a new variable (length
):
length=echo -n $seq | wc -m
2.11. Print to the shell the values of the variables generated in each of the above steps along with a brief description:
echo ""
echo "Original sequence: $seq"
echo "Complement: $comp"
echo "Reverse: $reverse"
echo "Reverse complement: $revcomp"
echo "Length: $length"
echo ""
What does the command echo ""
print to the terminal?
2.12. Return to your terminal window and execute the script using bash revcomp.sh
.
2.13. Looping is a technique in computer programming that allows you to repeat (loop) over a task. Edit revcomp.sh
to repeat indefinitely using a while
loop as follows (this is called an infinite loop). The format of a while
loop in bash is:
while some condition is true
do
some block of code to run
done
2.13.a. Add while
before the read command.
while read -p "Enter a sequence: " seq
In this example, we have an infinite loop becasue the condition read -p "Enter a sequence: " seq
will always evaluate to True
. We'll demonstrate how to break out of the loop below.
2.13.b. Insert a new line with the command do
directly after the previous line containing while
:
do
2.13.c. At the end of the script add a new line with the command done
:
done
2.14. Return to your terminal window and run the script again.
Notice that because of the infinite while
loop, the only way to quit the program is to abort the command with ctrl+c
. Let's fix that issue.
2.15. Edit revcomp.sh
to exit if the user just hits return instead of entering a sequence using a conditional statement.
Like loops, conditionals are a common feature of programming languages. In bash, a conditional statement has the following syntax:
if [[ something == something ]] # == means equal to
then
some block of code
fi # closes the conditional
2.15.a. Directy after the read
and do
statements, check if the variable contains an empty string, which is what will occur if the user just hits return, using a conditional statement:
if [[ $seq == "" ]] then echo "" echo "todo listo. adios!" echo "" exit fi
exit
will abort the script.
2.16. Return to your terminal window and execute the script using bash revcomp.sh
. Does the program terminate if you hit return without entering a sequence?
Your final script should look something like this:
while read -p "Enter a sequence: " seq
do
if [[ $seq == "" ]]
then
echo ""
echo "todo listo. adios!"
echo ""
exit
fi
comp=`echo $seq | tr [ACTGactg] [TGACtgac]`
reverse=`echo $seq | rev`
revcomp=`echo $seq | tr [ACTGactg] [TGACtgac] | rev`
length=`echo -n $seq | wc -m`
echo ""
echo "Original sequence: $seq"
echo "Complement: $comp"
echo "Reverse: $reverse"
echo "Reverse complement: $revcomp"
echo "Length: $length"
echo ""
done
Note that bash doesn't care about indentation so it's technically optional.