This is the first of a collection of assignments provided to students in Great Ideas in Computational Biology in which we will use data and software produced by others to draw conclusions about the SARS-CoV-2 virus that has caused the COVID-19 pandemic. In this sense, we are “research parasites”, a term coined by a now infamous 2016 editorial in the New England Journal of Medicine:
A … concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front- line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”
Just as we started this course with a discussion of how genomes are assembled, we start our study of SARS-CoV-2 by assembling its genome. This task was first completed on January 25, 2020 using a sample collected from the first known human patient .
Many patient DNA or RNA samples taken from swabs are metagenomic, meaning that they capture genetic material not only from the virus but also other viruses, bacteria, and even human cells. This means that researchers either have to apply an algorithm that is able to assemble genomes from information derived from multiple species, or they need to perform additional laboratory work to isolate the viral information. The researchers who assembled the first SARS-CoV-2 genome did the former, wrangling a 30,000 base pair genome out of a file consisting of 8 billion base pairs, most of which do not derive from SARS-CoV-2. If genome assembly is like assembling a puzzle, metagenome assembly is like assembling multiple similar puzzles from a jumbled set of pieces.
To prevent the complications of a metagenomics sample, we will work with a different viral read dataset that was sampled directly from an infected cultured cell.
Sequencing a virus is more straightforward than assembling a bacterium for a couple of reasons. First, the virus’s genome is much shorter. Second, SARS-CoV-2 is an RNA virus, meaning that its genome does not contain DNA, but rather a single (linear) strand of RNA, which is single-stranded; as a result, we do not encounter the complications that we do when assembling double-stranded DNA. Researchers convert RNA to DNA by using an enzyme called reverse transcriptase, a molecular tool that was invented by viruses to help them replicate their genome and now has been borrowed by humans to help us sequence this genome.
In this assignment, we will assemble the SARS-CoV-2 genome, and then we will annotate the contigs resulting from this assembly, meaning that we will compare our genome against a database of known genomes to label putative genes. Knowing the locations of these genes will allow us to study them in future work.
To assemble the genome from the reads, we will be using the SPAdes assembler that is built upon a de Bruijn graph approach and is the most cited genome assembler of all time. To run SPAdes, we will use the Australian service of Galaxy, an open source project that allows us to run often testy bioinformatics software in the cloud without the hassle of dealing with local installations. Please follow these step-by-step instructions to register on Galaxy and run SPAdes:
First, create an account on Galaxy at https://usegalaxy.org.au/login. You don’t need to use your university email address, because we are only grading the results of your analysis. Your “public name” can be whatever you like, but you should fill out all fields. After creating an account, log in to Galaxy.
Click on the plus ‘+’ on the top right of the page and create a new “history” (i.e., project that will contain data and the results of software runs) named “SARS-CoV-2”. All of the following analysis will be performed under this history.
The National Center for Biotechnology Information (NCBI) hosts an enormous amount of publicly available biological data. You may be interested in their website for SARS-CoV-2 (https://www.ncbi.nlm.nih.gov/sars-cov-2/), which contains sequencing data from hundreds of thousands of sequencing “runs”. These runs are stored in an arm of NCBI called the Sequence Read Archive (SRA), where researchers can upload sequencing reads from their experiments.
Our dataset of interest has the SRA identifier SRR11528307, and its homepage can be found here: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11528307. We can see at this page that the dataset contains approximately 719,800 reads (“spots”), that the technology is “Illumina”, and that the layout is “paired”. These are short paired-end reads produced by Illumina’s popular sequencing by synthesis approach that we learned about.
Exercise 1: How many nucleotide bases are found in this file? What is the GC-content of the reads? (That is, the percentage of reads that are either G or C.)
Exercise 2: We know that the original SARS-CoV genome from the 2003 outbreak has approximately 30,000 nucleotides. If the SARS-CoV-2 genome has length 30,000 nucleotides, then what is the coverage of this read dataset?
Clicking on the “Reads”
tab will allow us to see ten read-pairs at a time. Each read is in FASTA format, meaning that the read has a header beginning with the “>” symbol identifying the read, followed by the read itself.
Exercise 3: What is the read-pair at spot #15213? Why do you think that one read is shorter than the other?
Now that we know where our data is contained, let’s import it into Galaxy. To do so, go back to Galaxy and on the left side of the page, under "Get Data
“, click “Download and Extract Reads in FASTA/Q Format from NCBI SRA”
. Make sure that “select input type”
shows “SRR accession”
, and type our accession ID (SRR11528307
) into the field below. You may want to click “Yes”
for “Email notification”, although the job should not take more than five minutes to run. Then click “Execute”
. You should see the job added to your history on the right; it will turn green when it is finished.
When the job turns green, click on the “eye” symbol next to SRR11528307
under your history on the right side of the page. The page may take a few seconds to load, and Galaxy will show you the first megabyte of your read file.
This file is in FASTQ format, an extension of FASTA format in which each read is represented over four lines described as follows.
“@”
symbol and labeling the read. Notice that because these are read pairs, the first two reads’ headers are the same and end with “/1”
and “/2”
.“+”
symbol to indicate the end of nucleotides. This line may havemore header information, but doesn’t for this dataset.A given base returned as the result of a sequencing run is assigned a quality score Q based on an estimate of how likely this base is correct. Because many bases are very reliable, this computation is done on a logarithmic scale. In particular, once we compute the probability p that a base is incorrect, we set Q = -10 log10(p) and assign the base the ASCII symbol corresponding to Q + 33. For example, if p is 0.001, then log10(p) is -3, and so Q is equal to 30, and we assign this base the ASCII symbol corresponding to 30 + 33, which is “?”
. As a result, the lower the value of p, the higher the quality score Q. The Phred quality score table is shown in the figure below.
Phred scores are helpful for a variety of reasons. For example, researchers will throw out reads having a significant number of bases that do not meet a certain Phred score threshold, especially if the reads have high coverage.
Exercise 4: How would Phred scores be useful when applying the de Bruijn graph approach to genome assembly?
Exercise 5: In class, we learned about the sequencing by synthesis approached used by Illumina for sequencing reads. What do you think that Illumina considers to determine the quality score of a given base?
Exercise 6: Head back to the SRA page for our dataset. In the
“Metadata”
tab, there is small text that says“Quality graph”
. You may need to click“bigger”
to see this graph, which is a histogram of Phred quality scores over the entire dataset. Interpret this chart. Are the quality scores good? What makes you think so?
Exercise 7: Return to Galaxy and view your
SRR11528307
dataset. What are the quality scores of read“8595/1”
? Are they good? Are there any nucleotides you are worried about?
Now that we understand our data a bit better, we are ready to assemble the SARS-CoV-2 genome from our read set using SPAdes. Under Genomics Analysis → Assembly
on the left side of the page, you will see over a dozen different assemblers. Click on SPAdes
, which will take you to its homepage.
We will go through each of the settings offered and briefly explain them as well as what we are going to choose.
“No”
. This setting is used for bacterial projects when we have a bacterium that we cannot culture, and we only have DNA captured from a single cell.“No”
. We want SPAdes to try and “error correct” reads, meaning that it looks for potentially troubling bases and correcting reads. For example, if a read matches another read exactly except for a single base with a very low quality score, we may want to error correct this read before we build a de Bruijn graph.“Yes”
. The BWA tool is based on a Burrows-Wheeler transform approach to read mapping that we learn about later in the course.“Yes”
. SPAdes works by building de Bruijngraphs for several values of k. The default values are k = 21, 33, and 55. We have no a priori information that would lead us to change SPAdes’ default behavior, so we will let its “machine learning” choose the best values to use.“No”
as these are Illumina reads. ionTORRENT reads are produced by a different company (Oxford Nanopore) and are longer and typically less accurate.“Paired end/single reads”
. Remember that we know these are paired-end reads.“fr”
. Remember that the convention is to always read DNA in the 5’ to 3’ direction. In forward-reverse read pairs, the first read is taken from one strand, and its pair is taken from the opposing strand, as the following figure illustrates.Select file format. Set to “interleaved files”
. As you know, the read-pairs are found as consecutive reads in our fastq file. This is called “interleaved”
as opposed to having the forward reads in one file and the reverse reads in another.
SRR11528307
dataset for this field.“Yes”
as we would love to know the finalassembly graph produced by SPAdes.“No”
. Scaffolds are formed by trying tojoin contigs into longer contiguous sequences. We will only view the graph of thecontigs.“Yes”
so that you can leave this running andget notified when it’s done. You are now ready to go! Click “Execute”
.SPAdes should take 15-20 minutes to finish.
Now that SPAdes has completed, we will analyze the results. As the previous exercise indicates, our work appears to have been quite successful.
Exercise 8: Click on the eye symbol (“view data”) next to the “contigs (fasta)” line on the right side. How many contigs did our assembly produce? What are their lengths? What do you think “coverage” means in this context?
We also produced an “assembly graph” as the result of our work, which is a version of the de Bruijn graph (after some cleaning steps involving bubble removal) in which maximal nonbranching paths have been compressed to a single edge. If you “view data” for the assembly graph, you will see that this graph has four edges.
Let’s visualize this assembly graph! To do so, we use a program called Bandage. Under “Assembly” on the left side of the page, click on “Bandage Image”. Under “Graphical Fragment Assembly” select the assembly graph that you just produced. Feel free to keep all other parameters the same (possibly opting for an email notification), and click “Execute”. The program should only take a couple of seconds to run, after which clicking “view data” on the run will show the graph.
Exercise 9: Show the image produced of the assembly graph. Where do you think the contigs of our assembly are hiding? Where do you think contig 2 exists in the genome?
Because we have a single long contig, and the virus genomes are ultimately short, let’s align it against the original SARS-CoV genome from the 2003 outbreak to see how the two coronavirus genomes differ.
Exercise 10: We will align the two genomes as DNA strings rather than translating them to protein strings. Why do you think this is?
To align the SARS-CoV-2 genome against the original SARS-CoV genome, we will use the (affine) global alignment algorithm that we learned about in class, provided at NCBI at https://bit.ly/2NtCsWX under the accession ID NC_004718.3
. To import this genome, we will use the “NCBI Accession Download”
tool under “Get Data”
on the left side of the Galaxy page. After clicking on this tool, follow these steps.
“Select source for IDs”
, select “Direct Entry”
.NC_004718.3
.“Execute”
.The tool should be very quick.
Now we are ready to align the two genomes. Under the “EMBOSS” section on the left side of the page, click “needle”, which is short for “Needleman-Wunsch global alignment”, the algorithm we learned about in class.
For sequence 1, select the contigs that we produced as the result of SPAdes. For sequence 2, select NC_004718.3
. Keep the other parameters default (you may like to play around with these parameters later to see how they affect the outcome). After opting for an Email notification, click “Execute”
. The algorithm should take a few minutes to run.
When the algorithm has finished, view its data. This will show the alignment, in which matches are shown with “|”
symbols, mismatches are shown with “.”
symbols, and indels do not have a symbol between the two rows.
Exercise 11: How many matched symbols does the alignment report? How many gap symbols?
Exercise 12: What do you notice about the ends of the alignment? What do you think happened when the genome was assembled?
Exercise 13: Scroll through the interior of the alignment. Do you notice any regions that appear to be more variable than other regions?=
Finally, we will annotate the SARS-CoV-2 genome, meaning that we will identify putative genes and then compare these genes against a database of known genes to find which ones align the best.
The identification of putative genes can be done in a variety of ways. One way to find these regions requires the observation that regions of DNA that are translated into protein start with a start codon (ATG
) and end with a stop codon (TAG
, TAA
, or TGA
). If a genome were random, then after encountering a start codon, we would expect to see a stop codon after about 64/3 ~ 21 codons. But real genes are typically much, much longer than 21 codons. As a result, in simple organisms like viruses and bacteria, a great way to find genes is to look for long stretches of codons connecting a start codon to a stop codon, with no intermediate stop codons.
Once we have found putative genes along the genome, we can compare them against a database of genes using an algorithm called BLAST that we will learn about soon.
These two steps are taken by our next tool, called Prokka, which is used to annotate the genomes of viruses, bacteria, and archaea. It can be found under “Annotation”
on Galaxy. Under “Contigs to annotate”
, select our contigs from the SPAdes run. Set “Kingdom”
to “Viruses”
and leave all other parameters to default. Note that our shorter contig will not be included in the annotation because Prokka only annotates contigs of length at least 200. Select whether you would like an Email on completion, and click “Execute”
.
The entire process of annotating our genome should not take more than a few seconds. Isn’t bioinformatics great?
Prokka produces a collection of output files, which you might like to view. The .fna
file contains the lone contig that survived. The .gff
file contains regions identified by Prokka as genes. The .ffn
file contains nucleotide sequences of the annotated genes, and the .faa
file contains their translated amino acid sequences.
Exercise 14: First, view the
.gff
file containing regions identified by Prokka as putative genes. How many are there? What is the longest and shortest one?
Prokka contains the annotation information in these files, but what a biologist would really like is a visualization of its genes and what they have been predicted to be. So to visualize our annotation contained in the .gff
file, we will use a genome browser tool called “JBrowse”
that is found in the “Graph/Display”
section on the left side of the page.
When running JBrowse, take the following steps.
“Use a genome from history”
and choose the .fna
file.“Insert Track Group”
.“Insert Annotation Track”
..gff
file from the Prokka output.“Email notification”
, choose “Yes”
if you like.“Execute”
.JBrowse should be very quick. When it is finished, click view file
, which is an HTML file that we can view in the browser. (It may take a moment to load.)
It seems like nothing is there, but on the left side of the page, click on “Prokka on data XXX: gff”
to show our beautiful annotation of the SARS-CoV-2 genome. Zoom out to see it in all its glory. All the arrows point in the same direction to indicate that the genes are all found on the same strand of the genome. This makes sense because SARS-CoV-2 is an RNA virus, meaning that its genome has only one strand. (It would have been a very bad sign if some genes pointed in the opposite direction.)
You can click on the genes as you like to obtain information about their length and what in the database they were found to be similar to (if anything). For example, if you click on the first protein (Replicase polyprotein 1a), you will find that it was found to be similar to the protein with Uniprot ID (https://www.uniprot.org/uniprot/P0C6U8), which is the same gene in SARS-CoV. Note that the annotation score of this protein is high because there is so much experimental evidence backing up this protein. This means that if we find a putative gene that is a hit against it, we can have a high degree of certainty that our gene serves a similar function.
Exercise 15: Why do you think that two of the genes are labeled “hypothetical proteins”?
It may amaze you that such a tiny thing can wreak so much havoc. But now that we know this annotation, we can start investigating the virus’s individual genes. Which one should we focus on? How has the gene mutated as the virus spread around the world? And how do these mutations affect the function of the protein? We save these topics for the subject of future study.