In this module, we are continuing our SARS-CoV-2 module from last week. This week we will assign Pangolin and NextClade lineages to our sequences. We will also then merge our data with a subset of globally sampled genomes, align these sequences using mafft and then create a phylogeny so that we can identify relationships between isolates.
cd Oman_modules
git pull
cd SARS_CoV-2
ls

Recall the output files from the ARTIC pipeline we ran last module. We are particularly interested in the consensus genome files produced. You can see those files specifically by using the * trick like so :
ls *.consensus.fasta

Then we can combine these files together using cat:
cat *.consensus.fasta > all.consensus.fasta
We can check how many entries are in a fasta file by using grep
!!! It is very important to remember the ">" in this command !!!
This is because > is the redirect output symbol for creating a new file. If we forget to quotes then we will overwrite our file! ">" versus >!!
grep -c ">" all.consensus.fasta
-c = count the number of matches
Here we are searching for matches to >, which is at the start of every fasta entry

Pangolin has become the default tool for assigning lineages for SARS-CoV-2. You can learn more about the actual tool from the website:
https://cov-lineages.org/resources/pangolin.html
pangolin using mamba:mamba create --name pango
conda activate pango
mamba install -c conda-forge -c bioconda -c defaults pangolin
Check that the installation worked:
pangolin -v
pangolin -pv
As new lineages and variants arise, pangolin must be updated regularly. To update run:
pangolin --update

pangolin on our consensus genomes:pangolin all.consensus.fasta

The output file created by pangolin is called lineage_report.csv.
You can view it using less:
less lineage_report.csv
To exit less type q when you are finished reading.
The descriptions of the output can be found here: https://cov-lineages.org/resources/pangolin/output.html
As there are many columns in the lineage_report.csv file it is hard to interpret the file using the less command. We can use a few tricks to help us see the output more clearly:
Recall that we can link commands together using a pipe |
cat lineage_report.csv | cut -d, -f1-3,5,14 | column -t -s ","
cat : prints file to the screencut: command used to select which columns. -d sets the delimiter. In our case it is a ,. -f sets the column numbers we wish to keep. Here we select 1 through 3,5, and 14.column used so that it prints the output nicely aligned. -t means tab separate. -s sets the delimiter. Here again we set this as ","You can try these commands one at a time and see what the output is doing at each step.
Much better!

Of course, you can also read the lineage_report.csv into Excel or the equivalent in Linux, which is LibreOffice Calc.

We have a comma separated file, so set the import options accordingly:

NextClade is another popular tool for lineage calling and QC. In fact, we used the web-based version of NextClade in our first module, if you remember: https://domman-genomics.github.io/Oman_NGS/manuals/01_Intro_to_NGS/module_Intro.html
Here we are going to use the command-line version. You can learn more about the tool here: https://docs.nextstrain.org/projects/nextclade/en/stable/user/nextclade-cli.html
NextClade using mamba:mamba install -c bioconda nextclade
We can install NextClade directly in our pango conda envirionment. These tools play nice with each other!
We can check it installed correctly and learn how to use the tool:
nextclade --help

nextclade dataset get --name 'sars-cov-2' --output-dir 'data/sars-cov-2'
nextclade \
--in-order \
--input-fasta all.consensus.fasta \
--input-dataset data/sars-cov-2 \
--output-tsv output/nextclade.tsv \
--output-tree output/nextclade.auspice.json \
--output-dir output/ \
--output-basename nextclade
The --input-fasta flag is where we specify our input fasta file. We do not need to change the other options.
The output of the command we specified will be placed in a new folder called output, using the --output-dir flag.
I think it is easiest to view this using Excel or equivalent. Open it as we did the pango results.

This is a tab separated file:

There are quote a few output columns. You can see all of the output descriptions from the website: https://docs.nextstrain.org/projects/nextclade/en/stable/user/output-files.html
But here are a few important ones:
| Column name | Meaning |
|---|---|
| seqName | Name of the sequence (as provided in the input file) |
| clade | Assigned clade |
| qc.overallScore | Overall quality control score |
| qc.overallStatus | Overall quality control status |
| totalSubstitutions | Total number of detected nucleotide substitutions |
| totalDeletions | Total number of detected nucleotide deletions |
| totalInsertions | Total number of detected nucleotide insertions |
| totalAminoacidSubstitutions | Total number of detected aminoacid substitutions |
| totalAminoacidDeletions | Total number of detected aminoacid deletions |
| totalMissing | Total number of detected missing nucleotides |
| totalNonACGTNs | Total number of detected ambiguous nucleotides |
| totalPcrPrimerChanges | Total number of nucleotide mutations detected in PCR primer regions |
| substitutions | List of detected nucleotide substitutions |
| deletions | List of detected nucleotide deletion ranges |
| insertions | List of detected inserted nucleotide fragments |
mafftThe next step is to place our sequences within context of other globally representative strains. But before we can create a phylogeny, we have to ensure our sequences are aligned. We will be using the mafft aligner for this.
We will also download a tool that allows you to visualize our sequence and alignment data.
jalviewmamba install -c conda-forge -c bioconda jalview
First lets copy over the representative set from the NextClade data folder we downloaded earlier:
cp data/sars-cov-2/sequences.fasta .
Now we need to add our samples to this set. Let’s use the cat command to merge these two files.
cat all.consensus.fasta sequences.fasta > combined_seqs.fasta
Use grep to count how many sequences we have total in the alignment now:
grep -c '>' combined_seqs.fasta
Now we align the sequences with mafft like so:
mafft --thread 4 combined_seqs.fasta > combined_seqs_ALN.fasta
mafft [options] in_file > out_file
--thread : number of CPUs to use (we have 4 on the VM)
We can look at the alignment using jalview:
jalview -open combined_seqs_ALN.fasta

I find that turning on the color option to “Nucleotide” makes the alignment much easier to view.
There are many different programs used for creating phylogenies:
Today we will use a widely popular one called IQ-Tree. You can find more information about the programs and it’s many options here: http://www.iqtree.org
First we need to install iqtree:
mamba install -c conda-forge -c bioconda iqtree
We can create a phylogeny using this command:
iqtree -s combined_seqs_ALN.fasta -m GTR -T AUTO -B 1000
-s: input sequence alignment
-m : model of evolution. The general time reversible model (GTR) is usually a good choice.
-T : number of CPUs. IQ-Tree can auto find the optimal number with AUTO
-B: number of rapid bootstraps to assess statistical confidence in branch support
There are several outputs from iqtree:
.iqtree : the main report file that is self-readable. You should look at this file to see the computational results.
.treefile : the ML tree in NEWICK format, which can be visualized by FigTree or iTOL
.log: log file of the entire run
We need to install figtree to view the phylogeny.
mamba install -c conda-forge -c bioconda figtree
Once installed lauch figree to view out tree:
figtree combined_seqs_ALN.fasta.treefile
Click OK when the popup menu appears.
We can then re-root the tree like so for better viewing by selecting Midpoint root in the left hand menu:
