The goal of this exercise is look at basic QC data from Illumina sequencing reads and to use a trimming tool to remove adapter sequences and poor quality bases. Our dataset are two M. tuberculosis samples that have been sequenced via Illumina paired-end. We will be using this data for this module and the next to focus on the basics of data QC and for mapping reads to a reference genome.
fastqc on Illumina readsmultiqc to combine and visualize multiple fastqc reportsconda environmenttrim_galore to trim Illimina fastq dataFirst move into the TB dataset folder:
cd TB_module
If you list the contents of this directory ll it should appear like the following:

We are working with two samples that have been sequenced with Illumina via paired-end sequencing. Therefore, each sample has a forward and reverse read. The forward read is usually denoted by SAMPLE-NAME_1.fastq.gz and the reverse is usually denoted by SAMPLE-NAME_2.fastq.gz
The program fastqc is widely used to assess the quality metrics of Illumina sequencing data. Here we will be running fastqc on both of the TB samples before and after we perform read and adapter trimming.
As with nearly all programs you can get see how to run the program by either just typing the program name, in this case fastqc or typing fastqc -h.
fastqc in default mode simply expects fastq files after the program name like so : fastqc SAMPLE_1.fastq.gz. We have to run the program for each the forward and the reverse reads per sample – so a total of 4 files. Thankfully we can list all the files at one time.
Run fastqc on each of the files:
fastqc TBsample1_1.fastq.gz TBsample1_2.fastq.gz TBsample2_1.fastq.gz TBsample2_2.fastq.gz

When this has finished, we are now going to use a related program called multiqc to combine and visualize the output of fastqc.
Run multiqc here and don’t forget the period at the end of this command! The period simply designates run “here” in the current folder
multiqc .

Type ls to see all the new files we have generated

| File type | Description |
|---|---|
| .fastq.gz | original fastq data |
| _fastqc.zip | data folder for output of fastqc compressed as zip file |
| _fastqc.html | fastqc results per file that can be viewed in web browser |
| multiqc_data | multiqc combines the data across all fastqc and stores them in this folder |
| multiqc_report.html | results of multiqc that can be viewed in web browser |
The main output for us to view from multiqc is the multiqc_report.html file. We can either open this file by using the file browser and clicking on the html file, or we can open a web browser from the command line.
To view results type:
firefox multiqc_report.html
This will open firefox and load the file into an interactive session.

There are many different statistics and QC analysis that is output by fastqc / multiqc. Here we are going to highlight the most important ones, but I encourage you to look over all of them and use the nice built-in descriptions to learn more.
The first is the nice General Statistics table at the very start. This is s nice summary per file (remember it is a separate file for both forward and reverse reads per sample)

This is THE most important plot. This plot shows you the average quality score per base across summarized across the read length.
Sequences with good quality scores across the read are colored green.
Sequences with poor quality scores across the read are colored red.

A related plot is the per sequence quality scores histogram which gives the total number of reads that have a given mean quality score.

Another important plot is to look at the mean GC content of the reads. This plot often gives us an indication of whether or not we have contamination. If you are sequencing a single organism in your sample(s), the GC content should follow a normal ‘bell’ distribution. If there happened to be contaminating DNA from another organism it will likely have a slightly different mean GC and this shows up a ‘bump’ in the distribution.

There are other useful plots and analyses that are in the fastQC / multiQC report. I encourage you to look at these and explore the data!
Control+c will allow you to close firefox at the terminal when done viewing the report
↥ Back to top ***
Next, we will be using the tool trim_galore to trim (ie remove) poor quality data and contaminating adapter sequences. trim_galore takes about 10 minutes to finish per dataset.
Trim Galore is a tool that runs the program cutadapt and fastQC. You can find more information about these tools here:
Trim Galore : https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.mdcutadapt: https://cutadapt.readthedocs.io/en/stable/conda to install and download tools First we will use conda to install the trim_galore program. conda is a package management tool that allows one to create different “environments” and takes the pain out of downloading software. I have already installed conda but you can find the details how I did this here : https://docs.conda.io/en/latest/miniconda.html
Please run:
conda create --yes -n qc -c bioconda -c conda-forge python=3.8 trim-galore fastqc multiqc
This creates a new environment called ‘qc’ and installs the programs trim_galore , fastqc, multiqc and python3.8
You activate the new environment with the command:
conda activate qc

Now we are able to use trim_galore to process the fastq file for each of the TB samples.
Run trim_galore on Sample 1
trim_galore --cores 4 --paired --fastqc TBsample1_1.fastq.gz TBsample1_2.fastq.gz
This command has several options that we used:
--cores 4 : use 4 CPUs (the max on our VMs) so that it runs faster
--paired : let’s the program know we have paired-end data
--fastqc : automatically runs fastQC when finished
!!! This step is going to take around 10 minutes to finish. While waiting please go to Download Reference Genome section.
When finished trim_galore produces the following files per fastq file:
| File type | Description |
|---|---|
_val_1.fq.gz or _val_2.fq.gz |
trimmed fastq file |
| _trimming_report.txt | summary output of trim_galore command |
| _val_1_fastqc.zip | data folder for output of fastqc compressed as zip file |
| _val_1_fastqc.html | fastqc results from trimmed fastq that can be viewed in web browser |
trim_galore for the second sample:trim_galore --cores 4 --paired --fastqc TBsample2_1.fastq.gz TBsample2_2.fastq.gz
This will take around 10 minutes to complete.
Use this time to look over the fastQC html files or the multiQC html files we have already generated and see if there are any questions.
Once the command above has finished use ls to see all the files we have generated:

At this point we should have fastQC data for our original fastq files as well as the trimmed reads.
multiqc -f .
-f : overwrites the previous multiqc report
firefox multiqc_report.html
Remember Control+c will allow you to close firefox at the terminal when done viewing
Look through the new multiQC report to see the difference trimming has made (particularly for TBSample2 data)

The reference genome Mycobacterium tuberculosis H37Rv can be found here : https://www.ncbi.nlm.nih.gov/nuccore/AL123456.3/
Follow the steps below for downloading the GenBank genome file:





We will be using this reference genome file for our upcoming mapping module.