Command line tools for quality control

Last updated on 2024-09-08 | Edit this page

Estimated time: 7 minutes

Overview

Questions

How to perform quality control of NGS raw data?

Objectives

Assess short reads FASTQ quality using FASTQE
Compare reads before and after quality filtering with fastp

Introduction

During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amount of errors, such as incorrect nucleotides being called. These wrongly called bases are due to the technical limitations of each sequencing platform.

Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis. Sequence quality control is therefore an essential first step in your analysis. Catching errors early saves time later on.

To take a look at sequence quality along all sequences, we can use FASTQE. It is an open-source tool that provides a simple and fun way to quality control raw sequence data and print them as emoji. You can use it to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

We will also use the tool fastp which can be used to filter sequence data to remove low quality reads.

Instructor Note

Inline instructor notes can help inform instructors of timing challenges associated with the lessons. They appear in the “Instructor View”

Challenge 1: Is FASTQE installed? What version?

What is the output of this command?

BASH

$ fastqe --version

Output

OUTPUT

fastqe 1.0

Challenge 2: Can you do the same for `fastp`? Hint: try `-v` as well as `--version`

BASH

$ fastp --version # or use -v

Callout

When using the command line code, lines that start with the # sign are typically text descriptors or instructions, not lines of code.

The # character is used to indicate comments. A line starting with # line doesn’t do anything, in this case it’s just telling you what the next line of code does which is navigate to your computer’s desktop - We will stick with this convention throughout the activity - Comments can occur anywhere in a line, anything after the # will be ignored

Show me the solution

OUTPUT

$ fastp --version
fastp 0.20.1

You should get the same result from using -v and --version which are known as short and long options respectivlely. Long options will typically (but not always) have a corresonidng short option.

Command line structure

We have mentioned commands and options so far. The syntax is typically as follows:

Challenge 3: Run this command:

BASH

$ ls -F /

Output

OUTPUT

dev/  home/  proc/  shared/  tmp/

This is the filesystem of our browser-based portal.

Case Study

Here’s brief background on the in class metagenomics project that we’re collecting sequencing data for. Garter snakes excrete sexually dimorphic pheromones to attract a mate. The hypothesis of our experiment is that male and female garter snakes host unique microbial communities in their mouths, cloacae and musk glands that contribute to sexually dimorphic bioengineering of these pheromone molecules. Figure 1 provides an overview of our 16S metagenomics analysis pipeline. For this lesson though, all you need are the FASTQ files.

The data for this experiment is located in the /shared/fastqe/ folder.

Challenge 4: What are the files available from this experiment?

Hint: Use the ls command we have already seen

Show me the solution

BASH

ls -F /shared/fastqe/

OUTPUT

female_musk2.fastq.gz  female_oral2.fastq.gz

We have two files, one for each of the snake tissues sampled.

Key Points

Options can be long and short
Structure of a shell command
We have confirmed fastp and fastqe are installed
The case study and location of the data

Command line tools for quality control

Overview

Questions

Objectives

Introduction

Instructor Note

Challenge 1: Is FASTQE installed? What version?

BASH

Output

OUTPUT

Challenge 2: Can you do the same for fastp? Hint: try -v as well as --version

BASH

Callout

Show me the solution

OUTPUT

Command line structure

Challenge 3: Run this command:

BASH

Output

OUTPUT

Case Study

Challenge 4: What are the files available from this experiment?

Show me the solution

BASH

OUTPUT

Key Points

Challenge 2: Can you do the same for `fastp`? Hint: try `-v` as well as `--version`