Quality Control for Next Generation Sequencing: All in One View

Content from Command line tools for quality control

Last updated on 2024-09-08 | Edit this page

Overview

Questions

How to perform quality control of NGS raw data?

Objectives

Assess short reads FASTQ quality using FASTQE
Compare reads before and after quality filtering with fastp

Introduction

During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amount of errors, such as incorrect nucleotides being called. These wrongly called bases are due to the technical limitations of each sequencing platform.

Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis. Sequence quality control is therefore an essential first step in your analysis. Catching errors early saves time later on.

To take a look at sequence quality along all sequences, we can use FASTQE. It is an open-source tool that provides a simple and fun way to quality control raw sequence data and print them as emoji. You can use it to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

We will also use the tool fastp which can be used to filter sequence data to remove low quality reads.

Challenge 1: Is FASTQE installed? What version?

What is the output of this command?

BASH

$ fastqe --version

Output

OUTPUT

fastqe 1.0

Challenge 2: Can you do the same for `fastp`? Hint: try `-v` as well as `--version`

BASH

$ fastp --version # or use -v

Callout

When using the command line code, lines that start with the # sign are typically text descriptors or instructions, not lines of code.

The # character is used to indicate comments. A line starting with # line doesn’t do anything, in this case it’s just telling you what the next line of code does which is navigate to your computer’s desktop - We will stick with this convention throughout the activity - Comments can occur anywhere in a line, anything after the # will be ignored

Show me the solution

OUTPUT

$ fastp --version
fastp 0.20.1

You should get the same result from using -v and --version which are known as short and long options respectivlely. Long options will typically (but not always) have a corresonidng short option.

Command line structure

We have mentioned commands and options so far. The syntax is typically as follows:

Challenge 3: Run this command:

BASH

$ ls -F /

Output

OUTPUT

dev/  home/  proc/  shared/  tmp/

This is the filesystem of our browser-based portal.

Case Study

Here’s brief background on the in class metagenomics project that we’re collecting sequencing data for. Garter snakes excrete sexually dimorphic pheromones to attract a mate. The hypothesis of our experiment is that male and female garter snakes host unique microbial communities in their mouths, cloacae and musk glands that contribute to sexually dimorphic bioengineering of these pheromone molecules. Figure 1 provides an overview of our 16S metagenomics analysis pipeline. For this lesson though, all you need are the FASTQ files.

The data for this experiment is located in the /shared/fastqe/ folder.

Challenge 4: What are the files available from this experiment?

Hint: Use the ls command we have already seen

Show me the solution

BASH

ls -F /shared/fastqe/

OUTPUT

female_musk2.fastq.gz  female_oral2.fastq.gz

We have two files, one for each of the snake tissues sampled.

Key Points

Options can be long and short
Structure of a shell command
We have confirmed fastp and fastqe are installed
The case study and location of the data

Content from The FASTQ format and FASTQE

Last updated on 2024-09-08 | Edit this page

Introduction

The first step of any Next Generation Sequencing (NGS) analysis pipeline is checking the quality of the raw sequencing reads in each FASTQ formatted file. If the sequence quality is poor, then your resulting downstream analysis will be inaccurate and misleading. FastQC is a popular software used to provide an overview of basic quality metrics for NGS data. In this lesson, you will use an even more universal form of communication to analyze FASTQ files, THE EMOJI.

The tool we will use for this visualisation is fastqe

Challenge 1: Can you run `fastqe`?

What is the output of this command from your terminal?

BASH

fastqe --help

Output

If you have fastqe installed correctly, you should get usage information:

OUTPUT

$ fastqe --help
Read one or more FASTQ files, compute quality stats for each file, print as emoji... for some reason.😄

🚨 Rust and WebAssembly beta: only command line options with a  ✅  are functional
 rustc 1.75.0-beta.7 emcc 3.1.50

Usage: fastqe [OPTIONS] [FASTQ_FILE]...

Arguments:
  [FASTQ_FILE]...  Input FASTQ files

Options:
      --noheader              Hide the header before sample output ❌
      --output <OUTPUT_FILE>  Write output to OUTPUT_FILE instead of stdout
      --long <READ_BUFFER>    Buffer memory for long reads up to READ_BUFFER bp long ❌ [default: 500]
      --log <LOG_FILE>        Record program progress in LOG_FILE ✅
  -h, --help                  Print help
  -V, --version               Print version

Emoji options:
      --bin                   Use binned scores (🚫 💀 💩 🚨 😄 😆 😎 😍 ) ❌
      --custom <CUSTOM_DICT>  Use a mapping of custom emoji to quality in CUSTOM_DICT (🐍🌴) ❌
      --noemoji               Use mapping without emoji (▁▂▃▄▅▆▇█) ❌

Statistics options:
      --minlen <N>  Minimum length sequence to include in stats ✅ [default: 0]
      --scale       Show relevant scale in output ❌
      --nomean      Hide mean quality per position ❌
      --min         Show minimum quality per position  ❌
      --max         Show maximum quality per position ❌

HTML report options:
      --html         Output all data as html ❌
      --window <W>   Window length to summarise reads in HTML report ❌ [default: 1]
      --html_escape  ❌ Escape html within output, e.g. for Galaxy parsing

For more information, vist https://fastqe.com

FASTQE has a lot of options! Note that this version does not implement all of them, however we will be using it in the default mode.

The FASTQ format

Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding.

Each read, representing a fragment of the library, is encoded by 4 lines:

Always begins with @ followed by the information about the read
The actual nucleic sequence
Always begins with a + and contains sometimes the same info in line 1
Has a string of characters which represent the quality scores associated with each base of the nucleic sequence; must have the same number of characters as line 2

So for example, the first sequence in our file is:

@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(

It means that the fragment named @M00970 corresponds to the DNA sequence GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA and this sequence has been sequenced with a quality GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(.

PHRED scores

THE FASTQ format encodes quality as ASCII values, which are mapped to quality scores in the PHRED format. The PHRED score is calculated from P, the probability of a base-calling error:

\(Q = -10\log{P}\)

These values are linked by the Phred Quality Score, Probability of incorrect base call, and Base call accuracy. We can summarise these:

PHRED	Probability	Accuracy
10	1 in 10		90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10,000	99.99%
50	1 in 100,000	99.999%
60	1 in 1,000,000	99.9999%

FASTQE aids in the interpretation of these by mapping the Phred Quality Score and ASCII code to an Emoji:

PHRED	ASCII	Emoji
0	!	🚫
1	”	❌
2	#	👺
3	$	💔
4	%	🙅
5	&	👾
6	’	👿
7	(	💀
8	)	👻
9	*	🙈
10	+	🙉
11	,	🙊
12	-	🐵
13	.	😿
14	/	😾
15	0	🙀
16	1	💣
17	2	🔥
18	3	😡
19	4	💩

Much easier to understand! Here we show the emoji for Q scores below 20, which represents a 1 in 100 probability of a base error. Given there can be many thousands of reads and bases in even a small sequeuncing sample, the impact of these errors on downstream analysis and conclusions is significant. Steps we will see later in the lesson can be used to remove low quality reads.

Case Study

Let’s look at the output of FASTQE on our case study data:

BASH

$ fastqe /shared/fastqe/female_musk2.fastq.gz

which should give the following:

OUTPUT

$ fastqe /shared/fastqe/female_musk2.fastq.gz
Filename        Statistic       Quality
/shared/fastqe/female_musk2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😉 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😉 😁 😉 😁 😁 😉 😁 😁 😁
😁 😁 😛 😜 😉 😁 😉 😉 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😉 😉 😉 😁
😁 😉 😁 😁 😉 😉 😉 😛 😋 😄 😄 😆 😄 😆 😆 😆 😆 😆 😘 😆 😆 😆 😆 😘 😘 😘 😘 😘
😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😃 😃 😘 😘 😘 😘 😘 😃 😃 😃 😃 😃 😃 😃 😘 😘
😃 😘 😃 😃 😘 😃 😃 😃 😃 😃 😃 😃 😃 😃 😃 😚 😃 😃 😃 😃 😃 😃 😃 😃 😃 😚 😃 😃
😃 😃 😃 😃 😃 😚 😚 😚 😚 😚 😃 😚 😚 😚 😚 😃 😚 😚 😃 😃 😚 😚 😚 😃 😚 😚 😚 😚
😚 😚 😚 😚 😚 😚 😚 😚 😚 😗 😚 😚 😚 😗 😗 😗 😗 😗 😚 😚 😗 😗 😗 😗 😗 😗 😙 😗
😙 😙 😙 😙 😗 😙 😏 😊 😙 😙 😙 😊 😊 😏 😊 😊 😊 😏 😏 😅 😏 😅 😏 😊 😊 😊 😅 😅
😅 😅 😏 😅 😏 😅 😊 😏 😅 😏 😏 😅 😅 😅 😅 😏 😅 😅 😏 😅 😅 😀 😀 😀 😅 😀 😅 😀
😅 😀 🚨 😡

Challenge 5: What about the other sample?

Can you use fastqe on the other tissue sample? Which sample has better overall quality? Why?

Show me the solution

You should run fastqe with a different filename:

BASH

$ fastqe /shared/fastqe/female_oral2.fastq.gz

and the output will be:

OUTPUT

Filename        Statistic       Quality
/shared/fastqe/female_oral2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁
😁 😁 😉 😜 😛 😜 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😁 😁 😉 😁 😁 😉 😉 😉 😉 😉
😉 😉 😉 😁 😉 😉 😉 😉 😛 😄 😘 😃 😃 😃 😚 😚 😚 😚 😚 😚 😗 😗 😚 😗 😗 😗 😗 😗
😙 😗 😗 😗 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙
😙 😙 😙 😙 😙 😙 😊 😙 😙 😙 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊
😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😏 😊 😊 😊 😏 😊 😏 😏 😏 😏 😊 😊 😊 😊 😊 😏 😏
😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😅 😅 😅 😅 😅 😅 😅
😅 😅 😅 😅 😅 😅 😀 😀 😀 😅 😀 😀 🚨 😀 😀 😀 😀 😀 😀 🚨 🚨 💩 🚨 🚨 😀 🚨 💩 💩
💩 🚨 🚨 🚨 🚨 💩 🚨 🚨 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩
💩 😡 💩 😾

The mouth (oral) sample has inferirior quality, on average, towards the end of the reads.

To look at the quality of these files in more detail, we will next use the emoji-less command line tool fastp.

Key Points

The options available for fastqe
The formula for Q scores
FASTQE maps scores to emoji
FASTQE can be used to quickly asses the quality of sequencing data

Content from Filtering with fastp

Last updated on 2024-09-08 | Edit this page

Filtering

The quality drop off observed could cause bias in downstream analyses with these potentially incorrectly called nucleotides. Sequences must be treated to reduce bias in downstream analysis. Trimming can help to increase the number of reads the aligner or assembler are able to succesfully use, reducing the number of reads that are unmapped or unassembled. In general, quality treatments include:

Trimming/cutting/masking sequences:
- from low quality score regions
- beginning/end of sequence
- removing adapters
Filtering of sequences
- with low mean quality score
- too short
- with too many ambiguous (N) bases

fastp is a command line tool that performs many of these steps. Let’s run it with our oral2 sample, and direct output of the new, filtered, data to a new file with -o opoption:

BASH

fastp -i /shared/fastqe/female_oral2.fastq.gz -o female_oral_filtered.fastq.gz

Challenge 6: What is filtered?

How many reads are in this FASTQ file before and after filtering?

After filtering, what is the % of reads with a Q score of 20 or higher?

Show me the solution

429 reads are filtered, and 93.7328% of the remaining have a Q score of at leats 20.

OUTPUT


Read1 after filtering:
total reads: 383
total bases: 113368
Q20 bases: 106263(93.7328%)
Q30 bases: 98853(87.1966%)

Filtering result:
reads passed filter: 383
reads failed due to low quality: 428
reads failed due to too many N: 1
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

In addition to command line information, fastp produces an interactive HTML report. We can use our browser platform to view it:

BASH

open fastp.html

Those reports help, but what about our old friend FASTQE? Recall the output on the original file:

OUTPUT

Filename        Statistic       Quality
/shared/fastqe/female_oral2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁
😁 😁 😉 😜 😛 😜 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😁 😁 😉 😁 😁 😉 😉 😉 😉 😉
😉 😉 😉 😁 😉 😉 😉 😉 😛 😄 😘 😃 😃 😃 😚 😚 😚 😚 😚 😚 😗 😗 😚 😗 😗 😗 😗 😗
😙 😗 😗 😗 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙
😙 😙 😙 😙 😙 😙 😊 😙 😙 😙 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊
😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😏 😊 😊 😊 😏 😊 😏 😏 😏 😏 😊 😊 😊 😊 😊 😏 😏
😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😅 😅 😅 😅 😅 😅 😅
😅 😅 😅 😅 😅 😅 😀 😀 😀 😅 😀 😀 🚨 😀 😀 😀 😀 😀 😀 🚨 🚨 💩 🚨 🚨 😀 🚨 💩 💩
💩 🚨 🚨 🚨 🚨 💩 🚨 🚨 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩
💩 😡 💩 😾

Lets now run FASTQE on our filtered data from the oral smaple:

BASH

$ fastqe female_oral_filtered.fastq.gz

We can see easily the effect of the filtering and trimming done by fastp:

BASH

Filename        Statistic       Quality
female_oral_filtered.fastq.gz   mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁
😁 😁 😉 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😉
😉 😉 😉 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😉 😁 😁 😉 😉 😉 😉 😁 😉 😁 😁
😁 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉
😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😜 😜 😜 😜 😉 😉 😜 😜 😜
😜 😜 😉 😜 😜 😉 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜
😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😛 😜 😜 😜 😜 😜 😜 😜 😜 😛 😛 😜 😜
😛 😛 😛 😛 😛 😛 😛 😛 😛 😛 😝 😛 😛 😛 😛 😛 😛 😝 😝 😝 😝 😝 ☺️ 😝 😝 😝 😝 😝
😝 ☺️ ☺️ 😋 😋 ☺️ ☺️ ☺️ 😋 😄 😋 😋 😋 😋 😄 😋 😄 😄 😘 😄 😄 😋 😋 😘 😘 😆 😄 😆 😆 
😆 😆 😄 😆 😘 😘 😆 😘 😘 😘 😘 😆 😘 😘 😘 😘 😘 😘 😘 😘 😘 😃 😘 😘 😃 😃 😃 😡

Things are looking better!

Conclusion

To sum up, you just looked at Illumina FASTQ data quality using Emoji output. You then filtered the low quality sequences in your FASTQ file and output before and after QC plots. Following filtering, you created a new Emoji output of your filter FASTQ file.

You did all of that on the command line, congrats!

Key Points

fastp can be used to filter low quality reads from sequencing data
Filtering can improve the quality of data and downstream analysis
Emoji can be useful!

Overview

Questions

Objectives

Introduction

Challenge 1: Is FASTQE installed? What version?

BASH

Output

OUTPUT

Challenge 2: Can you do the same for fastp? Hint: try -v as well as --version

BASH

Callout

Show me the solution

OUTPUT

Command line structure

Challenge 3: Run this command:

BASH

Output

OUTPUT

Case Study

Challenge 4: What are the files available from this experiment?

Show me the solution

BASH

OUTPUT

Key Points

Introduction

Challenge 1: Can you run fastqe?

BASH

Output

OUTPUT

The FASTQ format

PHRED scores

Case Study

BASH

OUTPUT

Challenge 5: What about the other sample?

Show me the solution

BASH

OUTPUT

Key Points

Filtering

BASH

Challenge 6: What is filtered?

Show me the solution

OUTPUT

BASH

OUTPUT

BASH

BASH

Conclusion

Key Points

Challenge 2: Can you do the same for `fastp`? Hint: try `-v` as well as `--version`

Challenge 1: Can you run `fastqe`?