Content from Command line tools for quality control


Last updated on 2024-09-08 | Edit this page

Overview

Questions

  • How to perform quality control of NGS raw data?

Objectives

  • Assess short reads FASTQ quality using FASTQE
  • Compare reads before and after quality filtering with fastp

Introduction


During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides.

Modern sequencing technologies can generate a massive number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amount of errors, such as incorrect nucleotides being called. These wrongly called bases are due to the technical limitations of each sequencing platform.

Therefore, it is necessary to understand, identify and exclude error-types that may impact the interpretation of downstream analysis. Sequence quality control is therefore an essential first step in your analysis. Catching errors early saves time later on.

To take a look at sequence quality along all sequences, we can use FASTQE. It is an open-source tool that provides a simple and fun way to quality control raw sequence data and print them as emoji. You can use it to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

We will also use the tool fastp which can be used to filter sequence data to remove low quality reads.

Challenge 1: Is FASTQE installed? What version?

What is the output of this command?

BASH

$ fastqe --version

OUTPUT

fastqe 1.0

Challenge 2: Can you do the same for fastp? Hint: try -v as well as --version

BASH

$ fastp --version # or use -v

Callout

When using the command line code, lines that start with the # sign are typically text descriptors or instructions, not lines of code.

The # character is used to indicate comments. A line starting with # line doesn’t do anything, in this case it’s just telling you what the next line of code does which is navigate to your computer’s desktop - We will stick with this convention throughout the activity - Comments can occur anywhere in a line, anything after the # will be ignored

OUTPUT

$ fastp --version
fastp 0.20.1

You should get the same result from using -v and --version which are known as short and long options respectivlely. Long options will typically (but not always) have a corresonidng short option.

Command line structure


We have mentioned commands and options so far. The syntax is typically as follows:

Structure of a shell command.
The structure of a typical command line entry.

Challenge 3: Run this command:

BASH

$ ls -F /

OUTPUT

dev/  home/  proc/  shared/  tmp/

This is the filesystem of our browser-based portal.

Case Study


Here’s brief background on the in class metagenomics project that we’re collecting sequencing data for. Garter snakes excrete sexually dimorphic pheromones to attract a mate. The hypothesis of our experiment is that male and female garter snakes host unique microbial communities in their mouths, cloacae and musk glands that contribute to sexually dimorphic bioengineering of these pheromone molecules. Figure 1 provides an overview of our 16S metagenomics analysis pipeline. For this lesson though, all you need are the FASTQ files.

Images of experiment and data workflow
Figure 1. Overview of the in class metagenomics project. Using a saline swabbing technique, microbial samples were collected from garter snake tissues in class (A). Swabs were placed in sterile tubes to release collected microbes and DNA was extracted for downstream analysis (B). Barcoded primers were used to PCR amplify the microbial 16S ribosomal DNA repeat genes for each sample followed by Illumina sequencing of PCR amplicons (C-D). The DNA Subway Purple Line web-based software can be used to analyze FASTQ data files generated from Illumina sequencing to reveal the microbial population of our swabs (E). Garter snakes were provided by Dr. Rocky Parker in the JMU Department of Biology (A; yellow shirt).

The data for this experiment is located in the /shared/fastqe/ folder.

Challenge 4: What are the files available from this experiment?

Hint: Use the ls command we have already seen

BASH

ls -F /shared/fastqe/

OUTPUT

female_musk2.fastq.gz  female_oral2.fastq.gz

We have two files, one for each of the snake tissues sampled.

Key Points

  • Options can be long and short
  • Structure of a shell command
  • We have confirmed fastp and fastqe are installed
  • The case study and location of the data

Content from The FASTQ format and FASTQE


Last updated on 2024-09-08 | Edit this page

Introduction


The first step of any Next Generation Sequencing (NGS) analysis pipeline is checking the quality of the raw sequencing reads in each FASTQ formatted file. If the sequence quality is poor, then your resulting downstream analysis will be inaccurate and misleading. FastQC is a popular software used to provide an overview of basic quality metrics for NGS data. In this lesson, you will use an even more universal form of communication to analyze FASTQ files, THE EMOJI.

The tool we will use for this visualisation is fastqe

Challenge 1: Can you run fastqe?

What is the output of this command from your terminal?

BASH

fastqe --help

If you have fastqe installed correctly, you should get usage information:

OUTPUT

$ fastqe --help
Read one or more FASTQ files, compute quality stats for each file, print as emoji... for some reason.😄

🚨 Rust and WebAssembly beta: only command line options with a  ✅  are functional
 rustc 1.75.0-beta.7 emcc 3.1.50

Usage: fastqe [OPTIONS] [FASTQ_FILE]...

Arguments:
  [FASTQ_FILE]...  Input FASTQ files

Options:
      --noheader              Hide the header before sample output ❌
      --output <OUTPUT_FILE>  Write output to OUTPUT_FILE instead of stdout
      --long <READ_BUFFER>    Buffer memory for long reads up to READ_BUFFER bp long ❌ [default: 500]
      --log <LOG_FILE>        Record program progress in LOG_FILE ✅
  -h, --help                  Print help
  -V, --version               Print version

Emoji options:
      --bin                   Use binned scores (🚫 💀 💩 🚨 😄 😆 😎 😍 ) ❌
      --custom <CUSTOM_DICT>  Use a mapping of custom emoji to quality in CUSTOM_DICT (🐍🌴) ❌
      --noemoji               Use mapping without emoji (▁▂▃▄▅▆▇█) ❌

Statistics options:
      --minlen <N>  Minimum length sequence to include in stats ✅ [default: 0]
      --scale       Show relevant scale in output ❌
      --nomean      Hide mean quality per position ❌
      --min         Show minimum quality per position  ❌
      --max         Show maximum quality per position ❌

HTML report options:
      --html         Output all data as html ❌
      --window <W>   Window length to summarise reads in HTML report ❌ [default: 1]
      --html_escape  ❌ Escape html within output, e.g. for Galaxy parsing

For more information, vist https://fastqe.com

FASTQE has a lot of options! Note that this version does not implement all of them, however we will be using it in the default mode.

The FASTQ format


Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding.

Each read, representing a fragment of the library, is encoded by 4 lines:

  1. Always begins with @ followed by the information about the read
  2. The actual nucleic sequence
  3. Always begins with a + and contains sometimes the same info in line 1
  4. Has a string of characters which represent the quality scores associated with each base of the nucleic sequence; must have the same number of characters as line 2

So for example, the first sequence in our file is:

@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(

It means that the fragment named @M00970 corresponds to the DNA sequence GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA and this sequence has been sequenced with a quality GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(.

PHRED scores


THE FASTQ format encodes quality as ASCII values, which are mapped to quality scores in the PHRED format. The PHRED score is calculated from P, the probability of a base-calling error:

\(Q = -10\log{P}\)

These values are linked by the Phred Quality Score, Probability of incorrect base call, and Base call accuracy. We can summarise these:

PHRED	Probability	Accuracy
10	1 in 10		90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10,000	99.99%
50	1 in 100,000	99.999%
60	1 in 1,000,000	99.9999%

FASTQE aids in the interpretation of these by mapping the Phred Quality Score and ASCII code to an Emoji:

PHRED	ASCII	Emoji
0	!	🚫
1	”	❌
2	#	👺
3	$	💔
4	%	🙅
5	&	👾
6	’	👿
7	(	💀
8	)	👻
9	*	🙈
10	+	🙉
11	,	🙊
12	-	🐵
13	.	😿
14	/	😾
15	0	🙀
16	1	💣
17	2	🔥
18	3	😡
19	4	💩

Much easier to understand! Here we show the emoji for Q scores below 20, which represents a 1 in 100 probability of a base error. Given there can be many thousands of reads and bases in even a small sequeuncing sample, the impact of these errors on downstream analysis and conclusions is significant. Steps we will see later in the lesson can be used to remove low quality reads.

Case Study


Let’s look at the output of FASTQE on our case study data:

BASH

$ fastqe /shared/fastqe/female_musk2.fastq.gz

which should give the following:

OUTPUT

$ fastqe /shared/fastqe/female_musk2.fastq.gz
Filename        Statistic       Quality
/shared/fastqe/female_musk2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😉 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😉 😁 😉 😁 😁 😉 😁 😁 😁
😁 😁 😛 😜 😉 😁 😉 😉 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😉 😉 😉 😁
😁 😉 😁 😁 😉 😉 😉 😛 😋 😄 😄 😆 😄 😆 😆 😆 😆 😆 😘 😆 😆 😆 😆 😘 😘 😘 😘 😘
😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😘 😃 😃 😘 😘 😘 😘 😘 😃 😃 😃 😃 😃 😃 😃 😘 😘
😃 😘 😃 😃 😘 😃 😃 😃 😃 😃 😃 😃 😃 😃 😃 😚 😃 😃 😃 😃 😃 😃 😃 😃 😃 😚 😃 😃
😃 😃 😃 😃 😃 😚 😚 😚 😚 😚 😃 😚 😚 😚 😚 😃 😚 😚 😃 😃 😚 😚 😚 😃 😚 😚 😚 😚
😚 😚 😚 😚 😚 😚 😚 😚 😚 😗 😚 😚 😚 😗 😗 😗 😗 😗 😚 😚 😗 😗 😗 😗 😗 😗 😙 😗
😙 😙 😙 😙 😗 😙 😏 😊 😙 😙 😙 😊 😊 😏 😊 😊 😊 😏 😏 😅 😏 😅 😏 😊 😊 😊 😅 😅
😅 😅 😏 😅 😏 😅 😊 😏 😅 😏 😏 😅 😅 😅 😅 😏 😅 😅 😏 😅 😅 😀 😀 😀 😅 😀 😅 😀
😅 😀 🚨 😡

Challenge 5: What about the other sample?

Can you use fastqe on the other tissue sample? Which sample has better overall quality? Why?

You should run fastqe with a different filename:

BASH

$ fastqe /shared/fastqe/female_oral2.fastq.gz 

and the output will be:

OUTPUT

Filename        Statistic       Quality
/shared/fastqe/female_oral2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁
😁 😁 😉 😜 😛 😜 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😁 😁 😉 😁 😁 😉 😉 😉 😉 😉
😉 😉 😉 😁 😉 😉 😉 😉 😛 😄 😘 😃 😃 😃 😚 😚 😚 😚 😚 😚 😗 😗 😚 😗 😗 😗 😗 😗
😙 😗 😗 😗 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙
😙 😙 😙 😙 😙 😙 😊 😙 😙 😙 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊
😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😏 😊 😊 😊 😏 😊 😏 😏 😏 😏 😊 😊 😊 😊 😊 😏 😏
😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😅 😅 😅 😅 😅 😅 😅
😅 😅 😅 😅 😅 😅 😀 😀 😀 😅 😀 😀 🚨 😀 😀 😀 😀 😀 😀 🚨 🚨 💩 🚨 🚨 😀 🚨 💩 💩
💩 🚨 🚨 🚨 🚨 💩 🚨 🚨 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩
💩 😡 💩 😾

The mouth (oral) sample has inferirior quality, on average, towards the end of the reads.

To look at the quality of these files in more detail, we will next use the emoji-less command line tool fastp.

Key Points

  • The options available for fastqe
  • The formula for Q scores
  • FASTQE maps scores to emoji
  • FASTQE can be used to quickly asses the quality of sequencing data

Content from Filtering with fastp


Last updated on 2024-09-08 | Edit this page

Filtering


The quality drop off observed could cause bias in downstream analyses with these potentially incorrectly called nucleotides. Sequences must be treated to reduce bias in downstream analysis. Trimming can help to increase the number of reads the aligner or assembler are able to succesfully use, reducing the number of reads that are unmapped or unassembled. In general, quality treatments include:

  • Trimming/cutting/masking sequences:
    • from low quality score regions
    • beginning/end of sequence
    • removing adapters
  • Filtering of sequences
    • with low mean quality score
    • too short
    • with too many ambiguous (N) bases

fastp is a command line tool that performs many of these steps. Let’s run it with our oral2 sample, and direct output of the new, filtered, data to a new file with -o opoption:

BASH

fastp -i /shared/fastqe/female_oral2.fastq.gz -o female_oral_filtered.fastq.gz

Challenge 6: What is filtered?

How many reads are in this FASTQ file before and after filtering?

After filtering, what is the % of reads with a Q score of 20 or higher?

429 reads are filtered, and 93.7328% of the remaining have a Q score of at leats 20.

OUTPUT


Read1 after filtering:
total reads: 383
total bases: 113368
Q20 bases: 106263(93.7328%)
Q30 bases: 98853(87.1966%)

Filtering result:
reads passed filter: 383
reads failed due to low quality: 428
reads failed due to too many N: 1
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

In addition to command line information, fastp produces an interactive HTML report. We can use our browser platform to view it:

BASH

open fastp.html

Those reports help, but what about our old friend FASTQE? Recall the output on the original file:

OUTPUT

Filename        Statistic       Quality
/shared/fastqe/female_oral2.fastq.gz    mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁
😁 😁 😁 😁 😁 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁
😁 😁 😉 😜 😛 😜 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😁 😁 😉 😁 😁 😉 😉 😉 😉 😉
😉 😉 😉 😁 😉 😉 😉 😉 😛 😄 😘 😃 😃 😃 😚 😚 😚 😚 😚 😚 😗 😗 😚 😗 😗 😗 😗 😗
😙 😗 😗 😗 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙 😙
😙 😙 😙 😙 😙 😙 😊 😙 😙 😙 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊
😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😊 😏 😊 😊 😊 😏 😊 😏 😏 😏 😏 😊 😊 😊 😊 😊 😏 😏
😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😏 😅 😅 😅 😅 😅 😅 😅
😅 😅 😅 😅 😅 😅 😀 😀 😀 😅 😀 😀 🚨 😀 😀 😀 😀 😀 😀 🚨 🚨 💩 🚨 🚨 😀 🚨 💩 💩
💩 🚨 🚨 🚨 🚨 💩 🚨 🚨 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩 💩
💩 😡 💩 😾

Lets now run FASTQE on our filtered data from the oral smaple:

BASH

$ fastqe female_oral_filtered.fastq.gz

We can see easily the effect of the filtering and trimming done by fastp:

BASH

Filename        Statistic       Quality
female_oral_filtered.fastq.gz   mean    😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😁 😁 😁
😁 😁 😉 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😁 😉
😉 😉 😉 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😉 😁 😁 😉 😉 😉 😉 😁 😉 😁 😁
😁 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉
😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😜 😜 😜 😜 😉 😉 😜 😜 😜
😜 😜 😉 😜 😜 😉 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜
😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😛 😜 😜 😜 😜 😜 😜 😜 😜 😛 😛 😜 😜
😛 😛 😛 😛 😛 😛 😛 😛 😛 😛 😝 😛 😛 😛 😛 😛 😛 😝 😝 😝 😝 😝 ☺️ 😝 😝 😝 😝 😝
😝 ☺️ ☺️ 😋 😋 ☺️ ☺️ ☺️ 😋 😄 😋 😋 😋 😋 😄 😋 😄 😄 😘 😄 😄 😋 😋 😘 😘 😆 😄 😆 😆 
😆 😆 😄 😆 😘 😘 😆 😘 😘 😘 😘 😆 😘 😘 😘 😘 😘 😘 😘 😘 😘 😃 😘 😘 😃 😃 😃 😡

Things are looking better!

Conclusion


To sum up, you just looked at Illumina FASTQ data quality using Emoji output. You then filtered the low quality sequences in your FASTQ file and output before and after QC plots. Following filtering, you created a new Emoji output of your filter FASTQ file.

You did all of that on the command line, congrats!

Key Points

  • fastp can be used to filter low quality reads from sequencing data
  • Filtering can improve the quality of data and downstream analysis
  • Emoji can be useful!