The FASTQ format and FASTQE
Last updated on 2024-09-08
Introduction
The first step of any Next Generation Sequencing (NGS) analysis pipeline is checking the quality of the raw sequencing reads in each FASTQ-formatted file. If the sequence quality is poor, the downstream analysis will be inaccurate and misleading. FastQC is a popular tool that provides an overview of basic quality metrics for NGS data. In this lesson, you will use an even more universal form of communication to analyze FASTQ files: THE EMOJI.
The tool we will use for this visualisation is fastqe.
If fastqe is installed correctly, running it with the --help option should print usage information:
OUTPUT
$ fastqe --help
Read one or more FASTQ files, compute quality stats for each file, print as emoji... for some reason.😄
🚨 Rust and WebAssembly beta: only command line options with a ✔ are functional
rustc 1.75.0-beta.7 emcc 3.1.50
Usage: fastqe [OPTIONS] [FASTQ_FILE]...
Arguments:
  [FASTQ_FILE]... Input FASTQ files
Options:
  --noheader Hide the header before sample output ✔
  --output <OUTPUT_FILE> Write output to OUTPUT_FILE instead of stdout
  --long <READ_BUFFER> Buffer memory for long reads up to READ_BUFFER bp long ✔ [default: 500]
  --log <LOG_FILE> Record program progress in LOG_FILE ✔
  -h, --help Print help
  -V, --version Print version
Emoji options:
  --bin Use binned scores (🚫 💀 💩 🚨 😄 😆 😎 😍) ✔
  --custom <CUSTOM_DICT> Use a mapping of custom emoji to quality in CUSTOM_DICT (🐍🌴) ✔
  --noemoji Use mapping without emoji (▁▂▃▄▅▆▇█) ✔
Statistics options:
  --minlen <N> Minimum length sequence to include in stats ✔ [default: 0]
  --scale Show relevant scale in output ✔
  --nomean Hide mean quality per position ✔
  --min Show minimum quality per position ✔
  --max Show maximum quality per position ✔
HTML report options:
  --html Output all data as html ✔
  --window <W> Window length to summarise reads in HTML report ✔ [default: 1]
  --html_escape ✔ Escape html within output, e.g. for Galaxy parsing
For more information, vist https://fastqe.com
FASTQE has a lot of options! Note that this version does not implement all of them; however, we will be using it in its default mode.
The FASTQ format
Although it looks complicated (and maybe it is), the FASTQ format is easy to understand with a little decoding.
Each read, representing a fragment of the library, is encoded by 4 lines:
- Line 1 always begins with @, followed by information about the read
- Line 2 is the actual nucleotide sequence
- Line 3 always begins with a + and sometimes repeats the information from line 1
- Line 4 is a string of characters representing the quality score of each base in the sequence; it must have the same number of characters as line 2
So for example, the first sequence in our file is:
@M00970:337:000000000-BR5KF:1:1102:17745:1557 1:N:0:CGCAGAAC+ACAGAGTT
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
+
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(
This means that the read named M00970:337:000000000-BR5KF:1:1102:17745:1557 (the text after the @, up to the first space) corresponds to the DNA sequence
GTGCCAGCCGCCGCGGTAGTCCGACGTGGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGAAGAACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAAAGAAGCAAATGACGATTCAAGAAAGAAAAAAACACAGAATACTAACAATAAGTCATAAACATCATCAACATAAAAAAGGAAATACACTTACAACACATATCAATATCTAAAATAAATGATCAGCACACAACATGACGATTACCACACATGTGTACTACAAGTCAACTA
and each base of this sequence was called with the quality encoded by the corresponding character of
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGGAFFGGFGGGGGGGGFGGGGGGGGGGGGGGFGGG+38+35*311*6,,31=******441+++0+0++0+*1*2++2++0*+*2*02*/***1*+++0+0++38++00++++++++++0+0+2++*+*+*+*+*****+0**+0**+***+)*.***1**//*)***)/)*)))*)))*),)0(((-((((-.(4(,,))).,(())))))).)))))))-))-(
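The four-line layout can be checked mechanically. The following is a minimal Python sketch, not part of fastqe; the `FastqRecord` type, the `parse_record` function, and the toy record are all made up for illustration.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    """One read: its name, its bases, and one quality character per base."""
    name: str
    sequence: str
    quality: str


def parse_record(lines):
    """Turn the four lines of a single FASTQ record into a FastqRecord."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("lines 2 and 4 must have the same length")
    # Drop the leading '@' so only the read information remains.
    return FastqRecord(header[1:], sequence, quality)


# Hypothetical toy record, much shorter than a real read.
toy_lines = ["@read1 example", "GATTACA", "+", "GGGFF+*"]
record = parse_record(toy_lines)
print(record.name, record.sequence, record.quality)
```

A record whose quality string is shorter or longer than its sequence is rejected, which mirrors the rule that line 4 must have the same number of characters as line 2.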
PHRED scores
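The FASTQ format encodes the quality of each base as a single ASCII character, which maps to a quality score in the PHRED format: the character's ASCII code minus 33 gives the score, so ! (ASCII 33) means a score of 0. The PHRED score Q is calculated from P, the probability of a base-calling error: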
\(Q = -10\log_{10}{P}\)
This formula links the PHRED quality score to the probability of an incorrect base call and to the base call accuracy. We can summarise the relationship:
PHRED score   Probability of incorrect base call   Base call accuracy
10            1 in 10                              90%
20            1 in 100                             99%
30            1 in 1000                            99.9%
40            1 in 10,000                          99.99%
50            1 in 100,000                         99.999%
60            1 in 1,000,000                       99.9999%
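The rows of this table follow directly from the formula. As a small sketch (the function names are chosen here for illustration and do not come from fastqe), the conversion in both directions looks like this in Python:

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get P, the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Apply Q = -10 * log10(P) directly."""
    return -10 * math.log10(p)


# Recompute the summary table: error rate as "1 in N" and accuracy as a percentage.
for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p):,}, accuracy {100 * (1 - p):.4f}%")
```

Each increase of 10 in the score divides the error probability by 10, which is why the accuracy column gains one more 9 per row.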
FASTQE aids in the interpretation of these by mapping the Phred Quality Score and ASCII code to an Emoji:
PHRED  ASCII  Emoji
0      !      🚫
1      "      ❌
2      #      👺
3      $      💔
4      %      🙅
5      &      👾
6      '      👿
7      (      💀
8      )      👻
9      *      🙈
10     +      🙉
11     ,      🙊
12     -      🐵
13     .      😿
14     /      😾
15     0      🙀
16     1      💣
17     2      🔥
18     3      😡
19     4      💩
Much easier to understand! Here we show the emoji for Q scores below 20. A score of 20 corresponds to a 1 in 100 probability of a base-calling error, so bases with these scores are even less reliable. Given that there can be many thousands of reads and bases in even a small sequencing sample, the impact of these errors on downstream analysis and conclusions can be significant. Steps we will see later in the lesson can be used to remove low quality reads.
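The PHRED column of the table can be computed from the ASCII column by subtracting 33 from each character's code. Here is a minimal sketch of that decoding, assuming this offset of 33 (which matches ! encoding a score of 0); the function names are illustrative, and the quality string is a short excerpt from the example read shown earlier.

```python
PHRED_OFFSET = 33  # assumption consistent with the table: "!" (ASCII 33) is Q0


def decode_quality(quality_string):
    """Convert each quality character to its PHRED score."""
    return [ord(character) - PHRED_OFFSET for character in quality_string]


def fraction_below(scores, threshold=20):
    """Fraction of bases whose score is below the threshold."""
    return sum(score < threshold for score in scores) / len(scores)


# Ten characters taken from the middle of the example read's quality line,
# where the run of G characters gives way to lower scores.
excerpt = "GGG+38+35*"
scores = decode_quality(excerpt)
print(scores)
print(fraction_below(scores))
```

In this excerpt the G characters decode to 38, while + and * decode to 10 and 9, so half of these ten bases fall below Q20.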
Case Study
Let's look at the output of FASTQE on our case study data. Running fastqe on the sample file should give the following:
OUTPUT
$ fastqe /shared/fastqe/female_musk2.fastq.gz
Filename Statistic Quality
/shared/fastqe/female_musk2.fastq.gz mean [one emoji per read position, showing the mean quality at that position]
Challenge 5: What about the other sample?
Can you use fastqe on the other tissue sample? Which sample has better overall quality? Why?
You should run fastqe with the other sample's filename, /shared/fastqe/female_oral2.fastq.gz, and the output will be:
OUTPUT
Filename Statistic Quality
/shared/fastqe/female_oral2.fastq.gz mean [one emoji per read position, showing the mean quality at that position]
The mouth (oral) sample has inferior quality, on average, towards the end of the reads.
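To make the "mean" row concrete: it summarises, for each position along the reads, the average quality score over all reads in the file. The sketch below shows one way to compute such a summary in Python. It is an illustration under stated assumptions (an offset of 33, a plain arithmetic mean, hypothetical function names), not fastqe's actual implementation, which may differ in details such as rounding.

```python
import gzip
from itertools import islice, zip_longest
from statistics import mean

PHRED_OFFSET = 33  # assumption: "!" (ASCII 33) encodes Q0


def read_quality_lines(path):
    """Yield the quality line (line 4) of every record in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        # Records are 4 lines long, so quality strings are lines 4, 8, 12, ...
        for line in islice(handle, 3, None, 4):
            yield line.rstrip("\n")


def mean_quality_per_position(quality_strings):
    """Average the PHRED scores of all reads at each read position."""
    means = []
    # zip_longest pads shorter reads with None, so reads may differ in length.
    # Unpacking with * loads every quality string into memory at once.
    for column in zip_longest(*quality_strings):
        scores = [ord(ch) - PHRED_OFFSET for ch in column if ch is not None]
        means.append(mean(scores))
    return means


# Hypothetical toy data: three short reads, the last one shorter than the others.
toy_qualities = ["II5", "I5+", "I5"]
toy_means = mean_quality_per_position(toy_qualities)
print([round(value, 1) for value in toy_means])
```

On a real file, `mean_quality_per_position(read_quality_lines(path))` would give one mean per position, and a drop in the values near the end of the list corresponds to the lower-quality emoji at the end of the output.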
To look at the quality of these files in more detail, we will next use the emoji-less command line tool fastp.
Key Points
- The options available for fastqe
- The formula for Q scores
- FASTQE maps scores to emoji
- FASTQE can be used to quickly assess the quality of sequencing data