Base Quality Distribution solved by 233

Feb. 21, 2014, 5:37 p.m. by Rosalind Team

Topics: Bioinformatics Tools, NGS

All That is Said Literally

Figure 1. Per base quality scores, good dataset.
Figure 2. Per base quality scores, poor dataset.

Another simple quality metric for the QC is a base call quality. As often quality of reads degrades over the course of a sequence it is common practice to determine the average quality of the first, second, third,...nth base by just averaging over all reads in a file.

In the Figure 1 shown an example of a good quality dataset. Read line here is a median value, yellow box - inter-quartile range (25-75%), whiskers - 10% and 90% points. As you can see the sequence reads are 40 bases long, and median value do not fall below 37.

In the Figure 2 shown a poor dataset, and as you can see quality distribution is much wider, and average sequence quality (depicted by the blue line) is steadily declining.

In some Illumina kits the sequence quality goes up a bit first before it starts to decline, because the machine starts sequencing at lower intensity.


Quality of the bases can vary depends on position in read due to nature of the sequencing procedure. One can check this quality distribution using "Per Base Sequence Quality" module of the FastQC program.

Average accepted quality values is a 10 for the lower quartile and 25 for median. If the values falls below this limit, then the module returns a warning.

Note that for the reads >50bp long FastQC will group the bases. To show data for every base in the read use "--nogroup" option.

Given: FASTQ file, quality threshold $q$

Return: Number of positions where mean base quality falls below given threshold

Sample Dataset


Sample Output


Programming Shortcuts

We can use already familiar .letter_annotations dictionary from “Read Quality Distribution”

Please login to solve this problem.