Quality control

Command-line or GUI Java app (cross platform compatibility, just needs a JRE installed)
Because it is considering short reads from across a whole genome, fastqc considers random data to be good. If the data is not expected to be completely random, then the QC scores may be lower than expected, so it is important to know what sort of data you have and are expecting in order to properly interpret the QC.

  • generates HTML page of QC report, but also a raw data file
    • HTML needs to be copied locally if analysis run on IRIDIS (or possibly logon via the GUI system? Not tried that yet…)#
    • HTML page has clear plots of various quality metrics. Usually averaged for all reads?
  • report is broken down into sections, each of which may pass, fail, or pass with a warning.
    • Basic Statistics
      • number of reads
      • average length
      • fastq encoding type
      • number of reads flagged as poor quality - flagged by fastqc, presumably?
      • GC%
    • Per base sequence quality
      • Average quality scores across length of read
      • usually drops a bit at the ends
      • shows interquartile range, top and bottom 10%, and median for each block of positions (each position if short, may combine if read is longer)
      • blue line is mean quality
      • Background colour coded - green is good, red is bad, orange is ok but lower quality.
      • overall low quality warning if median quality for a position is <25, fails if <20.
    • Per tile sequence quality
      • shows locations of reads on the original Illumina flowcell.
      • Not really sure exactly how to interpret this, but if it's all blue it's doing well.
    • Per sequence quality scores
      • Histogram of quality scores - number of reads that have each score
      • shows the distribution of scores - some lower scores are expected, but ideally most will be in a single peak at the higher end of the range.
    • Per base sequence content
      • The average propertions of bases at each position
      • would expect roughly equal, but not necessarily exactly 25% each
      • May vary if the sequence has skewed GC, but user should be aware if this is likely to be the case.
      • possibly more reads (e.g. exome, genome, scale) may normalise out the levels.
      • warns if >10% difference between bases.
    • Per sequence GC content
      • Creates a histogram of GC% of reads, and compares to a theoretical distribution
      • expects a normal distribution around the genome mean GC% (or an estimate of that)
    • Per base N content
      • Number of unrecognised (N) bases at each position
      • Should be 0 throughout
      • May be a little higher at the ends of the reads
    • Sequence length distribution
      • Histogram of read lengths
      • for Illumina, should be more or less the same
      • some techniques (Nanopore, PacBio, etc.) are more variable
      • warns if reads aren't all the same length
      • fails if any reads are length 0 (how do they still get counted as reads?)
    • Sequence duplication levels
      • measures the amount of sequence duplication
      • histogram of the number of sequences that are duplicated and to what level
      • 0 = not duplicated (unique), 1 = duplicated once (2 copies), etc…
      • Only uses a sample of reads, otherwise it would take too long to run
      • Hopefully, the sample taken will be properly representative of the whole run.
      • Ideally, don't want a large amount of duplication
      • Fails if <50% of sequences are unique
      • Some sample types may have more duplication than others? User should be aware already.
    • Overrepresented sequences
      • Lists any sequences that make up >0.1% of the entire samples
      • probably indicates contamination, or other error (e.g. untrimmed adapters)
    • Adapter content
      • Looks specifically for adapter sequences that have not been removed.
      • May check the whole sample? Unlike overrepresented/kmer checks that only use a sample.
    • Kmer content
      • checks a smaple of reads for short, overrepresented sequences (kmers) or various lengths (up to 7bp?)
      • Unlike overrepresented sequences, this shows the position of kmers within the read
      • so kmers at ends may be related - perhaps fragments of adapters, or ID barcodes.


  • Checks for contamination - how?
  • looks for matches to known genotypes - presumably must be enough matches to trigger?
  • Also looks at allele frequencies - in general population or just a set of given samples?
  • can also detect sample mixup if the other sample genotype is known.
  • Uses a sample BAM file, and a reference VCF or muilti-sample VCF for contaminant genotype information
  • Does it do it's own variant calling? Pretty sure the VCF is only supposed to be reference samples? That could slow things down a bit… Guess you want to look at low-level variants only, so maybe it just looks at every variable site and lets them filter out later?
  • Does not work with indels in the reference VCF, only single-nucleotide variants.
  • Creates a lot of output files! selfRG, selfSM, bestRG, bestSM, depthRG, depthSM
    • selfSM contains the most useful summary - estimate of contamination, as well as some other stats about matching the VCF.
  • *.depthRG
    • read depth separated by read group (in most cases will be one RG per sample)
    • %CUMUL column shows the percentage of SNPs called with at least that level of depth
    • Not quite sure how the 0 row works though. Shows some SNPs called!
  • *.depthSM
    • Same stats as .depthRG, but for all samples
    • presumably verifybamid can be run on a folder containing multiple samples?
  • *.selfRG/selfSM
    • FREEMIX column is the 0-1 estimate of contamination levels.
    • most of the rest are non-mandatory columns for unused options. These will probably all be NA

verifybamid doesn't seem to match up with the other measures… See sample PR0098 - verifybamid says 50% contamination! QC.sample says both contamination checks ok. Also depth of coverage doesn't match up - verifybamid said avg ~20, QC.coverage says 64! coverage.stats matches this. Number of reads in .selfRG is also very low - 2426225 vs 38802034 mapped to target in coverage.stats. ? where does verifybamid get it's information from? VCF and BAM I think, but how does it use it???

Southampton QC files (generated during pipeline)

  • pipeline QC output
    • mapping.stats
      • Output from Novoalign
      • Includes the run settings
      • gives the total numbers of reads, reads aligned, and unique alignments.
      • Histogram of fragment lengths (i.e. distance between pairs), will vary but should be around expected values (normal distribution?)
    • coverage.hist
      • histogram of depth of coverage (e.g. number of bases covered at each depth)
    • coverage.stats
    • _variant.stats
    • metrics.out
  • coverage_calcs
    • runs in exome pipeline - uses the same variables file as the rest of it
    • calculates coverage 1st
      • bamToBed takes a bam file as input and creates a bed file of the regions that are covered (does do some filtering by quality with samtools view)
        • bamToBed has to be set explicitly to look to stdin for input, so it can be piped directly from samtools.
        • outputs in BED6 format: CHR | START | END | NAME | SCORE | STRAND - score is MAPQ score by default THIS IS NOT DEPTH, it's a measure of uniqueness.
      • compares this bed file to the target coverage from the capture kit specific (although not from the variables file, which it probably should be) bed file.
        • uses coverageBed from bedtools
        • calculates breadth of coverage - the amount of each region defined in the capture-kit BED file that is covered in the sample BED file created above. coverageBed also works directly from BAM files, so I don't know why it was converted in the first place!
        • runs coverageBed 3 times:
          • -hist - outputs a histogram of coverage levels (check for most being at a reasonable level)
          • -a just specifies an input file which should be compared against a reference - e.g. the output of coverageBed, to be comapred against another bed file of expected coverage (note: coverage will probably not be 100%, allows a bit of failure - techniques aren't perfect yet),
          • -b - the reference file for comparison.
          • 3 runs are - once with -hist, and two non-histogram runs - one covering exact bed file, one with +-150bp. Reference files are for Agilent sureselect kit S04380110
      • calculates the total number of reads and the percentage of those reads that mapped to within the target area
        • must be one read per line, as it uses NR in awk to count the total numbers - check when the run is done
      • histogram file used to get the percentage of coverage depth
        • -hist option outputs a histogram of read depths. in this case, the 'feature' used to calculate this is the entire BED region merged, so it makes no distinction between capture regions.
        • The output columns are 1) the feature ("all") 2) depth of reads 3) Number of bases at that depth 4) size of feature (~50Mb, for the whole capture kit) 5) percent of bases are the depth.
  • get_metrics_new
    • gives the same information as coverage_calcs, plus some more from the looks of things.
    • doesn't get info from variables from the looks of things - comment at the top about changing file for multiple lanes
    • uses mapping.stats and metrics.out files, but I don't know exactly which scripts makes those!
    • does some find -exec commands to get the files that it wants to work on. Presumably these have all been created with the sampleID from variables, so if this script used that file perhaps it would be slightly more efficient?
    • Looks like it runs on the VAR file?
    • Creates the following files:
      • IDs
        • A list of all the samples processed by the script?
      • QC.coverage
        • coverage and mapping stats
        • number of reads
        • mapped to targets and +- 150 from targets
        • Overall average coverage, and percentages covered above 1x, 5x, 10x, and 20x coverage
      • QC.metrics
        • Some more mapping stats
        • unpaired reads, duplicates reads (PCR and optical), etc.
      • QC.sample
        • Summarises variant stats for each sample.
        • includes contamination checks
          • heterozygotes < 63%
          • alternate reads not > 1% of homozygous reads
          • sex prediction from X heterozygotes percentage

Other QC Metrics

Capture efficiency

The amount of data that maps correctly to the target regions. exomes should generate sequence data for specific regions, but as there is a degree of error and randomness to the capture, there may be some reads that map elsewhere in the genome. These may be unmapped or the may map to non-target regions.

PCR/optical duplicates

These are errors and potential sources of bias (e.g. PCR duplicates containing error called as a variant), and so should be as low as possible. Various tools exist to identify and delete duplicates, but a large number may be indicative of a technical issue.

This can be calculated from the number of correctly mapped reads divided by the total number of reads. It is unlikely that the efficiency will be very high (e.g. >80%) so this may indicate an error just as well as a low efficiency. Guo et al. suggest a typical exome capture efficiency should be between 40 and 70%

Paired mapping

With paired-end sequencing method, the read pairs should map with particular characteristics relative to each other. Orientation and spacing are considered by aligners, and inconsistent pairs are either discarded, or mapped separately as independent single reads. The higher the proportion of pairs mapping correctly, the better. Some pairs will not map, because of technical problems or real variants. An indel-specific realignment step may help to map reads that are disrupted by a structural variant, as there should be multiple pairs affected similarly. Isolated problems are likely technical errors.



?how to estimate?
Look at unmapped reads - if they don't match the human ref, where are they from?
Should be quite low, since the capture kits are human specific
Unlikely that contaminants will be closely related - unless there's a chimp in the lab! Likely bacteria/fungi so should not map to human reference.
VerifyBamID can (if used right!) detect down to ~!% contamination between human samples.
low contamination probably won't be called in most samples, but may be more significant when looking for somatic mutations (e.g. tumour variants in blood samples)

sample cross-contamination

Much more complex, and more dangerous - could introduce a variant, or mask one that is in the patient.

Ti/Tv ratio

Transtition/transversion ratio.
QC after variant calling.
There are twice as many possible transversion as transitions, so if all variants were due to random chance this ratio would be ~0.5. In empirical data from exome sequencing studies, this ratio is actually much higher - around 2.8-3.0 should be expected for a whole-exome capture.

heterozygote percentage

Seems to be the percentage of all variants that are heterozygous. get_metrics_new script sets a limit of 63%, and above that calls potential contamination. Why? this doesn't seem to be supported in the literature anywhere! Literature sources seem to use het/hom ratio, which would give a larger value than het/total? het/hom for whole genome may be ~2.0, which would be around 0.67 het/total. May be different with exome compared to whole genome variants?

I think this is all variants, not just X - because that would have a ver skewed result in males! X heterozygotes percentage is used to guess at sex - <50% is called male.

novel variants and false positives

Difficult to estimate, as a novel variant may be hard to tell apart from an error, but studies show that ~200 (or lower) variants not found in common databases are likely to be present in any exome. If many more unknown variants are being found, this may suggest that a number of false positives are present, and that the data will be unreliable for identifying the true variants. If databases are being queried online, it is probably worth having a check that they are accessible, as a missing database could lead to many more apparently novel variants being present.

Sex chromosome / autosome ratio

This should correspond with the known sex of the patient, although the ratios are unlikely to be exactly perfect. In females, the depth of coverage in the X and an autosome (in my pipeline I chose 15, as it has approximately the same number of features as X) should be roughly equal (~50 seems reasonable for an exome), while there should be no or very low coverage on the Y (there may be some mis-mapping to the pseudoautosomal region, this may be where most of the exome targets are located, but I'm not sure!). Males should have roughly equal X and Y coverage, which should be around half the depth of the autosomes.

I have added calculations for these depths (but not the actual ratios) to the coverage_calcs script.

X chromosome heterozygotes percentage

This is calculated in the get_metrics_new script, which I don't fully understand. So it's very possible I don't fully understand exactly what this part is trying to do.

Heterozygote ratio

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License