Alignment Guide


What is alignment?

Alignment is the process of matching one or more query sequences to a reference in a way that maximises the score under a chosen scoring system, for example by matching the greatest number of individual nucleotides or by minimising the number of gaps. Alignment can identify variation from the reference, including substitutions (which appear as mismatches) and insertions and deletions (which create gaps in the reference and in the sample respectively). Alignment algorithms generally try to avoid creating many separate gaps, as several small gaps may be less biologically likely than a single larger one. Scoring therefore often includes a one-time gap opening penalty and a smaller gap extension penalty that is applied for each additional position in the gap, so the total penalty grows with the size of the gap. The exact scoring used will depend on the sequences being compared: an alignment of highly similar sequences, such as a patient sample against the human reference genome, will usually be scored more stringently than an alignment of sequences from different species, where more variation is expected.

Perfect alignment:
ACCGCTCA
||||||||
ACCGCTCA

Alignment with mismatch:
ACCGCGTCA
||||| |||
ACCGCATCA

Alignment with gap:
ACCGCT-A
|||||| |
ACCGCTCA
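
The scoring idea described above can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment and the penalty values (match +1, mismatch -1, gap opening -5, gap extension -1) are assumptions made for this example, not values taken from any particular aligner. Here the first gap position pays the opening penalty and each further position in the same gap pays the extension penalty.

```python
def score_alignment(ref_row, query_row, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped rows with an affine gap penalty.

    All scoring values are illustrative assumptions.
    """
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_in = None  # row holding the current run of gaps: "ref", "query" or None
    for r, q in zip(ref_row, query_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if r == "-" or q == "-":
            row = "ref" if r == "-" else "query"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if row == gap_in else gap_open
            gap_in = row
        else:
            score += match if r == q else mismatch
            gap_in = None
    return score


# The three example alignments from this page.
print(score_alignment("ACCGCTCA", "ACCGCTCA"))    # 8
print(score_alignment("ACCGCGTCA", "ACCGCATCA"))  # 7
print(score_alignment("ACCGCT-A", "ACCGCTCA"))    # 2

# One gap of length two scores better than two separate gaps of length one.
print(score_alignment("AC--GT", "ACTTGT"))  # -2
print(score_alignment("A-C-GT", "ATCTGT"))  # -6
```

The last two calls show why an affine penalty favours a single larger gap: both alignments contain four matches and two gap positions, but opening two gaps costs more than opening one and extending it.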

Multiple alignment involves aligning more than two sequences to each other at once, and is a complex procedure that may not have a single optimal solution.

Local vs. Global alignment

http://www.cs.umd.edu/class/fall2011/cmsc858s/Alignment.pdf

Global alignment

Useful when the entire sequence is expected to be similar
- e.g. closely related homologous genes
Aligns the full length of both sequences

e.g. Needleman-Wunsch alignment

Examples stolen from Stack Exchange.

5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3'
   |||||||||||    |||||||  |||||||||||||| |||||||
5' ACTACTAGATT----ACGGATC--GTACTTTAGAGGCTAGCAACCA 3'
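
As a rough illustration of how a global alignment like the one above can be computed, here is a minimal Needleman-Wunsch sketch in Python. It fills a score table for every pair of prefixes and then traces back from the bottom-right corner, so both sequences are aligned along their full length. The scoring values are assumptions for the example, and a simple linear gap penalty is used instead of separate opening and extension penalties to keep the code short.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty (illustrative scoring)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


total, top, bottom = needleman_wunsch("ACCGCTCA", "ACCGCTA")
print(total)   # 5
print(top)     # ACCGCTCA
print(bottom)  # ACCGCT-A
```

The table has one cell per pair of positions, so time and memory both grow with the product of the two sequence lengths. That is fine for short sequences but is the reason whole-genome aligners rely on indexing and heuristics rather than the plain algorithm.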

Local alignment

Used when the optimal alignment is likely to be a substring of a larger sequence
- e.g. aligning a primer sequence against a genomic reference sequence.
The optimal alignment may be shorter than the full length of either sequence.

e.g. Smith-Waterman alignment

5' ACTACTAGATTACTTACGGATCAGGTACTTTAGAGGCTTGCAACCA 3' 
             |||| |||||| |||||||||||||||
          5' TACTCACGGATGAGGTACTTTAGAGGC 3'
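
A minimal Smith-Waterman sketch is shown below for comparison with global alignment. The key differences are that no cell in the score table is allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero, so only the best-matching region is reported. The scoring values and the short test sequences are assumptions made for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty (illustrative scoring).

    Returns (score, start position in a, aligned part of a, aligned part of b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Clamping at zero lets a new alignment start at any cell.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, i, "".join(reversed(row_a)), "".join(reversed(row_b))


# A short primer-like query placed inside a longer made-up reference.
print(smith_waterman("TTTTACCGCTCATTTT", "ACCGCTCA"))
# (16, 4, 'ACCGCTCA', 'ACCGCTCA')
```

The returned start position is 0-based, so the query in this example aligns from the fifth base of the reference onwards, and the flanking T bases are simply left out of the alignment.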

Single or paired-end reads

Paired-end reads are generated by sequencing a fragment from both ends, producing two reads in opposite orientations that are an approximately known distance apart. Paired-end reads can be aligned more reliably than single-end reads, because the known spacing helps to place reads that fall in repetitive regions: if one read of the pair aligns uniquely, it constrains where its mate can be.
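
The way pairing helps in repeats can be illustrated with a toy example. Suppose one read of a pair matches several copies of a repeat while its mate matches only one place. Keeping the combination whose spacing is closest to the expected fragment size picks out the right copy. The function name pick_pair, the positions, and the expected spacing and tolerance are all hypothetical values chosen for illustration; real aligners use more detailed models of fragment size and orientation.

```python
def pick_pair(read1_hits, read2_hits, expected=300, tolerance=50):
    """Return the (pos1, pos2) pair whose spacing best fits the expected
    fragment size, or None if no pair is within the tolerance.

    All names and default values here are hypothetical.
    """
    best = None
    for p1 in read1_hits:
        for p2 in read2_hits:
            error = abs(abs(p2 - p1) - expected)
            if error <= tolerance and (best is None or error < best[0]):
                best = (error, p1, p2)
    return None if best is None else (best[1], best[2])


# Read 1 falls in a repeat with three candidate positions; read 2 is unique.
print(pick_pair([1000, 5000, 9000], [5280]))  # (5000, 5280)
```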

Representing alignments

Small alignments can be represented in the format shown above, but longer alignments need a computationally useful representation, along with a simple human-readable summary. One such summary is the CIGAR string.

CIGAR string

CIGAR stands for Compact Idiosyncratic Gapped Alignment Report.

A CIGAR string describes how an aligned sequence compares to a reference as a series of runs, each written as a length followed by a symbol for the type of run. The three most common symbols are: M - alignment match (the bases are aligned to each other, but may still be a mismatch), I - insertion relative to the reference, and D - deletion relative to the reference. Taking the top row of each example at the top of this page as the reference, the three alignments would be: 8M, 9M, and 6M1I1M.
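
As a small sketch of how such a string can be derived, the Python function below walks along a gapped pairwise alignment column by column, labels each column M, I or D, and then collapses runs of the same label. The function name cigar_from_rows is made up for this example, and it assumes the first argument is the reference row and the second is the aligned sequence.

```python
from itertools import groupby


def cigar_from_rows(ref_row, query_row):
    """Build a CIGAR string (M, I and D only) from two gapped rows."""
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")
    ops = []
    for r, q in zip(ref_row, query_row):
        if r == "-" and q == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if r == "-":
            ops.append("I")  # base present in the query but not the reference
        elif q == "-":
            ops.append("D")  # base present in the reference but not the query
        else:
            ops.append("M")  # aligned column, whether match or mismatch
    # Collapse consecutive identical operations into length + symbol.
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


print(cigar_from_rows("ACCGCTCA", "ACCGCTCA"))    # 8M
print(cigar_from_rows("ACCGCGTCA", "ACCGCATCA"))  # 9M
print(cigar_from_rows("ACCGCT-A", "ACCGCTCA"))    # 6M1I1M
```

Note that the mismatch example is nine columns long and still comes out as a single run of M, because M only says that the bases are aligned, not that they are identical.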

CIGAR strings are often used to represent the alignment of NGS reads against a reference genome. Here they may be useful as a quick indicator of sequencing quality, as there should be relatively few insertions and deletions.
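
One way to use a CIGAR string as a quick indicator is to total up the inserted and deleted bases. The sketch below parses a string containing only M, I and D and reports those totals, along with how many bases of the read and of the reference the alignment covers. The function name cigar_summary and the returned field names are invented for this example, and other CIGAR operations are deliberately rejected rather than handled.

```python
import re


def cigar_summary(cigar):
    """Summarise a CIGAR string that uses only M, I and D operations."""
    if not re.fullmatch(r"(\d+[MID])+", cigar):
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")
    totals = {"M": 0, "I": 0, "D": 0}
    for length, op in re.findall(r"(\d+)([MID])", cigar):
        totals[op] += int(length)
    return {
        # M and I consume bases of the read; M and D consume the reference.
        "query_length": totals["M"] + totals["I"],
        "reference_length": totals["M"] + totals["D"],
        "inserted_bases": totals["I"],
        "deleted_bases": totals["D"],
    }


print(cigar_summary("6M1I1M"))
# {'query_length': 8, 'reference_length': 7, 'inserted_bases': 1, 'deleted_bases': 0}
```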

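CIGAR strings are always written from the point of view of the sequence being aligned, not the reference: an insertion (I) is extra sequence in the read, and a deletion (D) is reference sequence that is missing from the read.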

Aligners

tools

page revision: 16, last edited: 26 May 2015 15:01


Bioinformatics Notes

Notes for clinical bioinformatics training
