Southampton Pipeline

Basic exome pipeline

variables

  • a file containing run-specific variables (see the sketch after this list)
  • allows a run to be set up without having to alter the script files themselves.
  • variables include run ID, number of samples, location of sample files, etc.
  • specify FastQ format - can be useful if analysing older data?
    • aligner is able to support multiple formats?
    • better than running through FastQ groomer, or just easier?
  • capture kit used
    • different capture methods cover different regions
    • there are specific BED files for each kit, so that only the covered regions are considered
    • does this just speed things up, or does it increase specificity? The alignment itself is still against the whole genome, so presumably the BED file restricts the downstream calling/reporting rather than the alignment.
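
A minimal sketch of what the variables file might look like. The variable names are guesses based on the ones referenced later in the scripts ($sampleID, $indexedRefSequence, $lane1_pair1, $Qformat, $sotonDB):

    # variables - sourced by each script with '. variables'
    runID=RUN001
    sampleID=SAMPLE001
    Qformat=STDFQ                                     # FastQ quality encoding, passed to novoalign -F
    indexedRefSequence=/path/to/human_g1k_v37_index
    lane1_pair1=/path/to/SAMPLE001_L001_R1.fastq.gz   # read 1 of lane 1
    lane1_pair2=/path/to/SAMPLE001_L001_R2.fastq.gz   # read 2 of lane 1
    captureKit=/path/to/capture_kit_regions.bed       # per-kit BED file of covered regions
    sotonDB=diseaseName                               # optional in-house database (see script 3)

Because '. variables' just executes the file in the current shell, every <key>=<value> line becomes a shell variable the script can read as $<key>.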

1_alignment_mendelian

  • uses #PBS directives to request resources (see the sketch at the end of this section):
    • 1 node
    • 16 processor cores
    • 12 hours of time
    • Default RAM = 4000MB?
  • loads modules for alignment (novoalign) and for running as an MPI job (mpich)
  • sets a bunch of stuff I don't understand…
    • hosts file? ibhostfile / $PBS_NODEFILE: $PBS_NODEFILE lists the hosts allocated to the job (one line per allocated core), so ibhostfile is presumably a copy of it for mpiexec
    • #nprocs confirms the number of processors as requested previously?? probably just counted from the nodefile (wc -l $PBS_NODEFILE)
    • #run_exe stores the path to the novoalign executable file
      • not available unless loaded, so presumably module load puts it in a consistent, predictable location.
  • The file 'variables' contains some settings that are customised for each analysis
    • allows the run to be customised without having to update the actual script files
    • e.g. run ID, fastq file locations.
    • loaded with . variables
      • sourcing with . executes the file in the current shell, so every <key>=<value> pair becomes a variable accessible to the script as $<key>
  • Runs Novoalign on the sample files for each lane, as defined in the variables file (reconstructed command at the end of this section)
    • what exactly do the #mpiexec lines do? They start with #, so in bash they are comments: they look like disabled MPI variants of the novoalign call that were left in the script, which is why they seem like calls for novoalign with no input files. The \ at the end of each line continues the command over the line break, so each variant really is one long command. Lots of different options, too…
      • -f ibhostfile the mpiexec host file, listing the machines to launch processes on
      • -machinefile ibhostfile an alternative name for the same option
      • -n the number of MPI processes to start
      • Some novoalign options specific for HPC uses
        • --mmapoff makes each (node? core?) use its own copy of the genome index.
      • final run options (the line that actually runs, i.e. the one without any #mpiexec bit)…
        • -c 2 number of threads to use (per process, presumably)
        • -d $indexedRefSequence reference file location as defined in variables
        • -f $lane1_pair1 $lane1_pair2 read pair file locations from variables
        • -F $Qformat fastq format - novoalign supports multiple formats, see docs for details.
        • -i 200 30 expected fragment length and standard deviation - does it throw out pairs not matching, or just penalise them as improper pairs?
        • -o SAM sets sam output
        • $'@RG\tID:'$sampleID'_lane1\tSM:'$sampleID'\tPL:ILLUMINA\tLB:Library' sets the read group information required by downstream processes. Only the lane number changes between lanes; the only other variable is the sampleID, set from the variables file. GATK needs read group information even if the whole BAM is from a single read group.
        • -o SoftClip turns on soft clipping of reads to the best scoring local alignment. In addition to the previous -o SAM setting.
        • -k base quality calibration
        • -a remove adapter sequences if found - are adapters removed prior to analysis? Uses the default adapter sequence as nothing is specified.
        • -g 65 gap opening penalty
        • -x 7 gap extension penalty
        • --Q2Off note: it's Q2-Off, not Q20-ff. That caused some difficulties Googling it…
          • by default novoalign recalibrates quality scores for Q=2 bases
          • Illumina uses Q=2 as a "read segment quality control indicator": a general marker for a low-quality stretch at the end of a read
          • setting --Q2Off disables this and treats them as any other standard low-quality bases
  • output checks after running…
    • Looks at the last row of the stats file, and expects to see something starting with "#Done".
    • Novoalign stats file is the output of the stderr stream while novoalign is running.
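
A sketch of what the script's preamble presumably looks like; the directive values come from the notes above, but the exact module names are guesses:

    #!/bin/bash
    #PBS -l nodes=1:ppn=16            # 1 node, 16 processor cores
    #PBS -l walltime=12:00:00         # 12 hours
    # no memory request, hence the "default RAM = 4000MB?" question above

    module load novoalign             # module names are assumptions
    module load mpich

    cp $PBS_NODEFILE ibhostfile       # $PBS_NODEFILE lists the allocated hosts, one line per core
    nprocs=$(wc -l < $PBS_NODEFILE)   # so counting its lines recovers the 16 cores requested
    run_exe=$(which novoalign)        # path to the executable made visible by 'module load'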
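
And, reassembling the options above, the active (uncommented) novoalign call plus the completion check plausibly look like this; the output file names are invented:

    novoalign -c 2 -d $indexedRefSequence -f $lane1_pair1 $lane1_pair2 -F $Qformat \
        -i 200 30 -o SAM $'@RG\tID:'$sampleID'_lane1\tSM:'$sampleID'\tPL:ILLUMINA\tLB:Library' \
        -o SoftClip -k -a -g 65 -x 7 --Q2Off \
        > "$sampleID"_lane1.sam 2> "$sampleID"_lane1_stats.txt

    # novoalign writes its stats to stderr, so the redirect above captures them;
    # the script then checks that the last line of the stats file starts with "#Done"
    tail -1 "$sampleID"_lane1_stats.txt | grep -q '^#Done' || echo "lane 1 alignment incomplete"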

2_post_alignment_mendelian

  • requests 20GB of RAM (more precisely, requests 20gb; PBS size suffixes are case-insensitive, so g vs G and b vs B make no difference, though note the suffix letter b means bytes as opposed to w for words)
  • Loads samtools, picard, and the variables file
  • converts SAM to BAM (no BAM output option in novoalign?); these post-alignment steps are sketched at the end of this section
  • BAM files are sorted
    • using picard SortSam
    • sorted by coordinate (position)
  • if more than one lane is used, the BAM files are merged into a single record.
    • Merged BAM must then be sorted again - why do it this way rather than merging first and then sorting once? e.g. this way may do up to 4 sorts on various files!
  • picard is used to mark duplicate reads
    • reads that are identical and appear to have been duplicated during PCR, marked (not removed, by default) as they may bias variant calling. Picard identifies duplicates by identical alignment positions rather than 100% sequence identity, and keeps the copy with the highest total base quality as the primary read, so no consensus is taken.
  • BAM file indexed (samtools index), to speed up access.
  • Uses samtools for variant calling
    • a pipeline of samtools view, samtools mpileup, and bcftools (assembled at the end of this section)
    • possibly the recommended samtools pipeline that I never managed to get to work on e coli?
    • samtools view -bq 20 -F 1796 "$sampleID"_novoalign.bam |
      • -b sets output format to BAM
      • -q 20 skips alignments with mapping quality (MAPQ) below 20
      • -F 1796 ignores alignments with any of the SAM flags in the given value set. 1796 = 4 + 256 + 512 + 1024, i.e. unmapped, not primary, failed QC, and PCR duplicate.
    • samtools mpileup -EDSgu -d 2000 -f human_g1k_v37.fasta -L 2000 -F 0.05 - |
      • -E recalculate base alignment quality scores
      • -D output per-sample read depth - DEPRECATED
      • -S output per-sample strand bias P-value - DEPRECATED
        • both are now requested with -t DP,SP (the --output-tags option)
      • -g output in bcf format (similar to vcf, but compressed: binary vcf)
      • -u output uncompressed (e.g. still binary bcf, but with no compression applied)
      • -d 2000 maximum depth per position (per input file). If more than 2000 reads align, only the first 2000 are counted (apparently just the first reads encountered in the BAM, not ranked by alignment score)
      • -f human…v37.fasta reference file to use - is this copied or otherwise linked to? Because it's not in the analysis folder!
      • -L 2000 don't call indels if the average per-sample depth is >2000 (averaged across samples at that position, not over a region)
      • -F 0.05 minimum fraction of gapped reads required to consider an indel candidate - gapped meaning the read's alignment contains an indel, not that it's a paired-end read with an insert between the mates
    • bcftools view -Nvg - > "$sampleID"_raw.vcf
      • these options belong to the old bcftools bundled with samtools 0.1.x, which is why they're hard to find in the current bcftools documentation
      • -N skip sites where the reference base is not A/C/G/T
      • -v output potential variant sites only (in bcftools 1.x, view -v takes a comma-separated list of variant types, hence the confusion)
      • -g call genotypes at variant sites (takes no argument in the old version; -v and -g both imply -c, SNP calling)
      • the trailing '-' is the input file name: it tells bcftools to read from stdin, i.e. the piped mpileup output
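
A sketch of the post-alignment steps described above, assuming the old one-jar-per-tool picard style; file names and java memory settings are invented:

    # SAM to BAM (-b = BAM output, -S = SAM input)
    samtools view -bS "$sampleID"_lane1.sam > "$sampleID"_lane1.bam

    # coordinate-sort with picard SortSam
    java -Xmx18g -jar SortSam.jar INPUT="$sampleID"_lane1.bam \
        OUTPUT="$sampleID"_sorted.bam SORT_ORDER=coordinate

    # (with more than one lane: MergeSamFiles.jar on the per-lane BAMs, then SortSam again)

    # mark (not remove) PCR duplicates
    java -Xmx18g -jar MarkDuplicates.jar INPUT="$sampleID"_sorted.bam \
        OUTPUT="$sampleID"_novoalign.bam METRICS_FILE="$sampleID"_dup_metrics.txt

    # index for fast random access
    samtools index "$sampleID"_novoalign.bam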
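
The three variant-calling commands quoted above chain together like this (each trailing '-' tells the next tool to read from stdin):

    # 1796 = 4 (unmapped) + 256 (not primary) + 512 (failed QC) + 1024 (PCR duplicate)
    samtools view -bq 20 -F 1796 "$sampleID"_novoalign.bam |
        samtools mpileup -EDSgu -d 2000 -f human_g1k_v37.fasta -L 2000 -F 0.05 - |
        bcftools view -Nvg - > "$sampleID"_raw.vcf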

3_annotate_clinical

  • large script - 656 lines!
  • lots of processing, so will probably only detail the sections in it…
  • lots of filters mostly using awk to annotate the low quality calls in the vcf
  • filtering is performed in ANNOVAR format (so final file can be annotated with ANNOVAR, presumably…)
  • It looks like every filter is run through for every line - what happens if a variant fails on more than one criterion? Is it duplicated in the output? (Maybe there's a dup removal step I haven't got to yet…)
  • 0) vcfutils.pl varFilter -d 4 is used to remove all variants that have fewer than 4 supporting reads. Then header lines are removed (grep -v '#') to convert to annovar format.
  • 1) SNVs
    • note - most filters are awk scripts starting with 'BEGIN {FS=OFS="\t"}'. FS and OFS are the field separators in input and output files respectively; this line sets both to a single tab.
    • This means that each line of the data will be split into separate fields wherever there is a tab. These fields can be accessed using $1, $2, $3, etc. ($6 is the PHRED score, for example).
    • Note the >> redirects, so the lines are appended to the annovar file, rather than overwriting it and ending up with a file with only the last line!
    • PHRED score cutoff of 20 - why are het and hom variants checked separately? (sketch of one of these filters at the end of this section)
      • because variants are originally called as 1/1 or 0/1, and downstream the file needs HOM or HET.
  • 2) Insertions
    • PHRED >= 20
    • filters variants in homopolymer repeats differently - possibly just because of uncertainty over exact location?
    • again, HET and HOM processed separately to go from 1/1, 0/1 to HOM, HET.
  • 3) Deletions
    • deletion filters don't account for homopolymers - less common error, possibly?
  • 4) Annotate
    • ANNOVAR
    • Followed by more awk processing!
    • and some grepping, too!
    • annotated with genes/locations
      • annotate_variation.pl --buildver hg19 --splicing_threshold 10 -geneanno "$sampleID"_1.annovar /scratch/GENE_DATABASES/humandb/HG19_APR13/
      • This reorders some of the fields, so a quick awk script arranges it back how it should be.
      • It then annotates again for some reason?
        • exactly the same parameters - WHY? There must be some reason…
        • Then needs rearranging again…
    • annotate_variation.pl --buildver hg19 --filter --dbtype snp137NonFlagged "$sampleID"_2.annovar /scratch/GENE_DATABASES/humandb/HG19_APR13/
      • --buildver sets the genome build to use, but presumably the matching database files need to be set up beforehand (would assume so…)?
      • --filter tells annovar to annotate against a given database, which is specified with --dbtype (in this case dbSNP)
      • "$sampleID"_2.annovar is the annovar format file for this run.
      • Is the last part the location of the databases? So it will look in this location for the snp137NonFlagged db given earlier?
        • looks like it - folder contains a snp137NonFlagged.txt file that seems to be a tab-delimited list of variants with rs numbers.
    • Do it again with 1000 genomes database instead of dbSNP
  • 5) Check in-house databases (if specified)
    • If a database was specified in the variables file during run setup, this will load the appropriate databases and check them against the variants
    • Each disease area has 2 databases - a diseaseDB and NonDiseaseDB (yes, they're actually capitalised that way…)
    • The disease / database name should be stored in the $sotonDB variable
      • if there is no database specified, or it is misspelled or not recognised, then no database is loaded.
      • These databases are also text files
      • chromosome, start, end, ref, alt, and an info field
    • Also uses several further databases
      • somatic database in all cases? - looks like it
      • (all filter steps are followed by a re-ordering awk step)
      • EVS database
      • ljb_all (an annovar-included thing?) - likely ANNOVAR's packaging of the dbNSFP functional scores (SIFT, PolyPhen-2, etc.); 'LJB' are the initials of dbNSFP's authors
      • Complete Genomics - 46 genomes (all high-quality, good depth, unrelated). Closest thing to a gold standard atm.
  • 6) Further annotation
    • "assess and report novelty"
      • quite a complex-looking awk script, but I think that's mainly due to the number of columns and the associated variables that it uses…
      • Identifies and labels a few classes of variant, from fields that have been previously added during the annotation
      • it's probably not important to actually know what these fields are, so honestly I can't be bothered.
    • adds Grantham scores for nonsynonymous variants (see the sketch at the end of this section)
      • OK, this one is a bit confusing…
      • seems like it has a lookup table of amino acid changes with the corresponding Grantham score (a measure of the physicochemical distance between two amino acids), and somehow finds the appropriate one
    • mutability flag??
      • looking at regions that are known to be mutable?
      • has a threshold cutoff score
    • false positives
      • Fuentes et al. 2011 - probably worth reading to figure out just what's going on?
    • strand bias flag
    • depth flag
    • surrounding sequence script - this actually could be very useful for project stuff!
    • Writes a report on total numbers of variants of various classes - check if run ever starts!
      • e.g. in dbSNP database, not in dbSNP, seen in-house, etc.
      • even classifies transitions vs transversions, and works out some percentages too
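
A sketch of what one of the SNV quality filters plausibly looks like, based on the notes above. $6 = PHRED score comes from the notes; the genotype column ($7) and file names are guesses:

    # het SNVs passing the cutoff: relabel 0/1 as HET and append (>>) to the annovar file
    awk 'BEGIN {FS=OFS="\t"} $6 >= 20 && $7 == "0/1" {$7 = "HET"; print}' \
        "$sampleID"_snvs.tmp >> "$sampleID"_1.annovar

    # hom SNVs in a separate pass: relabel 1/1 as HOM
    awk 'BEGIN {FS=OFS="\t"} $6 >= 20 && $7 == "1/1" {$7 = "HOM"; print}' \
        "$sampleID"_snvs.tmp >> "$sampleID"_1.annovar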
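
And a guess at how the Grantham annotation could work mechanically: a standard awk two-file join, loading a lookup table of amino-acid pairs and their scores, then appending the matching score to each variant line. The file names and column numbers are invented:

    # grantham.tsv: tab-separated lines of <from_aa> <to_aa> <score>, e.g. R <tab> W <tab> 101
    awk 'BEGIN {FS=OFS="\t"}
         NR == FNR {score[$1 $2] = $3; next}   # first file: build the lookup table
         {print $0, score[$10 $11]}            # data file: append score for this AA change
        ' grantham.tsv "$sampleID"_annotated.txt > "$sampleID"_grantham.txt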