WRGL Pipeline

Genotyping

WRGL genotyping pipeline

Panels

WRGL panels pipeline

Programme (main) class

Args: path to MiSeq data folder

  • Uses AuxillaryFunctions to get paths to several folders within the MiSeq folder
    • these functions are passed the data folder as an argument and work out paths relative to it.
  • Creates ProgrammeParameters object
    • class defined in ProgrammeParameters.cs
    • reads from a local .ini config file
  • Creates ParseSampleSheet object
    • uses paths derived above to load and parse run sample sheet
  • Logs these events
    • using AuxillaryFunctions.WriteLog()
  • Sends each sample listed in the sample sheet (found using the ParseSampleSheet object's getAnalyses method) to the appropriate pipeline (P or G), based on the parsed sample sheet.
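
The dispatch step at the end of the list above can be pictured roughly as follows. This is only a sketch: the mapping of P to the panels pipeline and G to the genotyping pipeline is inferred from the page headings, and ChoosePipeline is a hypothetical helper, not a method from the project.

```csharp
using System;

public static class DispatchSketch
{
    // Hypothetical helper: maps the analysis code from the sample sheet
    // to a pipeline name. The P/G meanings are an assumption.
    public static string ChoosePipeline(string analysisCode)
    {
        switch (analysisCode)
        {
            case "P": return "panels";
            case "G": return "genotyping";
            default:
                // Assumption: an unrecognised code is treated as an error.
                throw new ArgumentException("Unknown analysis code: " + analysisCode);
        }
    }
}
```

Throwing on an unknown code is a design choice made for the sketch; the real programme may log the problem and carry on instead.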

AuxillaryFunctions [spelled as in project] class

  • LookupAmpliconID()
    • Looks through a list of BEDRecord objects to find a match for a given gVariant tuple
    • C# tuples are accessed by <tuple>.Item1, <tuple>.Item2, etc.
    • checks the chromosome (gVariant.Item1) first, then, if it matches, checks the start coordinate (gVariant.Item2).
    • returns the name from the BEDRecord object that matches.
  • WriteLog()
    • is passed a message and details, which are then appended to a local log file.
    • on first use in each run, the run ID is written on a separate line in the log file.
  • GetFastqDir()
    • fastq directory is 1 up from supplied directory?
  • GetRunID()
    • is based on length of folders array??
  • GetRootRunDir()
    • 4 up from the supplied directory?
  • GetLocalAnalysisFolderDir()
    • 5 up from the supplied directory?
  • SendRunFailEmail()
    • sends an email to the admin email defined in the parameters file in the event of an error
    • automatically attaches the log to the email
  • SendRunCompletionEmail()
    • sends an email to subscribed users when the run is complete.
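
As an illustration of the LookupAmpliconID() notes above, here is a minimal sketch. It assumes a gVariant is a Tuple<string, UInt32> holding chromosome and position, that the position is already in the BED file's coordinate space, and that a match means the position falls inside the record's interval. The notes only say the start coordinate is checked, so the interval test and the empty-string return for "no match" are guesses.

```csharp
using System;
using System.Collections.Generic;

public struct BEDRecord
{
    public string chromosome;
    public UInt32 start;
    public UInt32 end;
    public string name;
}

public static class AmpliconLookupSketch
{
    // Hypothetical re-implementation for illustration, not the project's code.
    public static string LookupAmpliconID(Tuple<string, UInt32> gVariant, List<BEDRecord> records)
    {
        foreach (BEDRecord record in records)
        {
            // Compare the chromosome (Item1) first and skip on a mismatch.
            if (record.chromosome != gVariant.Item1) continue;

            // Assumption: the position (Item2) must lie in the half-open range [start, end).
            if (gVariant.Item2 >= record.start && gVariant.Item2 < record.end)
                return record.name;
        }
        return ""; // Assumption: an empty string signals that nothing matched.
    }
}
```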

BEDRecord/ParseBED class

  • BEDRecord struct:
    public string chromosome;
    public UInt32 start;
    public UInt32 end;
    public string name;
  • ParseBED takes a path to a BED file, and creates a list of extracted BEDRecord objects
    • Requires BED4 format - chromosome, start, end, and name.
    • reads file line-by-line
      • skips blank lines
      • checks that none of the 4 columns is blank
        • if any is blank, fails and reports a malformed BED file.
    • Exposes a public list, getBEDRecords, that returns the processed list.
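
The parsing rules above could look something like the sketch below. It reads from a TextReader rather than a path so that it is easy to try out, assumes tab-separated columns, and throws a FormatException for a malformed line. All three choices are assumptions, as is the class name ParseBEDSketch.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct BEDRecord
{
    public string chromosome;
    public UInt32 start;
    public UInt32 end;
    public string name;
}

public static class ParseBEDSketch
{
    // Hypothetical sketch of BED4 parsing, not the project's ParseBED class.
    public static List<BEDRecord> Parse(TextReader reader)
    {
        var records = new List<BEDRecord>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length == 0) continue; // skip blank lines

            string[] fields = line.Split('\t'); // assumption: tab-separated
            if (fields.Length < 4)
                throw new FormatException("Malformed BED file: " + line);

            // None of the four BED4 columns may be blank.
            for (int i = 0; i < 4; i++)
            {
                if (fields[i].Trim().Length == 0)
                    throw new FormatException("Malformed BED file: " + line);
            }

            records.Add(new BEDRecord
            {
                chromosome = fields[0],
                start = UInt32.Parse(fields[1]),
                end = UInt32.Parse(fields[2]),
                name = fields[3]
            });
        }
        return records;
    }
}
```

To read from disk, something like ParseBEDSketch.Parse(File.OpenText(path)) would do, with the reader wrapped in a using block.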

FileManagement class

  • BackupFiles()
    • copies all run files (including fastq) to a backup directory
    • possibly /scratch/WRGL/ on Iridis?
    • the location should be defined in WRGLPipeline.ini
    • Includes MD5 hash checking for large files
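
A rough sketch of what an MD5-verified copy might look like is given below. The names BackupSketch, GetMD5 and CopyAndVerify are hypothetical. The notes say only large files are hash-checked; the size threshold is not known, so this sketch checks every file it copies.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class BackupSketch
{
    // Returns the MD5 digest of a file as a lowercase hex string.
    public static string GetMD5(string path)
    {
        using (MD5 md5 = MD5.Create())
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Copies a file (overwriting any existing copy), then compares the
    // digests of source and destination.
    public static void CopyAndVerify(string source, string destination)
    {
        File.Copy(source, destination, true);
        if (GetMD5(source) != GetMD5(destination))
            throw new IOException("MD5 mismatch after copying " + source);
    }
}
```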

ProgrammeParameters class

  • Opens the WRGLPipeline.ini configuration file and reads its lines into a dictionary. The dictionary is then used to populate the variables of the ProgrammeParameters object, which the rest of the programme can access.
  • The .ini file is formatted <key>=<value>; each line is split at "=" and added to the dictionary
    • This dictionary is then used to fill the config variables.
  • Config options include:
    • BED file
    • Iridis username and password key - includes functions to encrypt/decrypt password as needed.
    • NHSmail account details, for run completion emails.
    • reference files (genome, known indels)
    • paths to tools (java, gatk, samtools, snpeff). Are these local paths or paths on Iridis?
  • Includes properties ({get; set;}) for accessing the variables in the object.
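
The key=value handling described above can be sketched as below. IniSketch and ParseLines are hypothetical names, and three details are assumptions rather than anything stated in the notes: blank lines and lines without "=" are skipped, whitespace is trimmed, and the split happens at the first "=" only so that values may themselves contain "=".

```csharp
using System;
using System.Collections.Generic;

public static class IniSketch
{
    // Hypothetical sketch: turns <key>=<value> lines into a dictionary.
    public static Dictionary<string, string> ParseLines(IEnumerable<string> lines)
    {
        var parameters = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue; // assumption: skip blank lines

            // Split at the first "=" only; later "=" characters stay in the value.
            string[] parts = line.Split(new[] { '=' }, 2);
            if (parts.Length != 2) continue; // assumption: ignore lines without "="

            parameters[parts[0].Trim()] = parts[1].Trim();
        }
        return parameters;
    }
}
```

For a real file this would be called as IniSketch.ParseLines(File.ReadAllLines("WRGLPipeline.ini")), after which each property of the parameters object would be filled from the dictionary by key.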

ParseSampleSheet class

  • additional data structure (in class file, outside of class itself)
    • SampleRecord - contains sample ID, name, number, and analysis
  • ParseSampleSheet() - public method
    • external interface to the class; runs the private methods that actually parse the sample sheet.
  • PopulateSampleSheetEntries()
    • Opens sample sheet as a StreamReader
    • skips any header section - start of data section is defined by line starting "Data" and followed by a line with column headings
    • Column headings are stored in a dictionary, along with the index of that header.
    • Then for subsequent lines, splits and adds each column to the appropriate header in ColumnHeaders dictionary
      • dictionary stores each header name along with its column position
      • creates a temporary record for each sample, and pulls data by indexing into the split line using positions from the ColumnHeaders dictionary (slightly convoluted, but clearer and more flexible than directly indexing with numbers).
      • Adds each temporary record to a list of SampleRecords (SampleRecord struct defined outside class)
  • CommaOnlyLine()
    • returns True if the line only contains a comma. That's pretty much it.
  • GetExperimentName()
    • Looks through samplesheet file until it finds a line starting "Experiment Name"
    • splits line and saves second field as the experiment name
  • GetInvestigatorName()
    • Works as GetExperimentName(), but looks for "Investigator Name"
  • getAnalyses()
    • builds a list of analyses in a given sample sheet. Each list entry has the ROIfile (region of interest file, possibly a BED file?) and the pipeline for that sample - P or G.
    • sample sheet file is read in line-by-line:
      • empty lines are ignored
      • if line contains the word "Analysis" the stream is flagged as having passed the manifest header
      • the subsequent lines are split by commas, and the 0th and 1st fields are saved. These contain the ROIfile and the analysis type.
        • Saved in Analyses dict, so presumably there can't be different versions of the same analysis type on the same run?
        • There doesn't seem to be any link between analysis type and any specific sample - should all samples be the same??
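
The header-dictionary approach described under PopulateSampleSheetEntries() might look roughly like this. The column names Sample_ID, Sample_Name and Analysis are assumptions, as are the SampleRecord field names and the idea that the sample number is simply the row's order in the sheet. SampleSheetSketch is a hypothetical class, not the project's ParseSampleSheet.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct SampleRecord
{
    public string sampleID;
    public string sampleName;
    public int sampleNumber;
    public string analysis;
}

public static class SampleSheetSketch
{
    public static List<SampleRecord> Parse(TextReader reader)
    {
        var records = new List<SampleRecord>();
        var columnHeaders = new Dictionary<string, int>(); // header name -> column index
        bool passedDataMarker = false;
        int sampleNumber = 0;
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (!passedDataMarker)
            {
                // Skip the header section until the line starting "Data" (or "[Data]").
                if (line.TrimStart('[').StartsWith("Data", StringComparison.Ordinal))
                    passedDataMarker = true;
                continue;
            }
            if (line.Trim(',', ' ').Length == 0) continue; // blank or comma-only line

            string[] fields = line.Split(',');
            if (columnHeaders.Count == 0)
            {
                // First line after the marker holds the column headings.
                for (int i = 0; i < fields.Length; i++) columnHeaders[fields[i]] = i;
                continue;
            }

            sampleNumber++; // assumption: number samples in sheet order
            records.Add(new SampleRecord
            {
                sampleID = fields[columnHeaders["Sample_ID"]],
                sampleName = fields[columnHeaders["Sample_Name"]],
                sampleNumber = sampleNumber,
                analysis = fields[columnHeaders["Analysis"]]
            });
        }
        return records;
    }
}
```

Looking columns up by name means the parser keeps working if the sheet's columns are reordered, which is the flexibility the notes mention.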