Class MultiTileBclParser

  • All Implemented Interfaces:
    Iterator<BclData>

    public class MultiTileBclParser
    extends Object
    Parses .bcl.bgzf files that contain multiple tiles in a single file. This requires an index file that gives the BGZF virtual file offset of the start of each tile in the block-compressed BCL file.
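    As a minimal sketch of what a BGZF virtual file offset encodes, assuming the standard BGZF layout (upper 48 bits: byte address of the compressed block; lower 16 bits: offset within the uncompressed block). The class and method names below are hypothetical and are not part of Picard, and the sketch says nothing about how the tile index file itself is laid out:

```java
// Hypothetical helper, not part of Picard: packs and unpacks a BGZF virtual file offset.
final class BgzfVirtualOffsetSketch {
    private static final int SHIFT = 16;          // low 16 bits hold the within-block offset
    private static final long OFFSET_MASK = 0xFFFFL;

    private BgzfVirtualOffsetSketch() { }

    /** Combines a compressed-block address and an uncompressed within-block offset. */
    static long make(long blockAddress, int withinBlockOffset) {
        if (withinBlockOffset < 0 || withinBlockOffset > OFFSET_MASK) {
            throw new IllegalArgumentException("within-block offset must fit in 16 bits");
        }
        if (blockAddress < 0 || blockAddress >= (1L << 48)) {
            throw new IllegalArgumentException("block address must fit in 48 bits");
        }
        return (blockAddress << SHIFT) | withinBlockOffset;
    }

    /** Byte position in the compressed file at which the BGZF block starts. */
    static long blockAddress(long virtualOffset) {
        return virtualOffset >>> SHIFT;
    }

    /** Position inside the block once it has been decompressed. */
    static int withinBlockOffset(long virtualOffset) {
        return (int) (virtualOffset & OFFSET_MASK);
    }
}
```

    Seeking to a tile then amounts to jumping to blockAddress in the compressed file, inflating that block, and skipping withinBlockOffset bytes of the inflated data.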
    • Field Detail

      • currentTile

        protected int currentTile
        The current tile number
    • Constructor Detail

      • MultiTileBclParser

        public MultiTileBclParser​(File directory,
                                  int lane,
                                  picard.illumina.parser.CycleIlluminaFileMap tilesToCycleFiles,
                                  OutputMapping outputMapping,
                                  boolean applyEamssFilter,
                                  BclQualityEvaluationStrategy bclQualityEvaluationStrategy,
                                  TileIndex tileIndex)
    • Method Detail

      • initialize

        public void initialize()
      • makeCycleFileParser

        protected picard.illumina.parser.PerTileCycleParser.CycleFilesParser<BclData> makeCycleFileParser​(List<File> files,
                                                                                                          picard.illumina.parser.PerTileCycleParser.CycleFilesParser<BclData> cycleFilesParser)
        For a given cycle, return a CycleFilesParser. If the supplied cycleFilesParser is not null, it is closed.
        Parameters:
        files - The files to parse
        cycleFilesParser - The previous cycle file parser, null otherwise.
        Returns:
        A CycleFilesParser that will populate the correct position in the IlluminaData object with that cycle's data.
      • makeCycleFileParser

        protected picard.illumina.parser.PerTileCycleParser.CycleFilesParser<BclData> makeCycleFileParser​(List<File> files)
        Create a Bcl parser for an individual cycle and wrap it with the CycleFilesParser interface which populates the correct cycle in BclData.
        Parameters:
        files - The files to parse.
        Returns:
        A CycleFilesParser that populates a BclData object with data for a single cycle
      • next

        public BclData next()
        Return the data for the next cluster by:
          1. Advancing tiles if we reached the end of the current tile.
          2. For each cycle, getting the appropriate parser and having it populate its data into the IlluminaData object.
        Specified by:
        next in interface Iterator<BclData>
        Returns:
        The IlluminaData object for the next cluster
      • runEamssForReadInPlace

        protected static void runEamssForReadInPlace​(byte[] bases,
                                                     byte[] qualities)
        EAMSS is an Illumina-developed algorithm for detecting reads whose quality has deteriorated towards their end and revising the quality to the masking quality (2) if this is the case. This algorithm works as follows (with one exception):

        Start at the end of the read (high indices, at the right below) and calculate an EAMSS tally at each location as follows:
          if (quality[i] < 15) tally += 1
          if (quality[i] >= 15 and quality[i] < 30) tally = tally
          if (quality[i] >= 30) tally -= 2

        For each location, keep track of this tally, e.g.:
          Read starts at <- this end
          Cycle:       1   2   3   4   5   6   7   8   9
          Bases:       A   C   T   G   G   G   T   C   A
          Qualities:   32  32  16  15  8   10  32  2   2
          Cycle Score: -2  -2  0   0   1   1   -2  1   1    // The EAMSS score determined for this cycle alone
          EAMSS TALLY: 0   0   2   2   2   1   0   2   1
          X - Earliest instance of Max-Score (cycle 3)

        You must keep track of the maximum EAMSS tally (in this case 2) and the earliest (lowest) cycle at which it occurs. If and only if the max EAMSS tally >= 1, reassign the qualities from that cycle until the end (highest cycle) of the read as 2 (the masking quality). The output qualities would therefore be transformed from:

          Original Qualities: 32  32  16  15  8   10  32  2   2
        to
          Final Qualities:    32  32  2   2   2   2   2   2   2
          X - Earliest instance of max-tally/end of masking (cycle 3)

        IMPORTANT: The one exception is: if the max EAMSS tally is preceded by a long string of G basecalls (10 or more, with a single basecall exception per 10 bases), then the masking continues to the beginning of that string of G's. E.g.:

          Cycle:       1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
          Bases:       C   T   A   C   A   G   A   G   G   G   G   G   G   G   G   C   A   T
          Qualities:   30  22  26  27  28  30  7   34  20  19  38  15  32  32  10  4   2   5
          Cycle Score: -2  0   0   0   0   -2  1   -2  0   0   -2  0   -2  -2  1   1   1   1
          EAMSS TALLY: -2  -5  -5  -5  -5  -5  -3  -4  -2  -2  -2  0   0   2   4   3   2   1
          X - Earliest instance of Max-Tally (cycle 15)

        Resulting Transformation:
          Bases:              C   T   A   C   A   G   A   G   G   G   G   G   G   G   G   C   A   T
          Original Qualities: 30  22  26  27  28  30  7   34  20  19  38  15  32  32  10  4   2   5
          Final Qualities:    30  22  26  27  28  2   2   2   2   2   2   2   2   2   2   2   2   2
          X - Earliest instance of Max-Tally (cycle 15)
          X - Start of EAMSS masking due to G-Run (cycle 6)

        To further clarify the exception rule, here are a few examples:
          A C G A C G G G G G G G G G G G G G G G G G G G G A C T
          X - Earliest instance of Max-Tally
          X - Start of EAMSS masking (with a two base call jump because we have 20 bases in the run already)

          T T G G A G G G G G G G G G G G G G G G G G G A G A C T
          X - Earliest instance of Max-Tally
          X - We can skip this A as well as the earlier A because we have 20 or more bases in the run already
          X - Start of EAMSS masking (with a two base call jump because we have 20 bases in the run)

          T T G G G A A G G G G G G G G G G G G G G G G G G T T A T
          X - Earliest instance of Max-Tally
          X X - We can skip these bases because the first A counts as the first skip and, as far as the length of the string of G's is concerned, these are both counted like G's
          X - This A is the 20th base in the string of G's and therefore can be skipped
          X - Note that the A's previous to the G's are only included because there are G's further on that are within the number of allowable exceptions away (i.e. 2 in this instance). If there were NO G's after the A's, you CANNOT count the A's as part of the G string (even if no exceptions have previously occurred). In other words, the string of G's MUST end in a G, not an "exception".

        However, if the max-tally occurs to the right of the run of G's, then this is still part of the string of G's but does count towards the number of exceptions allowable, e.g.:
          T T G G G G G G G G G G A C G
          X - Earliest instance of Max-tally
        The first index CAN be considered as an exception; the above would be masked to the following point:
          T T G G G G G G G G G G A C G
          X - End of EAMSS masking due to G-Run

        To sum up the final points, a string of G's CAN START with an exception but CANNOT END in an exception.

        Parameters:
        bases - Bases for a single read in the cluster (not the entire cluster)
        qualities - Qualities for a single read in the cluster (not the entire cluster)
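        The basic tally-and-mask step described above can be sketched as follows. This is an illustration under stated assumptions, not Picard's implementation: it follows the thresholds given in this description (15, 30, masking quality 2, mask only when the max tally is >= 1) and deliberately omits the G-run exception, so it takes only the qualities. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the basic EAMSS step; the G-run exception is NOT implemented.
final class EamssSketch {
    static final int LOW_QUALITY_BELOW = 15;   // quality < 15 adds 1 to the tally
    static final int HIGH_QUALITY_FROM = 30;   // quality >= 30 subtracts 2 from the tally
    static final byte MASKING_QUALITY = 2;

    private EamssSketch() { }

    /**
     * Masks qualities in place and returns the 0-based index at which masking
     * starts, or -1 if the read is left untouched.
     */
    static int maskInPlace(byte[] qualities) {
        int tally = 0;
        int maxTally = Integer.MIN_VALUE;
        int indexOfMax = -1;

        // Walk from the end of the read (highest cycle) towards the start.
        for (int i = qualities.length - 1; i >= 0; i--) {
            int quality = qualities[i] & 0xff;
            if (quality < LOW_QUALITY_BELOW) {
                tally += 1;
            } else if (quality >= HIGH_QUALITY_FROM) {
                tally -= 2;
            }
            // ">=" keeps the earliest (lowest) cycle at which the maximum occurs.
            if (tally >= maxTally) {
                maxTally = tally;
                indexOfMax = i;
            }
        }

        if (maxTally < 1) {
            return -1;   // quality did not deteriorate enough to mask
        }
        Arrays.fill(qualities, indexOfMax, qualities.length, MASKING_QUALITY);
        return indexOfMax;
    }
}
```

        Applied to the first example above (qualities 32 32 16 15 8 10 32 2 2), the sketch starts masking at index 2 (cycle 3) and leaves 32 32 2 2 2 2 2 2 2.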
      • seekToTile

        public void seekToTile​(int tile)
        Clear the current set of cycleFileParsers and replace them with the ones for the requested tile.
        Parameters:
        tile - requested tile
      • hasNext

        public boolean hasNext()
        Specified by:
        hasNext in interface Iterator<ILLUMINA_DATA extends picard.illumina.parser.IlluminaData>
      • getTileOfNextCluster

        public int getTileOfNextCluster()
        Returns the tile of the next cluster that will be returned by PerTilePerCycleParser. Call this method before next() if you want to know the tile for the data returned by next().
        Returns:
        The tile number of the next ILLUMINA_DATA object to be returned
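        The call order (ask for the tile first, then call next()) can be illustrated with a toy iterator. This is only a sketch: TilePeekSketch, its String "clusters", and drain are hypothetical stand-ins, and the eager tile advance shown here is an assumption of the sketch, not a description of how Picard advances tiles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical toy iterator: clusters are plain Strings grouped by tile number.
final class TilePeekSketch implements Iterator<String> {
    private final Iterator<Map.Entry<Integer, List<String>>> tiles;
    private Iterator<String> clusters = Collections.emptyIterator();
    private int currentTile = -1;

    TilePeekSketch(Map<Integer, List<String>> clustersByTile) {
        this.tiles = clustersByTile.entrySet().iterator();
        advanceToNonEmptyTile();
    }

    // Move on to the next tile that still has clusters, if the current one is exhausted.
    private void advanceToNonEmptyTile() {
        while (!clusters.hasNext() && tiles.hasNext()) {
            Map.Entry<Integer, List<String>> entry = tiles.next();
            currentTile = entry.getKey();
            clusters = entry.getValue().iterator();
        }
    }

    @Override
    public boolean hasNext() {
        return clusters.hasNext();
    }

    /** Tile of the cluster that the next call to next() will return. */
    int getTileOfNextCluster() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return currentTile;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String cluster = clusters.next();
        advanceToNonEmptyTile();
        return cluster;
    }

    /** Reads everything, asking for the tile BEFORE each call to next(). */
    static List<String> drain(Map<Integer, List<String>> clustersByTile) {
        TilePeekSketch it = new TilePeekSketch(clustersByTile);
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            int tile = it.getTileOfNextCluster();
            out.add(tile + ":" + it.next());
        }
        return out;
    }
}
```

        Asking for the tile after next() would instead report the tile of the following cluster, which differs from the tile of the returned data at every tile boundary.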
      • verifyData

        public void verifyData​(List<Integer> tiles,
                               int[] cycles)
      • remove

        public void remove()
        Specified by:
        remove in interface Iterator<ILLUMINA_DATA extends picard.illumina.parser.IlluminaData>
      • close

        public void close()