5. Metrics

During the junction making stage, Portcullis generates a list of potential splice junctions by assuming any RNAseq alignment containing an ‘N’ cigar operation. For all potential junctions portcullis makes a number of observations about how the reads align around the region. The list of observations (or metrics) generated at this stage are described in here. In addition, should the user wish to spend additional computational power calculating additional metrics these are listed here.

We are able to create high confidence sets of genuine and invalid junctions using rules based on these metrics. However, this isn’t the full set of metrics used for filtering. Using the high confidence sets we are able to extract additional features. These are described here.

Finally, not all metrics are used by the machine learning algorithm because we found that either some metrics do not provide any additional information and are inferior to others. We went through a process of feature selection to settle on a set of features that can be usefully applied by our machine learning algorithm. These are listed here.

Before we do that, here is a small glossary of terms that we will use throughout:

  • Splice junction - Splice junctions are points on a DNA sequence at which ‘superfluous’ DNA is removed during the process of protein creation in higher organisms. For the purposes of this tool a splice junction is essentially the same as an intron.
  • Splice site - A junction has both a 5’ donor and 3’ acceptor site, which mark the start and end of the junction. Both donor and acceptor sites are 2bp long, and usually contain a canonical motif GT*AG, or its reverse complement on the negative strand.
  • Intron - Any nucleotide sequence within a gene that is removed by RNA splicing while the final mature RNA product of a gene is being generated. For the purposes of this tool an intron is essentially the same as a splice junction.
  • Anchor - The part of a spliced read that aligns to the genome. This will represent regions both upstream and downstream of the splice junction.
  • Multiple Spliced Reads (MSRs) - Reads that cover more than a single spliced junction. These types of reads will become more common as sequencers become capable of producing longer reads.

5.1. Junction metrics

These metrics are calculated in the junction analysis stage for the portcullis pipeline and are derived directly from the BAM file generated by the RNAseq mapper.

5.1.1. Splice site type (canonical_ss)

Most splice junctions contain distinctive dinucleotide pattern at the start and end of the splice junction. These patterns are called donors and acceptors respectively. If the potential junction has the typical GT_AG motif (or its reverse complement on the negative strand then it is considered a canonical splice site. Alternatively, if the junction contains either AT_AC or GC_AG or their reverse complements then it is considered a semi-canonical splice site. Other motifs at the donor and acceptor sites are considered novel splice sites.

5.1.2. Number of spliced alignments (nb_raw_aln)

This a count of the number of reads containing an ‘N’ cigar operation at exactly the genomic position represented by this junction. Because we use ‘N’ cigar operations to derive all potential junctions each junction will have a value of 1 or more for this metric.

5.1.3. Number of distinct alignments (nb_dist_aln)

This is the number of distinct / non-redundant spliced reads supporting the junction. Therefore duplicate reads are only counted as a single read for this metric.

5.1.4. Number of uniquely / multiply spliced alignments (nb_us_aln, nb_ms_aln)

These are counts of the number of spliced alignments that support this junction that either do or do not also support another junction.

5.1.5. Number of uniquely / multiply mapped alignments (nb_um_aln, nb_mm_aln)

These are counts of the number of spliced reads that uniquely mapped to the genome, or mapped to multiple sites in the genome. We use the MAPQ value in the alignment to determine if a read is uniquely mapped or not. We treat anything with a value over 30 as being uniquely mapped and anything less as being multi mapped. This should be a reliable threshold as most aligners will output MAPQ scores of 0, 1, 2 or 3 and then another high value, e.g. 50 or 60 to represent a unique read.

5.1.6. Number of BAM / Portcullis properly paired alignments (nb_bpp_aln, nb_ppp_aln)

The number of alignments that are properly paired, according to either the BAM file flag, or calculated by portcullis. Some aligners, such as tophat do not always produce useful figures for this property due to insert size limits being affected by introns. Other aligners such as STAR treat all paired reads as properly paired. Therefore it’s not useful to rely to heavily on this value unless you fully understand what the aligner is doing. To address this we make our own calculation of whether a alignment is properly paired. To do this we require the user to pass in the orientation of the reads. Typically, FR configuration for illumina RNAseq data. This allows us to do some basic checks like whether the pairs on on the same target sequence and if the orientation of the pairs makes sense according to what the user specified.

If the user does not specify an orientation for the reads, or if the data is single end, then the nb_ppp_aln will be 0.

5.1.7. Number of Reliable Alignments (nb_rel_aln, rel2raw)

This is the number of spliced reads supporting this junction that probably do not map elsewhere in the genome and are properly paired (according to portcullis calculations). If the user does not specify an orientation for the reads then this will be the same as the number of uniquely mapped reads.

The ratio of reliable reads to raw (nb_raw_aln) reads gives a good indication of whether the junction is genuine or not. Low ratios (near 0) indicate potentially unreliable junctions and high ratios (near 1) indicate potentially reliable junctions.

5.1.8. Shannon entropy (entropy)

This describes the shannon entropy of the spliced reads associated with this junction. This score is a measure of the amount of information present in the set of spliced reads supporting this junction. This metric is used to avoid problems attributed to calling splice junctions based on read counting alone, when read counting each read is assigned equal weight, even if they all start at the same position. Typically, you would expect a uniform distribution of starting positions for reads across the upstream anchor of the splice site, therefore a situation where all reads are stacked on top of one another should be treated as suspicious. Simply counting reads also makes it difficult to assign good minimum threshold values at which to call genuine junctions. The Entropy metric circumvents these problems. The entropy score is a function of both the total number of reads that map to a given junction, the number of different offsets to which those reads map and the number that map at each offset. Thus, junctions with multiple reads mapping at each of the possible windows across the junction will be assigned a higher entropy score, than junctions where many reads map to only one or two positions.

Although very useful, one disadvantage of the entropy score is that it does not take into account the quality of the reads contained within it, for example the number of mismatches present.

Entropy for each junction \(j\) is calculated based on the starting offsets of split reads supporting the junction. The following equations:

\[p_{i} = r_{i} / T\]
\[H_{j(s,e)} = - \sum_{i=s}^{e}(p_{i} \log_{2} p_{i})\]

where:

  • \(j(s,e)\) defines the left anchor region of the junction, starting at \(s\) and ending at \(e\)
  • \(r_i\) is the number of split reads supporting the junction that start at offset \(i\)
  • \(T\) is the total number of split reads supporting the junction

Shannon Entropy scores are also used in TrueSight and SPANKI.

5.1.9. Mean mismatches (mean_mismatches)

This is the mean number of mismatches found across all spliced reads supporting the junction. This includes any mismatches at any point along the spliced read, which includes mismatches even if they are the otherside of another junction in the case of an MSR. Originally described in TrueSight paper.

5.1.10. Mean read length (mean_readlen)

The average read length for spliced alignments supporting this junction.

5.1.11. Anchor lengths (max_min_anc, maxmmes)

This is the maximum of the minimum (shorter) anchor sizes of all spliced reads associated with this junction. Again, all upstream and downstream junctions contained within MSRs are collapsed for the purposes of this metric. Originally used in TrueSight

MaxMMES (Maximum of the Minimal Match of Either Side of exon junction) takes into account mismatches in the anchors on either side of the junction. For each spliced read associated with the junction, we look at both anchors. The score for each anchor is the anchor length minus any mismatches to the reference. The minimal score from either the upstream or downstream anchor is taken. Then from these scores the maximum is taken from all spliced reads. The MaxMMES for perfectly aligned reads should be the same as Max-Min-Anchor score. Therefore the difference between the two metrics is worth considering to gain an insight into how well the reads are mapping for a given junction. Originally described in Wang et al, 2010

5.1.12. Hamming distance at 5’ and 3’ splice sites (hamming5p, hamming3p)

Aligners can often make incorrect alignments around repeated genomic locations. In these instances it is good to know whether the region on the on the left side of the donor site and the left side of the acceptor site, in addition to the region on the right side of the donor site and the right side of the acceptor site are similar. In this is the case then it is likely that the false splice alignments have been made. We record both figures in terms of the hamming distances between the regions. Low scores indicate similarity, and therefore high change of alignment to a repeat region, high scores indicate difference and therefore low chance of alignment to a repeat region. Originally used in SPANKI.

_images/hamming.png

5.1.13. Junction group metrics (uniq_junc, primary_junc)

We analyse junctions within a given loci to determine if they are isolated or have other junctions sharing splice sites or otherwise overlapping. The uniq_junc boolean metric determines whether or not there are any other junctions within this junctions region. This helps to determine if this junction might be involved in alternative splicing.

In addition, if this is not a unique junction, then this is a primary junction if it has the most spliced reads when compared to the other junctions in this grouping. If this is a unique junction, then it is also a primary junction.

5.1.14. Number of Upstream and Downstream Junctions (nb_up_juncs, nb_down_juncs)

The number of upstream and downstream junctions contained within any MSRs associated with this junction. Will be 0 for junctions without any MSRs.

5.1.15. Distance to nearest Upstream and Downstream Junctions (dist_2_up_junc, dist_2_down_junc, dist_nearest_junc)

Specifies the distance to the nearest junction detected upstream and downstream respectively (and the minimum or both).

5.1.16. Split Read Overhangs across each junction (JADxx)

Additional columns in the tab file are provided to represent the quantity of split read overhangs across each junction, up to 20bp upstream or downstream. This is similar, but more restricted, than the implementation in finesplice. The reason for the restriction is to ensure a consistent set of metrics (20) for all read lengths. The idea of this set of metrics in general is to provide a more finegrained indication of the entropy of the junction.

\[O_{j}^{i} = \min(L_{j}^{i}, r_{j}^{i})\]

where:

  • \(L_{j}^{i}\) defines the length of the left arm of read \(i\) across the junction \(j\), trimmed to the first mismatching position
  • \(R_{j}^{i}\) defines the length of the right arm of read \(i\) across the junction \(j\), trimmed to the first mismatching position

We increment a vector \(N_{i}^{j}\) where i ranges from 1 to 20 for each junction representing pileups of \(O_{j}^{i}\).

Using this vector we are able to provide some potential indications of whether the junction is genuine or not. To this end we have to columns marked: suspicious and pfp (for potential false positive).

5.2. Extra metrics

These metrics are only calculated if the user specifies that they wish to do additional processing. Note, that calculating these metrics can take a significant amount of time and has no direct influence on downstream filtering. It only makes sense to request extra processing if the user is specifically interest in these metrics.

5.2.1. Multiple Mapping Score (mm_score)

The multiple mapping score is the number of spliced reads associated with the junction divided by the number of times those same reads are found mapped anywhere in the genome. Therefore a score of 1 indicates that all spliced reads associated with the junction are only found in this junction. A low score would indicate that the those reads map to multiple locations across the genome. Originally described in TrueSight paper.

5.2.2. Unspliced coverage around junction (coverage)

When considering unspliced reads around a junction site, you would typically expect to see a tailing off of reads towards the 5’ junction boundary, and a ramping up after the 3’ junction boundary. However, in practice this is complicated by MSRs, alternative splicing and junctions near sequence ends.

5.2.3. Number of upstream and downstream flanking alignments (up_aln, down_aln)

This is a count of the number of unspliced reads aligning upstream of the splice junction, that overlap with the upstream anchor. Caution must be taken interpreting this metric closely packed introns could mean the presence of MSRs exclude the possibility of getting any unspliced upstream alignments. In addition, if the junction is close to the sequence start, it maybe that no unspliced upstream alignments are possible either.

5.2.4. Number of Samples

Portcullis can only be used on a single sample and will therefore always set to 1. Note that although portcullis can take multiple BAM files as input it will merge them at treat them as a single sample. However, downstream of portcullis it’s useful to have a placeholder for the number of samples if applying set operations to multiple junction files. For example, if you were to create a union of 5 junction files it can be useful to know how many of those files contained the junction. That information is put into this metric.

5.3. Extracted metrics

By applying a set of rules based to junctions annotated with the metrics described in the previous section it is possible to define a subset of valid and invalid junctions with very high precision. However, there will invariably be many junctions left over that do not fit into either category. To assist with categorising the remaining junctions we use information from the high confidence sets to create additional metrics which are then calculated for all junctions. These extra metrics are described here:

5.3.1. Intron Score

Generally, long introns are not valid but mean intron lengths deviate wildly between species, hence we can’t reliably filter on this criteria a priori. By scanning the positive set we can find the length of the intron at the 95th percentile \(L_{95}\) and then use this as a starting point for when junctions with excessively large introns look suspicious.

If \(L^{j} < L_{95}\) then we assign a score of 0, otherwise we assign a score of \(-ln(L^{j} - L_{95})\).

Metric originally used in TrueSight.

5.3.2. Splicing signal

By analysing the makeup of the genome around junctions in both the positive and negative sets we can try to get an idea whether certain genomic features are indicative of genuine junctions or not. A commonly used method to do this is to build markov models for k-mers upstream and downstream of the donor and acceptor splice sites in the junctions.

\[\begin{split}SS_{j(s,e)} = ln \sum_{i = s-3+k}^{s+19} \frac{P_{td}(X_{i}|X_{i-k}...X_{i-1})}{P_{fd}(X_{i}|X_{i-k}...X_{i-1})} \\ + ln \sum_{i = e-20+k}^{e+19} \frac{P_{ta}(X_{i}|X_{i-k}...X_{i-1})}{P_{fa}(X_{i}|X_{i-k}...X_{i-1})}\end{split}\]

where:

  • \(P_{td}\) is the probability of a true donor given the following sequence
  • \(P_{fd}\) is the probability of a false donor given the following sequence
  • \(P_{ta}\) is the probability of a true acceptor given the following sequence
  • \(P_{fa}\) is the probability of a false acceptor given the following sequence

Metric originally used in TrueSight.

5.3.3. Log deviation between expected and observed junction overhangs

We extend the observed pileup counts found by Split Read Overhangs across each junction (JADxx) to represent the log deviation between the observed and expected counts at each position in the junction to give us more discriminative power across datasets. To do this we use the following formula:

\[x_{i}^{j}=\log_{2}(\frac{N_{i}^{j}}{E_{i}^{j}})\]

where:

  • \(E_{i}^{j}\) is the expected read count at this position in the junction assuming a uniform distribution of all observed split reads for this junction.

Note

Yes, theoretically this could be calculated in the junction analysis stage of portcullis!

5.4. Final metrics

Not all metrics turned out to be useful for determining whether a junction is genuine or not. We went through a process of feature selection and settled on the final set of metrics used in the machine learning part of portcullis. Those are listed here:

The output from the machine learning component is a probability score representing the liklihood that the junction is genuine or not. That value is represented by the score metric.