6. Junctools¶
Portcullis contains a tools suite called junctools for interpretting, analysing
and converting junction files. A description of all tools within
junctools can be displayed by typing junctools --help
at the command line:
usage: This script contains a number of tools for manipulating junction files.
[-h] {compare,convert,gtf,markup,set,split} ...
optional arguments:
-h, --help show this help message and exit
Junction tools:
{compare,convert,markup,set,split}
compare Compares junction files.
convert Converts junction files between various formats.
gtf Filter or markup GTF files based on provided junctions
markup Marks whether each junction in the input can be found in the reference or not.
set Apply set operations to two or more junction files.
split Splits portcullis pass and fail juncs into 4 sets (TP, TN, FP, FN) based on whether or not the junctions are found in the reference or not.
If this help message does not appear then please follow the instructions in the next section.
Note
Junctools is designed to support portcullis tab delimited files as well as many commonly used flavours of the bed format. Junctools will generally auto-detect the format based on the file extension and interpret the file as necessary.
6.1. Installation¶
If you have proceeded through the installation steps and have run make install
then junctools will be installed along with portcullis as a python package to the
location specified by --prefix
when running the configure
script. Should
prefix not be specified then junctools will be installed to /usr/local
. The
junctools library reside in the $(prefix)/lib/python3.x/site-packages
(where x
is the minor version of python installed on your system) directory and
the junctools executable will reside in $(prefix)/bin
. If you wish to install
junctools into a specific python environment or if you wish to install junctools
without portcullis, first make sure that you chosen environment has precedence on
your system then then cd <portcullis_dir>/scripts
, and then type python3 setup.py install
.
Note
When you type make uninstall
for portcullis you will also uninstall
junctools from the location specified by $(prefix). Keep this in mind if you
installed to a system directory you will if you prefix sudo
to the uninstall command.
Alternatively, if you installed junctools manually to a specific python environment then you
can uninstall junctools with: pip3 uninstall -y junctools
.
6.2. Compare¶
This tool compares one or more junction files against a reference file.
The usage information for this tool is as follows:
usage: Compares junction files.
[-h] [-s] [-l LABELS] [-m] reference input [input ...]
positional arguments:
reference The junction file to treat as the reference
input One or more junction files to compare against the
reference
optional arguments:
-h, --help show this help message and exit
-s, --use_strand Whether to use strand information when building keys
-l LABELS, --labels LABELS
Path to a file containing labels for the reference
indicating whether or not each reference junction is
genuine (as generated using the markup tool). If
provided this script produces a much richer
performance analysis. Not compatible with '--
multiclass'
-m, --multiclass Breakdown results into multiple classes: 1) Matching
intron 2) Two matching splice sites but no matching
intron (i.e. splice sites from different introns) 3)
One matching splice site 4) No matching splice sites
Essentially this tool can be run in one of three modes depending on the options
selected. The first mode does a information retrieval style comparison of the input
files against a reference, as it’s not possible to identify true negatives. The
output contains counts of junctions, plus true positives, false positives and false
negatives, along with precision, recall and F1 scores. The example below shows
output from comparing 6 bed files against a reference bed using the following command
junctools compare ../ref.bed *.bed
:
Reference:
- # total junctions: 158156
- # distinct junctions: 158156
File distinct total TP FP FN REC PRC F1
f1.bed 146800 146800 135570 11230 22586 85.72 92.35 88.91
f2.bed 146778 146800 0 146778 158156 0.00 0.00 0.00
f3.bed 130905 130905 129103 1802 29053 81.63 98.62 89.33
f4.bed 130905 130905 129103 1802 29053 81.63 98.62 89.33
f5.bed 118865 118865 117569 1296 40587 74.34 98.91 84.88
f6.bed 117613 117613 113952 3661 44204 72.05 96.89 82.64
Mean recall: 65.89
Mean precision: 80.90
Mean f1: 72.51
Note
Please keep in mind that if comparing predicted junctions from a real RNAseq sample against a reference that the sample may contain genuinely novel junctions that will appear as false positives. Generally we use this method on simulated datasets where we know the definitive set of true junctions.
The second mode provides a more detailed analysis which is useful for comparing portcullis results against marked up junctions from an alignment tool. In this case we markup junctions from an alignment tool using the markup tool . This is essentially a list specifying whether or not each junction found by the aligner is present in a reference or not. We can then compare results from portcullis against the marked up alignment junctions. This gives us a definitive set of false negative junctions, i.e. junctions from the aligner that were genuine but incorrectly marked as negative by portcullis.
Finally, the third mode is useful for comparing junctions from real RNAseq datasets against a real reference. This breaks down results into the 4 classes:
- 1 - Matching intron
- 2 - Two matching splice sites but no matching intron (i.e. splice sites from different introns)
- 3 - One matching splice site
- 4 - No matching splice sites
This approach allows the user to better understand the set of false positives produced from real datasets, and can give some indication of whether a junction is a novel junction or a false positive.
Note
By default we do not use strand information when determining the location
of a junction. To clarify, it is possible that bed file contains multiple junctions with
the same sequence, start and stop sites but with a different strand. By default
junctools compare
will collapse these as a duplicate junction. Although
not immediately intuitive this allows us to circumvent problems from junctions
that have unknown strand. This is important as some tools do not output strand
information. However, should you wish to disable this feature you can do so
with the --use_strand
option.
6.3. Convert¶
This can convert junction files between various commonly used formats. The conversion tool supports the following commonly used formats:
- bed = (Input only) BED format - we automatically determine if this is BED 6 or 12 format, as well as if it is intron, exon or tophat style).
- ebed = (Output only) Portcullis style exon-based BED12 format (Thick-start and end represent splice sites).
- tbed = (Output only) Tophat style exon-based BED12 format (splice sites derived from blocks).
- ibed = (Output only) Intron-based BED12 format.
- bed6 = (Output only) BED6 format (BED6 files are intron-based).
- gtf = (Input only) Transcript assembly or gene model containing transcript and exon features. NOTE: output will only contain junctions derived from this GTF.
- gff = (Input only) Transcript assembly or gene model containing introns to extract. NOTE: input must contain “intron” features, and output will only contain these introns represented as junctions.
- egff = (Output only) Exon-based junctions in GFF3 format, uses partial matches to indicate exon anchors.
- igff = (Output only) Intron-based junctions in GFF3 format
In addition we support the following application specific tab delimited formats:
- portcullis = Portcullis style tab delimited output.
- hisat = HISAT style tab delimited format.
- star = STAR style tab delimited format.
- finesplice = Finesplice style tab delimited format.
- soapslice = Soapsplice style tab delimited format.
- spanki = SPANKI style tab delimited format.
- truesight = Truesight style tab delimited format.
The usage information for the conversion tool looks like this:
usage: Converts junction files between various formats.
[-h] -if INPUT_FORMAT -of OUTPUT_FORMAT [-o OUTPUT] [-is] [-d] [-s]
[-r] [--index_start INDEX_START] [--prefix PREFIX] [--source SOURCE]
input
positional arguments:
input The input file to convert
optional arguments:
-h, --help show this help message and exit
-if INPUT_FORMAT, --input_format INPUT_FORMAT
The format of the input file to convert.
-of OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
The output format.
-o OUTPUT, --output OUTPUT
Output to this file. By default we print to stdout.
-is, --ignore_strand Whether or not to ignore strand when creating a key for the junction
-d, --dedup Whether or not to remove duplicate junctions
-s, --sort Whether or not to sort the junctions. Note that sorting requires all junctions to be loaded into memory first. This maybe an issue for very large input files.
-r, --reindex Whether or not to reindex the output. The index is applied after prefix.
--index_start INDEX_START
The starting index to apply if the user requested reindexing
--prefix PREFIX The prefix to apply to junction ids if the user requested reindexing
--source SOURCE Only relevant if output is GFF format, use this option to set the source column in the GFF
Note
The user can also use the conversion tool to deduplicate, sort and reindex junction files.
6.4. GTF¶
Provides a means of manipulating or analysing GTF files using a junctions file. Three modes are currently supported, the first two filter and markup, will process a GTF file and either remove, or mark, transcripts and their associated exons, if junctions within that transcript are not supported by the provided junction file. In compare mode, we benchmark GTF files based on whether junctions present are found in a reference junction file. This gives junction-level accuracy statistics as well as transcript-level stats.
Usage information follows:
usage: This script contains a number of tools for manipulating junction files. gtf
[-h] [-is] -j JUNCTIONS [-o OUTPUT] mode input [input ...]
GTF modes:
filter = Filters out transcripts from GTF file that are not supported by the provided
junction file.
markup = Marks transcripts from GTF file with 'portcullis' attribute, which indicates
if transcript has a fully supported set of junctions, or if not, which ones are
not supported.
compare = For each GTF provided in the input. compare mode creates statistics describing
how many transcripts contain introns that are supported by a junction file.
positional arguments:
mode GTF operation to apply. See above for details. Available options:
- filter
- markup
- compare
input The input GTF file to convert
optional arguments:
-h, --help show this help message and exit
-is, --ignore_strand Whether or not to ignore strand when looking for junctions
-j JUNCTIONS, --junctions JUNCTIONS
The file containing junctions that should be found in the GTF.
-o OUTPUT, --output OUTPUT
The filtered or markedup GTF output. By default we print to stdout.
6.5. Markup¶
This tool marks whether each junction in the input can be found in the reference or not. Output from the tool is a line seperated list of 1’s (indicating junction is found in the reference) and 0’s (indicating the junction is not found in the reference). Output is written to a file with the same name as the input except a ‘.res’ extension is added. Usage information follows:
usage: Marks whether each junction in the input can be found in the reference or not.
[-h] [-o OUTPUT_DIR] [-s] reference input [input ...]
positional arguments:
reference The junction file to treat as the reference
input One or more junction files to compare against the
reference
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
If output dir is specified this will create output
files for each input file with a .res extension
indicating whether or not the junction was found in
the reference. By default we write out a .res file in
the same directory as the input file was found in.
-s, --use_strand Whether to use strand information when building keys
6.6. Set¶
Apply set operations to two or more junction files. This tool supports various different ways to apply set operations between junction files. First you can merge two or more junction files using the following modes:
- intersection = Produces the intersection of junctions from multiple input files
- union = Produces the union of junctions from multiple input files
- consensus = If there are 3 or more input files, the consensus operation produces a merged set of junctions where those junctions are found across a user-defined number of input files
Output from these modes potentially involves mergeing multiple junctions from various files into a single representative. When this occurs junction anchors are extended representing the most extreme extents found across all junctions at the given site. In addition, the junction score is modified according to the setting selected by the user, by default this involves summing the scores of all junctions, although the user can alternatively choose to take the min, max or mean of the values.
The following modes only support two input files and produce an output file containing junctions which are taken directly from the input without modification:
- subtract = Produces an output set of junctions containing all junctions present
- in the first input file that also are not found in the second file
- filter = Produces an output set of junctions containing all junctions present
- in the first input file that are also found in the second file. This is similar to an intersection on two files except that duplicates and additional content assigned to junctions in the first file are retained
- symmetric_difference =
- Produces an output set containing junctions from both input files that are not present in the intersection of both
In addition, these test modes also only support 2 input files and return either True or False depending on the test requested:
- is_subset = Returns True if all junctions in the first file are present in the second
- is_superset = Returns True if all junctions in the second file are present in the first
- is_disjoint = Returns True if there is a null intersection between both files
Usage:
usage: Apply set operations to two or more junction files.
[-h] [-m MIN_ENTRY] [--operator OPERATOR] [-o OUTPUT] [-p PREFIX] [-is]
mode input [input ...]
positional arguments:
mode Set operation to apply. Available options:
- intersection
- union
- consensus
- subtract
- symmetric_difference
- is_subset
- is_superset
- is_disjoint
input List of junction files to merge (must all be the same file format)
optional arguments:
-h, --help show this help message and exit
-m MIN_ENTRY, --min_entry MIN_ENTRY
Minimum number of files the entry is require to be in. 0 means entry must be
present in all files, i.e. true intersection. 1 means a union of all input files
--operator OPERATOR Operator to use for calculating the score in the merged file.
This option is only applicable to 'intersection', 'union' and 'consensus' modes.
Available values:
- min
- max
- sum
- mean
-o OUTPUT, --output OUTPUT
Output junction file. Required for operations that produce an output file.
-p PREFIX, --prefix PREFIX
Prefix to apply to name column in BED output file
-is, --ignore_strand Whether or not to ignore strand when creating a key for the junction
6.7. Split¶
This tool splits portcullis pass and fail juncs into 4 sets (TP, TN, FP, FN) based on whether or not the junctions are found in the reference. The pass and fail files passed into this tool should be disjoint in order to get meaningful results.
Usage:
usage: Splits portcullis pass and fail juncs into 4 sets (TP, TN, FP, FN) based on whether or not the junctions are found in the reference or not.
[-h] [-o OUTPUT_PREFIX] reference passfile failfile
positional arguments:
reference The reference junction file
passfile The junction file containing junctions that pass a
filter
failfile The junction file containing junctions failing a
filter
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix for output files