IgQUAST manual

1. What is IgQUAST?
    1.1. Input
    1.2. Output
2. Installation
3. IgQUAST usage
    3.1. Input options
    3.2. Output options
    3.3. Performed scenarios
    3.4. Miscellaneous options
    3.5. Examples
    3.6. Output files
4. Citations
5. Feedback and bug reports

1. What is IgQUAST?

IgQUAST (ImmunoGlobulin QUality ASsessment Tool) is a tool for quality assessment of adaptive immune repertoires. It can be used to benchmark adaptive immune repertoire construction tools and to estimate the quality of constructed repertoires. IgQUAST performs reference-based and reference-free analysis.

During reference-based analysis the tool compares two input repertoires: the reference repertoire and the constructed repertoire. The analysis is separated into two scenarios:
- Repertoire-to-repertoire matching only uses repertoire sequences. The tool aligns each of two repertoires against the other one and computes sensitivity and precision metrics, detects error positions in erroneously constructed sequences, and compares reference and constructed abundances for ideally reconstructed sequences.
- Partition-based analysis only uses partitions induced by the RCMs (read-to-cluster maps). The tool compares two partitions and computes partition similarity metrics (such as the Rand index). It also computes cluster quality measures (such as purity and discordance) and plots their distributions for both input repertoires.

Reference-free analysis is performed on the constructed repertoire. The tool detects overestimated clusters in the repertoire using an amplification-free Poisson model. It also estimates the error rate and error profile of the initial read library. The same analysis is performed on the reference repertoire if it is provided. Reference-free analysis always requires both the repertoire sequences and the read-to-cluster map (RCM).
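
To make the partition-based metrics more concrete, here is a minimal Python 3 sketch that computes the Rand index between two partitions, and the purity and discordance of each constructed cluster. It is an illustration, not IgQUAST's implementation. It assumes each RCM has already been loaded into a dict mapping read id to cluster id; the function names and the toy data are hypothetical. Purity is taken in its usual sense (share of the most popular cluster from the other repertoire), and discordance is the share of the second most popular one, as described in the output files section of this manual.

```python
from collections import Counter, defaultdict
from itertools import combinations


def rand_index(part_a, part_b):
    """Fraction of read pairs on which two partitions agree."""
    reads = sorted(set(part_a) & set(part_b))  # only reads present in both maps
    pairs = list(combinations(reads, 2))       # quadratic: fine for toy data only
    if not pairs:
        return 1.0
    agree = sum(
        (part_a[x] == part_a[y]) == (part_b[x] == part_b[y]) for x, y in pairs
    )
    return agree / len(pairs)


def purity_and_discordance(clusters, other):
    """Map each cluster of `clusters` to (purity, discordance) measured against `other`."""
    votes = defaultdict(Counter)
    for read, cluster in clusters.items():
        if read in other:
            votes[cluster][other[read]] += 1
    result = {}
    for cluster, counter in votes.items():
        total = sum(counter.values())
        top = counter.most_common(2)
        first = top[0][1]
        second = top[1][1] if len(top) > 1 else 0
        result[cluster] = (first / total, second / total)
    return result


# Toy read-to-cluster maps (hypothetical data)
constructed = {"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2"}
reference = {"r1": "a", "r2": "a", "r3": "b", "r4": "b"}

print(rand_index(constructed, reference))
print(purity_and_discordance(constructed, reference))
```

In the toy data, constructed cluster c1 merges reads from two reference clusters, so its purity drops below 1 and its discordance rises above 0, which is the signature of overcorrection mentioned in this manual.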

1.1. Input

IgQUAST takes as input:

Initial Rep-seq read library;
Analyzed adaptive immune repertoire constructed on this library;
For reference-based analysis, the reference repertoire constructed on the same library.

The initial Rep-seq library should be in FASTA or FASTQ format. Reads should be properly cropped (each read should start at the beginning of the V gene and end at the end of the J gene), reads obtained from the negative strand should be reversed, and contaminating reads should be filtered out. Cropping, strand correction, and contamination filtering may be performed by the VJFinder tool from the IgRepertoireConstructor package (see the IgRepertoireConstructor manual for a description of the VJFinder input and output formats).
The analyzed constructed and reference repertoires should each be represented by two files: repertoire sequences in FASTA format and a read-to-cluster map (RCM) file in a special format. See the IgRepertoireConstructor manual for a comprehensive description of the repertoire format. If you have only one of these files, the tool can reconstruct the other from the available information (use the --reconstruct option).
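
As an illustration of what a read-to-cluster map carries, the Python 3 sketch below loads an RCM and derives cluster sizes from it. It assumes one tab-separated pair of read id and cluster id per line, with an empty cluster id for unassigned reads. This layout is an assumption made for the example, so consult the IgRepertoireConstructor manual for the authoritative format. The `load_rcm` and `cluster_sizes` helpers are hypothetical and not part of IgQUAST.

```python
import gzip
from collections import Counter


def load_rcm(path):
    """Read an RCM file (optionally gzipped) into a dict: read id -> cluster id.

    Assumed layout (not confirmed by this manual): "<read id>\t<cluster id>" per line.
    """
    opener = gzip.open if path.endswith(".gz") else open
    rcm = {}
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            read_id, _, cluster_id = line.partition("\t")
            if cluster_id:  # skip reads that were not assigned to any cluster
                rcm[read_id] = cluster_id
    return rcm


def cluster_sizes(rcm):
    """Number of reads assigned to each cluster."""
    return Counter(rcm.values())
```

A map loaded this way can be passed to `cluster_sizes` to get the abundance of every cluster.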

1.2. Output

IgQUAST reports:

Plots for visual analysis;
Metrics report.

Plots are reported in PNG, PDF, and SVG formats. Metrics are reported in text (brief) and JSON (full) formats.

2. Installation

IgQUAST has the following dependencies:

64-bit Linux system
g++ (version 4.7 or higher)
cmake (version 2.8.8 or higher)
Python 2 (version 2.7 or higher), including:
- BioPython
- Matplotlib
- NumPy
- SciPy
- pandas
- Seaborn

IgQUAST is part of the IgRepertoireConstructor package. See the IgRepertoireConstructor manual for installation instructions.

Please verify your IgQUAST installation before the first run:


    ./igquast.py --test

If the installation succeeded, you will find the following information at the end of the log:


  Thank you for using IgQUAST!
  Log was written to igquast_test/igquast.log

3. IgQUAST usage

To run IgQUAST, type:

    
    ./igquast.py [options] -s <initial reads> -c <constructed repertoire FASTA> -C <constructed repertoire RCM> -r <reference repertoire FASTA> -R <reference repertoire RCM> -o <output dir for plots>
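
When IgQUAST is launched from a pipeline, it can be convenient to assemble this command line programmatically. The Python 3 sketch below builds the argument list using only the options documented in this manual. The `build_igquast_command` and `run_igquast` helpers and the file names are illustrative assumptions, and the reference files are treated as optional because they are needed only for reference-based analysis.

```python
import subprocess


def build_igquast_command(reads, constructed_fa, constructed_rcm, output_dir,
                          reference_fa=None, reference_rcm=None,
                          script="./igquast.py"):
    """Assemble an IgQUAST argument list; reference files are optional."""
    cmd = [script, "-s", reads, "-c", constructed_fa, "-C", constructed_rcm]
    if reference_fa is not None:
        cmd += ["-r", reference_fa]
    if reference_rcm is not None:
        cmd += ["-R", reference_rcm]
    cmd += ["-o", output_dir]
    return cmd


def run_igquast(cmd):
    """Launch IgQUAST; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(cmd)


# Hypothetical file names, for illustration only
cmd = build_igquast_command("reads.fa.gz", "constructed.fa.gz", "constructed.rcm",
                            "igquast_out",
                            reference_fa="reference.fa.gz",
                            reference_rcm="reference.rcm")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.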

3.1. Input options

-c / --constructed-repertoire <constructed repertoire FASTA>
FASTA file with constructed repertoire sequences. Can be gzipped.

-C / --constructed-rcm <constructed repertoire RCM>
RCM file with the constructed repertoire read-to-cluster map. Can be gzipped.

-r / --reference-repertoire <reference repertoire FASTA>
FASTA file with reference repertoire sequences. Can be gzipped.

-R / --reference-rcm <reference repertoire RCM>
RCM file with the reference repertoire read-to-cluster map. Can be gzipped.

-s / --initial-reads <initial reads>
Initial Rep-seq reads in FASTA or FASTQ format. Can be gzipped.

--reconstruct | --no-reconstruct
Whether to reconstruct missing repertoire files when possible. Disabled by default.

3.2. Output options

-o / --output-dir <output dir>
Output directory (required).

--text <text report file>
File for text report output. Default: <output dir>/igquast.txt.

--json <JSON report file>
File for JSON report output. Default: <output dir>/igquast.json.

-F / --figure-format <figure format(s)>
Figure format(s) for plots. Allowed values are png, pdf, and svg. Several values can be passed, separated by commas. An empty string disables plotting. The default value is png,pdf,svg.
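
The semantics of the --figure-format value can be sketched in a few lines of Python 3. This is not IgQUAST's actual parsing code. The `parse_figure_formats` helper is hypothetical and only mirrors the rules stated above: a comma-separated list, three allowed values, and an empty string meaning no plots.

```python
ALLOWED_FORMATS = ("png", "pdf", "svg")


def parse_figure_formats(value):
    """Split a comma-separated --figure-format value; an empty string yields no formats."""
    formats = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in formats if item not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError("unsupported figure format(s): " + ", ".join(unknown))
    return formats


print(parse_figure_formats("png,pdf,svg"))
print(parse_figure_formats(""))
```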

3.3. Performed scenarios

--repertoire-to-repertoire-matching | --no-repertoire-to-repertoire-matching
Whether to perform repertoire-to-repertoire matching. Enabled by default.

--partition-based | --no-partition-based
Enable/disable partition-based metrics and plots. Enabled by default.

--reference-free | --no-reference-free
Enable/disable reference-free metrics and plots. Disabled by default.

--export-bad-clusters | --no-export-bad-clusters
Whether to export untrustworthy clusters during reference-free analysis. Disabled by default.

3.4. Algorithm parameters

--reference-size-cutoff <positive integer>
Cutoff for reference cluster size. Smaller reference clusters are discarded during repertoire-to-repertoire comparison. Default value is 5.
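
The effect of the cutoff can be illustrated with a simplified Python 3 sketch. It is not the tool's algorithm: IgQUAST aligns the repertoires against each other, whereas this example uses exact sequence matching as a stand-in, and the way discarded reference clusters affect precision here is an assumption. Repertoires are represented as dicts mapping a sequence to its cluster size, and all names and data are hypothetical.

```python
def filter_by_size(repertoire, cutoff):
    """Keep clusters whose size is at least `cutoff` (sequence -> cluster size)."""
    return {seq: size for seq, size in repertoire.items() if size >= cutoff}


def sensitivity_precision(constructed, reference, reference_size_cutoff=5,
                          constructed_size_threshold=1):
    """Exact-match sensitivity and precision after size filtering."""
    ref = filter_by_size(reference, reference_size_cutoff)
    con = filter_by_size(constructed, constructed_size_threshold)
    matched = set(ref) & set(con)
    sensitivity = len(matched) / len(ref) if ref else 0.0
    precision = len(matched) / len(con) if con else 0.0
    return sensitivity, precision


# Toy repertoires (hypothetical data)
reference = {"ACGT": 10, "ACGA": 6, "TTTT": 2}
constructed = {"ACGT": 9, "GGGG": 1, "TTTT": 3}

print(sensitivity_precision(constructed, reference))
print(sensitivity_precision(constructed, reference, constructed_size_threshold=3))
```

With the default cutoff of 5, the reference cluster of size 2 is discarded before matching, so it does not count against sensitivity.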

3.4. Miscellaneous options

--test
Run IgQUAST on the toy test dataset. The test run is equivalent to the following command line:

    
    ./igquast.py -s igquast_test_dataset/test/input_reads.fa.gz -c igquast_test_dataset/igrec/final_repertoire.fa.gz -C igquast_test_dataset/igrec/final_repertoire.rcm -r igquast_test_dataset/test/repertoire.fa.gz -R igquast_test_dataset/test/repertoire.rcm -o igquast_test_test

-h / --help
Show help and exit.

3.5. Examples

Perform reference-free analysis only:

    
    ./igquast.py -s igquast_test_dataset/test/input_reads.fa.gz -c igquast_test_dataset/test/igrec_bad/final_repertoire.fa.gz -C igquast_test_dataset/test/igrec_bad/final_repertoire.rcm -o igquast_test --reference-free

Do not plot figures, make reports only:

    
    ./igquast.py -s igquast_test_dataset/test/input_reads.fa.gz -c igquast_test_dataset/test/igrec/final_repertoire.fa.gz -C igquast_test_dataset/test/igrec/final_repertoire.rcm -r igquast_test_dataset/test/repertoire.fa.gz -R igquast_test_dataset/test/repertoire.rcm --figure-format= -o igquast_test

3.6. Output files

IgQUAST creates the output directory (its name is specified with the -o option) and writes the following files there:

reference_based — Directory with reference-based plots:
- error_position_distribution — distribution of error positions for constructed repertoire sequences reconstructed with only one error. This plot helps to detect sequencing technology artifacts and repertoire construction strategy artifacts
- sensitivity_precision — sensitivity and precision depending on cluster size threshold for the constructed repertoire
- distance_distribution — constructed-to-reference and reference-to-constructed distance distributions depending on the cluster size threshold for the constructed repertoire (8 plots in one figure)
- {constructed_to_reference,reference_to_constructed}_distance_distribution_size_{1,3,5,10} — the same 8 plots separately
- abundance_distributions — cluster size distribution for the constructed and reference repertoires
- abundance_distributions_log — the same plot with logarithmic Y-scale
- cluster_abundances_scatterplot — scatterplot of constructed cluster sizes against reference cluster sizes for ideally reconstructed clusters
- constructed_purity_distribution — distribution of cluster purity in the constructed repertoire. This plot helps to detect overcorrection
- constructed_purity_distribution_large — the same plot for large clusters only
- reference_discordance_distribution — distribution of constructed cluster discordance (relative contribution of the second most popular reference cluster to the particular constructed cluster). This plot helps to detect overcorrection
- reference_discordance_distribution_large — the same plot for large clusters only
- reference_purity_distribution — distribution of cluster purity in the reference repertoire. This plot helps to detect undercorrection
- reference_purity_distribution_large — the same plot for large clusters only
- constructed_discordance_distribution — distribution of reference cluster discordance (relative contribution of the second most popular constructed cluster to the particular reference cluster). This plot helps to detect undercorrection
- constructed_discordance_distribution_large — the same plot for large clusters only
reference_free — Directory with reference-free plots:
- constructed_cluster_error_profile — error profile estimation for entire constructed repertoire
- constructed_cluster_error_profile_largest_{1,2,3,4,5} — error profile estimation for 5 largest constructed clusters separately
- constructed_cluster_discordance_profile — discordance (second vote letter) profile for entire constructed repertoire
- constructed_cluster_discordance_profile_largest_{1,2,3,4,5} — discordance profile for 5 largest constructed clusters separately
- constructed_distribution_of_errors_in_reads — distribution of the number of errors in reads for the constructed repertoire clusters
- constructed_max_error_scatter — scatterplot of max errors by position against cluster size for the constructed repertoire. This plot helps to detect overcorrection
- reference_* — the same plots for the reference repertoire
- bad_constructed_clusters — Directory with clusters from the constructed repertoire, which are identified as overcorrected during reference-free analysis
- bad_reference_clusters — Directory with clusters from the reference repertoire, which are identified as overcorrected during reference-free analysis
constructed.rcm, reference.rcm, constructed.fa.gz, reference.fa.gz — Reconstructed missing repertoire files
igquast.log — full log of the IgQUAST run

Files for the reports (in text and JSON formats) are specified by the corresponding options. Some files may be absent depending on the provided input. Note that reference-free analysis is disabled by default because it is very time- and memory-consuming. Use the --reference-free option to enable it.
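
For downstream processing, the JSON report can be loaded with a few lines of Python 3. The sketch below is a minimal example: it relies only on the default report location stated in this manual (<output dir>/igquast.json) and makes no assumption about the report's internal structure, which is not described here. The `load_report` and `top_level_sections` helpers are hypothetical.

```python
import json
import os


def load_report(output_dir, json_path=None):
    """Load the JSON report; defaults to <output dir>/igquast.json."""
    if json_path is None:
        json_path = os.path.join(output_dir, "igquast.json")
    with open(json_path) as handle:
        return json.load(handle)


def top_level_sections(report):
    """Sorted top-level keys of the report, or an empty list if it is not an object."""
    return sorted(report) if isinstance(report, dict) else []
```

Listing the top-level keys first is a simple way to discover which analyses were actually run, since some sections may be missing depending on the provided input.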

4. Citations

Alexander Shlemov, Sergey Bankevich, Andrey Bzikadze, Dmitriy M. Chudakov, Yana Safonova, and Pavel A. Pevzner. Reconstructing antibody repertoires from error-prone immunosequencing datasets (submitted)

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve IgQUAST.

If you have any trouble running IgQUAST, please send us the log file from the output directory.

Address for communications: igtools_support@googlegroups.com.