IgRepertoireConstructor 2.1 manual

1. What is IgRepertoireConstructor?
    1.1. About IgReC
    1.2. About MassSpectraAnalyzer
2. Installation
    2.1. Verifying your installation
3. IgReC usage
    3.1. Basic options
    3.2. Advanced options
    3.3. Examples
    3.4. Output files
4. MassSpectraAnalyzer usage
    4.1. Basic options
    4.2. Advanced options
    4.3. Examples
    4.4. Output files
5. Examples
6. Antibody repertoire representation
    6.1. CLUSTERS.FASTA file format
    6.2. RCM file format
    6.3. Alignment Info file format
7. Feedback and bug reports
    7.1. Citation

1. What is IgRepertoireConstructor?

IgRepertoireConstructor is a tool for construction of antibody repertoire and immunoproteogenomics analysis. IgRepertoireConstructor pipeline consists of two parts:

IgReC — a tool for construction of full-length antibody repertoire from Illumina Ig-Seq library.
MassSpectraAnalyzer — a tool for analysis of matching mass spectra against constructed repertoire.

1.1. About IgReC

The IgReC pipeline is shown below:

[Figure: IgReC pipeline]

Input:

IgReC takes as an input paired-end or single Illumina reads. Please note that IgRepertoireConstructor constructs full-length repertoire and expects that input reads cover variable region of antibody.

Output:

IgReC corrects sequencing and amplification errors and joins together reads corresponding to identical antibodies. Thus, the constructed repertoire is a set of antibody clusters, each characterized by its sequence and multiplicity. IgReC provides the user with the following information about the constructed repertoire:

antibody clusters: antibody sequences with multiplicities,
read-cluster map: information showing how reads form antibody clusters,
highly abundant antibody clusters,
super reads: groups of identical input reads with high coverage.

Stages:

IgReC pipeline consists of the following steps:

VJ Finder: cleaning input reads using alignment against Ig germline genes
HG Constructor: construction of Hamming graph on cleaned reads
Dense Subgraph Finder: finding dense subgraphs (or corrupted cliques) in the constructed Hamming graph. This allows us to decompose cleaned reads into groups of highly similar reads. The DSF algorithm is also available as a separate tool; see the DSF manual for details.
Antibody Constructor: construction of antibody clusters based on graph decomposition found at the previous stage.
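
The idea behind the HG Constructor stage can be illustrated with a minimal Python sketch. It is not the IgReC implementation: it compares all pairs of reads naively, compares only reads of equal length (an assumption made for simplicity), and groups reads by connected components as a crude stand-in for the dense subgraph search. The function names are hypothetical.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_hamming_graph(reads, tau=4):
    """Link reads i and j when they differ in at most tau positions.

    Assumption: reads of different length are never linked.
    """
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if len(reads[i]) != len(reads[j]):
            continue
        if hamming_distance(reads[i], reads[j]) <= tau:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph):
    """Group vertices into connected components (simplified decomposition)."""
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


reads = ["ACGTACGT", "ACGTACGA", "TTTTTTTT"]
graph = build_hamming_graph(reads, tau=1)
groups = connected_components(graph)
print(groups)
```

The quadratic pairwise comparison is only practical for small inputs; it is used here to keep the sketch short.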

You can find details of IgReC algorithm in our paper.

1.2. About MassSpectraAnalyzer

Some immunological experiments include preparation of both sequencing reads and mass spectra (see examples in Cheung et al, 2012, Nature; Safonova, Bonissone et al, Bioinformatics, 2015). In this case, mass spectra datasets can be used for validation of the repertoire constructed from sequencing reads. The repertoire constructed by IgReC can be used as a database for identification of mass spectra using a standard tool, e.g., MS-GF+. MassSpectraAnalyzer takes an mzIdentML file as input and computes similarity between the constructed repertoire and the mass spectra.

[Figure: MassSpectraAnalyzer pipeline]

2. Installation

IgRepertoireConstructor has the following dependencies:

64-bit Linux system
g++ (version 4.7 or higher)
cmake (version 2.8.8 or higher)
Python 2 (version 2.7 or higher), including:

To install IgRepertoireConstructor, type:

    
    make

2.1. Verifying your installation

For testing purposes, IgReC and MassSpectraAnalyzer come with toy data sets.

► To try IgReC on the test data set, run:


    ./igrec.py --test

If the installation of IgReC is successful, you will find the following information at the end of the log:

    
    Thank you for using IgReC!
    Log was written to igrec_test/ig_repertoire_constructor.log

► To try MassSpectraAnalyzer on test data set, run:


    ./mass_spectra_analyzer.py --test

If the installation of MassSpectraAnalyzer is successful, you will find the following information at the end of the log:

    
    Spectra processed: example_HC_chymo_CID.mzid.spectra, example_HC_trypsin_CID.mzid.spectra
    Metrics written to <your_installation_directory>/ms_analyzer_test/metrics.txt
    Covered CRDs written to <your_installation_directory>/ms_analyzer_test/covered_cdrs.txt
    PSM on IG regions written to <your_installation_directory>/ms_analyzer_test/psm_on_ig_regions.txt
    Figures and statistics saved in <your_installation_directory>/ms_analyzer_test

3. IgReC usage

IgReC takes as input Illumina reads covering the variable region of an antibody and constructs a repertoire in CLUSTERS.FASTA and RCM formats.

To run IgReC, type:

    
    ./igrec.py [options] -s <single_reads.fastq> -o <output_dir>

    
    ./igrec.py [options] -1 <left_reads.fastq> -2 <right_reads.fastq> -o <output_dir>

3.1. Basic options:

-s <single_reads.fastq>
FASTQ file with single Illumina reads (required unless paired-end reads are given with -1 and -2).

-1 <left_reads.fastq> -2 <right_reads.fastq>
FASTQ files with paired-end Illumina reads (required unless single reads are given with -s).

-o / --output <output_dir>
Output directory (required).

-t / --threads <int>
The number of parallel threads. The default value is 16.

--test
Running on the toy test dataset. Command line corresponding to the test run is equivalent to the following:

    
    ./igrec.py -s test_dataset/merged_reads.fastq -l all -o igrec_test

--help
Printing help.

3.2. Advanced options:

--loci / -l <str>
Immunological loci against which input reads are aligned; reads with a low alignment score are discarded (required).
Available values are IGH / IGL / IGK / IG (for all BCRs) / TRA / TRB / TRG / TRD / TR (for all TCRs) or all.

--no-pseudogenes
Do not use pseudogenes along with normal gene segments for VJ alignment. By default, IgReC uses pseudogenes for aligning reads.

--organism <str>
Organism. Available values are human, mouse, pig, rabbit, rat and rhesus_monkey. Default value is human.

--tau <int>
Maximum allowed number of mismatches between two reads corresponding to identical antibodies. The default (and recommended) value is 4. Higher values increase the sensitivity of the error correction algorithm but also increase running time. Reasonable values of tau lie between 1 and 6.

-n / --min-sread-size <int>
Minimum size of a super read. A super read is a group of identical input reads with high coverage. IgReC assumes that super reads represent error-free clusters and does not glue them together. If the input data set was highly amplified, we recommend increasing the value of this option. The default value is 5.

--min-cluster-size <int>
Minimal size of antibody cluster used for output of large clusters. Default value is 5.
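
The notion of a super read used by --min-sread-size can be sketched in a few lines of Python. This is an illustration under the definition given above (a group of identical input reads of at least the given size), not the code IgReC runs; the function name find_super_reads is hypothetical.

```python
from collections import Counter


def find_super_reads(reads, min_sread_size=5):
    """Return {sequence: multiplicity} for groups of identical reads
    whose size is at least min_sread_size."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_sread_size}


reads = ["ACGT"] * 5 + ["ACGA"] * 2 + ["TTTT"]
super_reads = find_super_reads(reads, min_sread_size=5)
print(super_reads)
```

Raising min_sread_size makes the filter stricter, which matches the recommendation for highly amplified data sets.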

3.3. Examples

To construct antibody repertoire from single reads reads.fastq, type:

    
    ./igrec.py -s reads.fastq -o output_dir
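
When IgReC is launched from a script, the command line can be assembled programmatically. The helper below is a hypothetical sketch that uses only the options documented in this section (-s, -1, -2, -l, -t, -o); it builds the argument list and does not run anything itself.

```python
def build_igrec_command(output_dir, single_reads=None, left_reads=None,
                        right_reads=None, threads=None, loci=None,
                        igrec_path="./igrec.py"):
    """Build an argument list for igrec.py from documented options only."""
    paired = left_reads is not None or right_reads is not None
    if single_reads is not None and paired:
        raise ValueError("give either single reads or paired-end reads, not both")
    if single_reads is None and (left_reads is None or right_reads is None):
        raise ValueError("give -s, or both -1 and -2")

    command = [igrec_path]
    if single_reads is not None:
        command += ["-s", single_reads]
    else:
        command += ["-1", left_reads, "-2", right_reads]
    if loci is not None:
        command += ["-l", loci]
    if threads is not None:
        command += ["-t", str(threads)]
    command += ["-o", output_dir]
    return command


command = build_igrec_command("output_dir", single_reads="reads.fastq")
print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call(command) from the IgReC installation directory.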

3.4. Output files

IgReC creates a working directory (whose name is specified using option -o) and writes the following files there:

Final repertoire files:

final_repertoire.fa — CLUSTERS.FASTA file for all antibody clusters of the constructed repertoire (details in Antibody repertoire representation).
final_repertoire_large.fa — CLUSTERS.FASTA file for highly abundant antibody clusters of the constructed repertoire (minimal cluster size is defined by option --min-cluster-size).
final_repertoire.rcm — RCM file for the constructed repertoire (details in Antibody repertoire representation).
super_reads.fa — FASTA file containing super reads, i.e., large groups of identical input reads. Minimal size of a super read is defined by option --min-sread-size.

VJ finder output:

vj_finder/cleaned_reads.fa — FASTA file with cleaned reads constructed at the VJ Finder stage. Cleaned reads have forward direction (from V to J), contain V and J gene segments and are cropped by the left bound of V gene segment.
vj_finder/filtered_reads.fa — FASTA file with filtered reads. Filtered reads align poorly to Ig germline gene segments and are likely to represent contamination.
vj_finder/alignment_info.csv — CSV file containing information about alignment of cleaned reads to V and J gene segments. Details of alignment_info.csv format are given in Alignment Info file format.

igrec.log — full log of IgReC run.

4. MassSpectraAnalyzer usage

MassSpectraAnalyzer takes as an input result of matching of mass spectra against the constructed repertoire in mzIdentML 1.1 format (e.g., generated by MS-GF+) and computes multiple statistics showing coverage of the constructed repertoire by mass spectra.

► To run MassSpectraAnalyzer, type:


    ./mass_spectra_analyzer.py [options] -o <output_dir> input_file_1.mzid ... input_file_N.mzid

4.1. Basic options:

input_file_1.mzid ... input_file_N.mzid
Input files with mass spectra alignment to protein database in mzIdentML 1.1 format.

-o <output_dir>
output directory (required).

--test
Running on the toy test data set.

--help, -h
Printing help.

4.2. Advanced options:

--regions <filename>
File with information about framework regions and CDRs for protein sequences from the database used, in IgBLAST format. An example of a file with labeled regions is given below:

    
    Query= Antibody_sequence_1
    CDR2-IMGT       51      58
    FR2-IMGT        34      50
    FR1-IMGT        1       25
    FR3-IMGT        59      94
    CDR1-IMGT       26      33
    CDR3-IMGT       95      110
    Query= Antibody_sequence_2
    CDR2-IMGT       51      58
    FR2-IMGT        34      50
    FR1-IMGT        1       25
    FR3-IMGT        59      94
    CDR1-IMGT       26      33
    CDR3-IMGT       95      111

where Antibody_sequence_1 and Antibody_sequence_2 are sequences from database.
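
A regions file of the shape shown above can be read with a short Python sketch. The parser below assumes exactly the layout of the example (a "Query=" line followed by lines of region name, start and end separated by whitespace); it is an illustration, not the parser used by MassSpectraAnalyzer, and the function names are hypothetical.

```python
def parse_regions(lines):
    """Return {sequence_name: {region_name: (start, end)}}."""
    regions = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("Query="):
            current = line[len("Query="):].strip()
            regions[current] = {}
        elif current is not None:
            name, start, end = line.split()
            regions[current][name] = (int(start), int(end))
    return regions


def ordered_regions(sequence_regions):
    """Region names sorted by start position along the sequence."""
    return sorted(sequence_regions, key=lambda name: sequence_regions[name][0])


example = """
Query= Antibody_sequence_1
CDR2-IMGT       51      58
FR2-IMGT        34      50
FR1-IMGT        1       25
FR3-IMGT        59      94
CDR1-IMGT       26      33
CDR3-IMGT       95      110
""".splitlines()

regions = parse_regions(example)
print(ordered_regions(regions["Antibody_sequence_1"]))
```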

4.3. Examples:

► To compute statistics for both chymo and trypsin mass spectra datasets and labeled regions, run the following command:

    
    ./mass_spectra_analyzer.py -o output_dir --regions regions.align example_HC_chymo_CID.mzid example_HC_trypsin_CID.mzid

4.4. Output files:

Statistics:

metrics.txt - file with basic statistics for each of given mass spectrum alignments.
covered_cdrs.txt - file with information about number of sequences with at least one peptide spectrum match on corresponding region.
psm_on_ig_regions.txt - file with information about number of peptide spectrum matches aligned to corresponding regions of sequences.

Statistics visualization:

PSM_cov.png - PNG file with plot showing coverage by peptide spectrum matches along antibody sequence.
peptide_cov.png - PNG file with plot showing coverage by peptide spectrum matches along antibody sequence, considering only scans with unique alignment to the database.
PSM_per_scan.png - PNG file with histogram of distribution of number of peptide spectrum matches per scan.
peptide_length.png - PNG file with histogram of distribution of peptide length.

5. Examples

This example shows the IgRepertoireConstructor pipeline in action on a merged paired-end Illumina MiSeq library (reads.fastq) and mass spectra (AspN_CID.mzXML) corresponding to the same antibody repertoire.

► To run IgReC with standard settings, type the following command:

    
    ./igrec.py -s reads.fastq -o repertoire_constructing

Sequences of the constructed repertoire are located in repertoire_constructing/final_repertoire.fa. They can be converted into amino acid sequences and used as a database for matching mass spectra AspN_CID.mzXML (e.g., using the MS-GF+ tool). Suppose that MS-GF+ writes its result to the file AspN_CID.mzId.
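
The conversion of nucleotide sequences into amino acid sequences can be sketched as follows. This is a minimal illustration using the standard genetic code; the choice of reading frame (the frame parameter) is an assumption that must be checked for your data, and the function name translate is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides, frame=0):
    """Translate a nucleotide string starting at offset `frame`.

    Stop codons become '*', codons with unknown bases become 'X',
    and an incomplete trailing codon is ignored.
    """
    seq = nucleotides.upper()
    protein = []
    for pos in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTGA"))
```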

► To run MassSpectraAnalyzer on AspN_CID.mzId file, type the following command:

    
    ./mass_spectra_analyzer.py -o ms_analysis AspN_CID.mzId

Statistics for mass spectra alignment can be found in ms_analysis directory.

6. Antibody repertoire representation

We use two files to represent the repertoire constructed from a set of clustered reads: CLUSTERS.FASTA and RCM.

6.1. CLUSTERS.FASTA file format

CLUSTERS.FASTA is a FASTA file, where sequences correspond to the assembled antibodies. Each header contains information about corresponding antibody cluster (id and size):

    
    >cluster___1___size___3
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
    >cluster___2___size___2
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
    >cluster___3___size___1
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
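
Headers of this form are easy to parse. The sketch below assumes exactly the header layout shown above (fields separated by three underscores) and keeps cluster IDs as strings; the function names are hypothetical.

```python
def parse_cluster_header(header):
    """Split '>cluster___<id>___size___<n>' into (id, size)."""
    fields = header.strip().lstrip(">").split("___")
    if len(fields) != 4 or fields[0] != "cluster" or fields[2] != "size":
        raise ValueError("unexpected CLUSTERS.FASTA header: %r" % header)
    return fields[1], int(fields[3])


def read_clusters(lines):
    """Return a list of (cluster_id, size, sequence) tuples."""
    clusters = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                clusters.append(header + ("".join(chunks),))
            header, chunks = parse_cluster_header(line), []
        elif header is not None:
            # Sequences may span several lines; join them.
            chunks.append(line)
    if header is not None:
        clusters.append(header + ("".join(chunks),))
    return clusters


example = [
    ">cluster___1___size___3",
    "CCCCTGCAATTAAAATTG",
    "TTGACCACC",
    ">cluster___2___size___2",
    "CCCCTGCAATT",
]
clusters = read_clusters(example)
print(clusters)
```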

6.2. RCM file format

Every line of RCM (read-cluster map) file contains information about read name and corresponding cluster ID:

    
    MISEQ@:53:000000000-A2BMW:1:2114:14345:28882    1
    MISEQ@:53:000000000-A2BMW:1:2114:14374:28884    1
    MISEQ@:53:000000000-A2BMW:1:2114:14393:28886    1
    MISEQ@:53:000000000-A2BMW:1:2114:16454:28882    2
    MISEQ@:53:000000000-A2BMW:1:2114:16426:28886    2
    MISEQ@:53:000000000-A2BMW:1:2114:15812:28886    3

The repertoire described in the example above consists of three antibodies. E.g., the antibody with ID 1 has abundance 3, since it was constructed from three reads:
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886

NOTE: IDs in CLUSTERS.FASTA and RCM files are consistent.
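
Because the IDs are consistent, cluster sizes can be recomputed from the RCM file and compared with the sizes in CLUSTERS.FASTA headers. The sketch below assumes that every line holds a read name and a cluster ID separated by whitespace, as in the example; the function names are hypothetical.

```python
from collections import Counter


def read_rcm(lines):
    """Return {read_name: cluster_id} from RCM lines."""
    read_to_cluster = {}
    for line in lines:
        if not line.strip():
            continue
        # The cluster ID is the last whitespace-separated field.
        read_name, cluster_id = line.rsplit(None, 1)
        read_to_cluster[read_name] = cluster_id
    return read_to_cluster


def cluster_sizes(read_to_cluster):
    """Count how many reads were assigned to each cluster."""
    return Counter(read_to_cluster.values())


example = [
    "MISEQ@:53:000000000-A2BMW:1:2114:14345:28882    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14374:28884    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14393:28886    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:16454:28882    2",
    "MISEQ@:53:000000000-A2BMW:1:2114:16426:28886    2",
    "MISEQ@:53:000000000-A2BMW:1:2114:15812:28886    3",
]
sizes = cluster_sizes(read_rcm(example))
print(dict(sizes))
```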

6.3. Alignment Info file format

File alignment_info.csv contains the following information about the closest V and J gene segments in tab-separated format.
Read ids are consistent with headers in file cleaned_reads.fa.
Ids of V and J gene segments are taken from IMGT database.

Read id	V start	V end	V score (% identity)	V id	J start	J end	J score (% identity)	J id
read1	1	296	100.0	IGHV3-20*01	321	366	89.0	IGHJ5*02
read2	1	294	98.64	IGHV3-9*01	309	354	100.0	IGHJ2*01
...	...	...	...	...	...	...	...	...
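
A file of this shape can be read with the Python csv module. The sketch below assumes that the file starts with the header row shown above and that fields are separated by tabs; whether the real file carries this header row is an assumption, and the function name is hypothetical.

```python
import csv
import io


def read_alignment_info(handle):
    """Return a list of dicts with read id, V/J gene ids and scores."""
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        records.append({
            "read_id": row["Read id"],
            "v_id": row["V id"],
            "v_score": float(row["V score (% identity)"]),
            "j_id": row["J id"],
            "j_score": float(row["J score (% identity)"]),
        })
    return records


# In-memory copy of the example rows above.
example = io.StringIO(
    u"Read id\tV start\tV end\tV score (% identity)\tV id\t"
    u"J start\tJ end\tJ score (% identity)\tJ id\n"
    u"read1\t1\t296\t100.0\tIGHV3-20*01\t321\t366\t89.0\tIGHJ5*02\n"
    u"read2\t1\t294\t98.64\tIGHV3-9*01\t309\t354\t100.0\tIGHJ2*01\n"
)
records = read_alignment_info(example)
print([r["v_id"] for r in records])
```

For a file on disk, pass an open file handle for vj_finder/alignment_info.csv instead of the in-memory example.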

7. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve IgRepertoireConstructor.

If you have any trouble running IgRepertoireConstructor, please send us the log file from the output directory.

Address for communications: igtools_support@googlegroups.com.

7.1. Citation

If you use IgRepertoireConstructor in your research, please refer to Safonova et al., 2015.