IgRepertoireConstructor 2.1 manual
1. What is IgRepertoireConstructor?
1.1 About IgReC
1.2 About MassSpectraAnalyzer
2. Installation
2.1. Verifying your installation
3. IgReC usage
3.1. Basic options
3.2. Advanced options
3.3. Examples
3.4. Output files
4. MassSpectraAnalyzer usage
4.1. Basic options
4.2. Advanced options
4.3. Examples
4.4. Output files
5. Examples
6. Antibody repertoire representation
6.1. CLUSTERS.FASTA file format
6.2. RCM file format
6.3. Alignment Info file format
7. Feedback and bug reports
7.1. Citation
1. What is IgRepertoireConstructor?
IgRepertoireConstructor is a tool for construction of antibody repertoire and immunoproteogenomics analysis.
IgRepertoireConstructor pipeline consists of two parts:
- IgReC — a tool for construction of full-length antibody repertoire from Illumina Ig-Seq library.
- MassSpectraAnalyzer — a tool for analysis of matching mass spectra against constructed repertoire.
1.1 About IgReC
The IgReC pipeline is summarized below:
Input:
IgReC takes paired-end or single Illumina reads as input.
Please note that IgRepertoireConstructor constructs a full-length repertoire and
expects the input reads to cover the variable region of the antibody.
Output:
IgReC corrects sequencing and amplification errors and joins together reads corresponding to identical antibodies.
Thus, the constructed repertoire is a set of antibody clusters, each characterized by its
sequence and multiplicity.
IgReC provides user with the following information about constructed repertoire:
- antibody clusters: antibody sequences with multiplicities,
- read-cluster map: information showing how reads form antibody clusters,
- highly abundant antibody clusters,
- super reads: groups of identical input reads with high coverage.
Stages:
IgReC pipeline consists of the following steps:
- VJ Finder: cleaning input reads using alignment against Ig germline genes
- HG Constructor: construction of Hamming graph on cleaned reads
- Dense Subgraph Finder: finding dense subgraphs (or corrupted cliques) in the constructed Hamming graph.
This allows us to decompose cleaned reads into highly similar groups.
DSF algorithm is also available as a separate tool.
Please find DSF manual here.
- Antibody Constructor: construction of antibody clusters based on graph decomposition found at the previous stage.
You can find details of IgReC algorithm in our paper.
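The HG Constructor stage can be illustrated with a minimal Python sketch. This is not IgReC's implementation: it is a naive quadratic pairwise comparison that assumes all reads have equal length, and the names `hamming_distance` and `build_hamming_graph` are hypothetical. Two reads are joined by an edge when they differ in at most `tau` positions.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_hamming_graph(reads, tau):
    """Return adjacency sets; reads i and j are adjacent if distance <= tau."""
    graph = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if hamming_distance(reads[i], reads[j]) <= tau:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Toy reads (illustrative only, not real antibody sequences).
reads = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTTTTTTT"]
graph = build_hamming_graph(reads, tau=2)
print(graph)
```

In this toy input the first three reads form a dense group, while the last read stays isolated. Groups of this kind are what the Dense Subgraph Finder stage is described as extracting.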
1.2 About MassSpectraAnalyzer
Some immunological experiments include preparation of both sequencing reads and mass spectra (see examples in Cheung et al, 2012, Nature; Safonova, Bonissone et al, Bioinformatics, 2015).
In this case, mass spectra datasets can be used for validation of repertoire constructed from sequencing reads.
Repertoire constructed by IgReC can be used as a database for
identification of mass spectra using some standard tool, e.g., MS-GF+.
MassSpectraAnalyzer takes as an input mzIdentML file and
computes similarity between constructed repertoire and mass spectra.
2. Installation
IgRepertoireConstructor has the following dependencies:
- 64-bit Linux system
- g++ (version 4.7 or higher)
- cmake (version 2.8.8 or higher)
- Python 2 (version 2.7 or higher), including:
To install IgRepertoireConstructor, type:
make
2.1. Verifying your installation
For testing purposes, IgReC and MassSpectraAnalyzer come with toy data sets.
► To try IgReC on the test data set, run:
./igrec.py --test
If the installation of IgReC is successful, you will find the following information at the end of the log:
Thank you for using IgReC!
Log was written to igrec_test/ig_repertoire_constructor.log
► To try MassSpectraAnalyzer on test data set, run:
./mass_spectra_analyzer.py --test
If the installation of MassSpectraAnalyzer is successful, you will find the following information at the end of the log:
Spectra processed: example_HC_chymo_CID.mzid.spectra, example_HC_trypsin_CID.mzid.spectra
Metrics written to <your_installation_directory>/ms_analyzer_test/metrics.txt
Covered CRDs written to <your_installation_directory>/ms_analyzer_test/covered_cdrs.txt
PSM on IG regions written to <your_installation_directory>/ms_analyzer_test/psm_on_ig_regions.txt
Figures and statistics saved in <your_installation_directory>/ms_analyzer_test
3. IgReC usage
IgReC takes as input Illumina reads covering the variable region of the antibody and constructs a repertoire
in CLUSTERS.FASTA and RCM format.
To run IgReC, type:
./igrec.py [options] -s <single_reads.fastq> -o <output_dir>
OR
./igrec.py [options] -1 <left_reads.fastq> -2 <right_reads.fastq> -o <output_dir>
3.1. Basic options:
-s <single_reads.fastq>
FASTQ file with single Illumina reads (required unless -1 and -2 are given).
-1 <left_reads.fastq> -2 <right_reads.fastq>
FASTQ files with paired-end Illumina reads (required unless -s is given).
-o / --output <output_dir>
Output directory (required).
-t / --threads <int>
The number of parallel threads. The default value is 16.
--test
Running on the toy test dataset. Command line corresponding to the test run is equivalent to the following:
./igrec.py -s test_dataset/merged_reads.fastq -l all -o igrec_test
--help
Printing help.
3.2. Advanced options:
--loci / -l <str>
Immunological loci to align input reads and discard reads with low score (required).
Available values are IGH / IGL / IGK / IG (for all BCRs) / TRA / TRB / TRG / TRD / TR (for all TCRs) or all.
--no-pseudogenes
Do not use pseudogenes along with normal gene segments for VJ alignment.
By default, IgReC uses pseudogenes for aligning reads.
--organism <str>
Organism. Available values are human, mouse, pig, rabbit, rat and rhesus_monkey.
Default value is human.
--tau <int>
Maximum allowed number of mismatches between two reads corresponding to identical antibodies. The default (and recommended) value is 4.
Higher values give higher sensitivity of the error correction algorithm but increase running time.
A reasonable value of tau lies between 1 and 6.
-n / --min-sread-size <int>
Minimum size of super read. A super read is a group of identical input reads with high coverage.
IgReC assumes that super reads represent error-free clusters and does not glue them together.
If the input data set was highly amplified, we recommend increasing the value of this option.
Default value is 5.
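The notion of a super read can be sketched as follows. This is an illustration rather than IgReC's code: the function name `find_super_reads` is hypothetical, and treating the threshold as inclusive (a group of exactly `min_sread_size` reads qualifies) is an assumption.

```python
from collections import Counter


def find_super_reads(reads, min_sread_size=5):
    """Group identical reads and keep groups with at least min_sread_size members."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_sread_size}


# Toy input: one sequence seen 6 times, one seen twice, one seen once.
reads = ["ACGT"] * 6 + ["ACGA"] * 2 + ["TTTT"]
super_reads = find_super_reads(reads, min_sread_size=5)
print(super_reads)
```

Raising the threshold for highly amplified libraries, as recommended above, simply makes this filter stricter.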
--min-cluster-size <int>
Minimal size of antibody cluster used for output of large clusters.
Default value is 5.
3.3. Examples
To construct an antibody repertoire from single reads reads.fastq, type:
./igrec.py -s reads.fastq -o output_dir
3.4. Output files
IgReC creates a working directory (whose name is specified using option -o)
and outputs the following files there:
- Final repertoire files:
- final_repertoire.fa - CLUSTERS.FASTA file for all antibody clusters of the constructed repertoire
(details in Antibody repertoire representation).
- final_repertoire_large.fa - CLUSTERS.FASTA file for highly abundant antibody clusters of the constructed repertoire
(minimal cluster size is defined by option --min-cluster-size).
- final_repertoire.rcm - RCM file for the constructed repertoire
(details in Antibody repertoire representation).
- super_reads.fa - FASTA file containing super reads, i.e., large groups of identical input reads.
Minimal size of super read is defined by option --min-sread-size.
- VJ finder output:
- vj_finder/cleaned_reads.fa - FASTA file with cleaned reads constructed at the VJ Finder stage.
Cleaned reads have forward direction (from V to J),
contain V and J gene segments and are cropped by the left bound of V gene segment.
- vj_finder/filtered_reads.fa - FASTA file with filtered reads.
Filtered reads have bad alignment to Ig germline gene segments and are likely to represent contamination.
- vj_finder/alignment_info.csv - CSV file containing information about alignment of cleaned reads to
V and J gene segments.
Details of alignment_info.csv format are given in Alignment Info file format.
- igrec.log - full log of IgReC run.
4. MassSpectraAnalyzer usage
MassSpectraAnalyzer takes as an input result of matching of mass spectra against the constructed repertoire in
mzIdentML 1.1 format (e.g., generated by MS-GF+) and computes multiple statistics showing coverage of the constructed repertoire by mass spectra.
► To run MassSpectraAnalyzer, type:
./mass_spectra_analyzer.py [options] -o <output_dir> input_file_1.mzid ... input_file_N.mzid
4.1. Basic options:
input_file_1.mzid ... input_file_N.mzid
Input files with mass spectra alignment to protein database in mzIdentML 1.1 format.
-o <output_dir>
Output directory (required).
--test
Running on the toy test data set.
--help, -h
Printing help.
4.2. Advanced options:
--regions <filename>
File with information about framework and CDRs for protein sequences from used database in IgBLAST format.
Example of file with labeled regions is given below:
Query= Antibody_sequence_1
CDR2-IMGT 51 58
FR2-IMGT 34 50
FR1-IMGT 1 25
FR3-IMGT 59 94
CDR1-IMGT 26 33
CDR3-IMGT 95 110
Query= Antibody_sequence_2
CDR2-IMGT 51 58
FR2-IMGT 34 50
FR1-IMGT 1 25
FR3-IMGT 59 94
CDR1-IMGT 26 33
CDR3-IMGT 95 111
where Antibody_sequence_1 and Antibody_sequence_2 are sequences from database.
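A regions file with the layout shown above can be read with a short Python sketch. It handles only the simplified layout of this example (a `Query=` line followed by `<region> <start> <end>` lines). Full IgBLAST output contains additional lines that this sketch does not handle, and the function name `parse_regions` is hypothetical.

```python
import io


def parse_regions(handle):
    """Map each query name to {region_name: (start, end)}."""
    regions, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("Query="):
            current = line[len("Query="):].strip()
            regions[current] = {}
        elif current is not None:
            # Expected layout: <region name> <start> <end>
            name, start, end = line.split()
            regions[current][name] = (int(start), int(end))
    return regions


example = io.StringIO(
    "Query= Antibody_sequence_1\n"
    "CDR2-IMGT 51 58\n"
    "FR1-IMGT 1 25\n"
    "CDR3-IMGT 95 110\n"
)
regions = parse_regions(example)
print(regions["Antibody_sequence_1"]["CDR3-IMGT"])
```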
4.3. Examples:
► To compute statistics for both chymo and trypsin mass spectra datasets and labeled regions, run the following command:
./mass_spectra_analyzer.py --output output_dir --regions regions.align example_HC_chymo_CID.mzid example_HC_trypsin_CID.mzid
4.4. Output files:
- Statistics:
- metrics.txt - file with basic statistics for each of given mass spectrum alignments.
- covered_cdrs.txt - file with information about number of sequences with at least one peptide spectrum match on corresponding region.
- psm_on_ig_regions.txt - file with information about number of peptide spectrum matches aligned to corresponding regions of sequences.
- Statistics visualization:
- PSM_cov.png - PNG file with plot showing coverage by peptide spectrum matches along antibody sequence.
- peptide_cov.png - PNG file with plot showing coverage by peptide spectrum matches along antibody sequence, considering only scans with unique alignment to database.
- PSM_per_scan.png - PNG file with histogram of distribution of number of peptide spectrum matches per scan.
- peptide_length.png - PNG file with histogram of distribution of peptide length.
5. Examples
This example shows the IgRepertoireConstructor pipeline in action for a merged paired-end Illumina MiSeq library reads.fastq and mass spectra AspN_CID.mzXML corresponding to the same antibody repertoire.
► To run IgReC with standard settings, type the following command:
./igrec.py -s reads.fastq -o repertoire_constructing
Sequences of the constructed repertoire are located in repertoire_constructing/final_repertoire.fa. They can be converted into amino acid sequences and used as a database for matching mass spectra AspN_CID.mzXML (e.g., using the MS-GF+ tool). Let the result of the MS-GF+ tool be a file AspN_CID.mzId.
► To run MassSpectraAnalyzer on AspN_CID.mzId file, type the following command:
./mass_spectra_analyzer.py -o ms_analysis AspN_CID.mzId
Statistics for mass spectra alignment can be found in ms_analysis directory.
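The conversion of repertoire sequences into amino acid sequences mentioned in this example is not performed by the commands shown. A minimal Python sketch of such a translation is given below. It uses the standard genetic code; the function name `translate` and the choice of reading frame are assumptions, and the correct frame for a given repertoire has to be determined separately.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2).

    Incomplete trailing codons are dropped; codons with unknown bases give 'X'.
    """
    seq = seq.upper()
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTGA"))
```

Stop codons are reported as `*`, so a sequence translated in the wrong frame is often recognizable by premature stops.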
6. Antibody repertoire representation
We use two files to represent the repertoire for a set of clustered reads: CLUSTERS.FASTA and RCM.
6.1. CLUSTERS.FASTA file format
CLUSTERS.FASTA is a FASTA file, where sequences correspond to the assembled antibodies.
Each header contains information about corresponding antibody cluster (id and size):
>cluster___1___size___3
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
>cluster___2___size___2
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
>cluster___3___size___1
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
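The header layout above can be parsed with a short Python sketch. The function names `parse_clusters_fasta` and `_make_record` are hypothetical, and the example sequences are shortened for readability; only the `cluster___<id>___size___<size>` header layout is taken from this manual.

```python
import io


def parse_clusters_fasta(handle):
    """Yield (cluster_id, size, sequence) from a CLUSTERS.FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # Header layout: cluster___<id>___size___<size>
    fields = header.split("___")
    if len(fields) != 4 or fields[0] != "cluster" or fields[2] != "size":
        raise ValueError("unexpected header: " + header)
    return fields[1], int(fields[3]), "".join(chunks)


example = io.StringIO(
    ">cluster___1___size___3\nCCCCTGCA\n"
    ">cluster___2___size___2\nCCCCTG\nCAAT\n"
)
records = list(parse_clusters_fasta(example))
print(records)
```

Filtering `records` by the size field gives the same kind of subset as the large-cluster output file, for any threshold.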
6.2. RCM file format
Every line of RCM (read-cluster map) file contains information about read name and corresponding cluster ID:
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1
MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2
MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2
MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3
The repertoire described in the example above consists of three antibodies. E.g., the antibody with ID 1 has abundance 3, since it was constructed from three reads:
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886
NOTE: IDs in CLUSTERS.FASTA and RCM files are consistent.
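Cluster sizes can be recomputed from an RCM file with a sketch like the one below. The function name `read_rcm` is hypothetical. Splitting on the last run of whitespace is an assumption that works whether the two columns are separated by a space or a tab.

```python
from collections import defaultdict
import io


def read_rcm(handle):
    """Map each cluster ID to the list of read names assigned to it."""
    clusters = defaultdict(list)
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # The cluster ID is the last whitespace-separated field.
        read_name, cluster_id = line.rsplit(None, 1)
        clusters[cluster_id].append(read_name)
    return clusters


example = io.StringIO(
    "MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1\n"
    "MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1\n"
    "MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1\n"
    "MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2\n"
    "MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2\n"
    "MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3\n"
)
clusters = read_rcm(example)
sizes = {cluster_id: len(names) for cluster_id, names in clusters.items()}
print(sizes)
```

The resulting sizes match the `size` fields in the CLUSTERS.FASTA headers of the example in section 6.1.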
6.3. Alignment Info file format
File alignment_info.csv contains the following information about the closest V and J gene segments
in tab-separated form.
Read ids are consistent with headers in file cleaned_reads.fa.
Ids of V and J gene segments are taken from IMGT database.
Read id | V start | V end | V score (% identity) | V id | J start | J end | J score (% identity) | J id
read1 | 1 | 296 | 100.0 | IGHV3-20*01 | 321 | 366 | 89.0 | IGHJ5*02
read2 | 1 | 294 | 98.64 | IGHV3-9*01 | 309 | 354 | 100.0 | IGHJ2*01
... | ... | ... | ... | ... | ... | ... | ... | ...
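The alignment info columns can be loaded with Python's `csv` module, as in the sketch below. The field names in `FIELDS` and the function name `read_alignment_info` are hypothetical labels for the nine columns listed above. Whether the real file starts with a header row is not stated in this manual, so the `has_header` switch is an assumption.

```python
import csv
import io

# Hypothetical names for the nine columns described above.
FIELDS = ["read_id", "v_start", "v_end", "v_score", "v_id",
          "j_start", "j_end", "j_score", "j_id"]


def read_alignment_info(handle, has_header=False):
    """Read tab-separated alignment info rows into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    rows = []
    for row in reader:
        if not row:
            continue
        record = dict(zip(FIELDS, row))
        for key in ("v_start", "v_end", "j_start", "j_end"):
            record[key] = int(record[key])
        for key in ("v_score", "j_score"):
            record[key] = float(record[key])
        rows.append(record)
    return rows


example = io.StringIO(
    "read1\t1\t296\t100.0\tIGHV3-20*01\t321\t366\t89.0\tIGHJ5*02\n"
    "read2\t1\t294\t98.64\tIGHV3-9*01\t309\t354\t100.0\tIGHJ2*01\n"
)
rows = read_alignment_info(example)
print(rows[0]["v_id"], rows[1]["j_id"])
```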
7. Feedback and bug reports
Your comments, bug reports, and suggestions are very welcome.
They will help us to further improve IgRepertoireConstructor.
If you have any trouble running IgRepertoireConstructor, please send us the log file from the output directory.
Address for communications: igtools_support@googlegroups.com.
7.1. Citation
If you use IgRepertoireConstructor in your research, please cite
Safonova et al., 2015.