IgSimulator 1.0 manual
1. What is IgSimulator?
2. Installation
2.1. Verifying your installation
3. IgSimulator
3.1. Basic options
3.2. Ig genes options
3.3. Advanced options
3.4. Examples
3.5. Output files
4. Antibody repertoire representation
4.1. CLUSTERS.FASTA file format
4.2. RCM file format
5. Feedback and bug reports
1. What is IgSimulator?
IgSimulator is a tool for simulating antibody repertoires and Ig-Seq libraries.
It is designed for testing and benchmarking tools that reconstruct Ig repertoires.
2. Installation
IgSimulator requires the following pre-installed dependencies:
- 64-bit Linux system
- g++ (version 4.7 or higher)
- Python (version 2.7 or higher)
- Additional Python modules
To install IgSimulator, type:
make
2.1. Verifying your installation
For testing purposes, IgSimulator comes with a toy data set.
► To try IgSimulator on the test data set, run:
./ig_simulator.py --test
If the installation is successful, you will find the following information at the end of the log:
======== IgSimulator ends
Main output files:
* Sequences of simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/repertoire.fasta
* Simulated merged reads were written to <igtools_installation_directory>/ig_simulator_test/merged_reads.fastq
* CLUSTERS.FA for simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/ideal_repertoire.clusters.fa
* RCM for simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/ideal_repertoire.rcm
Thank you for using IgSimulator!
Log was written to <igtools_installation_directory>/ig_simulator_test/ig_repertoire_simulation.log
3. IgSimulator
The IgSimulator tool takes simulation parameters as input and constructs a reference repertoire (heavy chain by default), the corresponding Illumina library, and the ideal repertoire.
Command line:
./ig_simulator.py [options] --chain-type TYPE --num-bases N1 --num-mutated N2 --repertoire-size N3 -o <output_dir>
3.1. Basic options:
-o <output_dir>
output directory (required).
--num-bases <int>
number of base sequences (required).
--num-mutated <int>
expected number of mutated sequences (required).
--repertoire-size <int>
expected reference repertoire size (required).
--chain-type HC or LC
type of chain to use for repertoire simulation: HC (heavy chain) or LC (light chain). Default value is 'HC'.
--test
runs the toy test data set (see Section 2.1). The test run is equivalent to the following command line:
./ig_simulator.py --num-bases 10 --num-mutated 50 --repertoire-size 1000 -o ig_repertoire_simulator_test
3.2. Ig genes options:
--vgenes <filename>
FASTA file with Ig germline V genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHV.fa for heavy chain repertoire and <igtools_installation_directory>/data/human_ig_germline_genes/human_IGKV.fa for light chain repertoire.
--dgenes <filename>
FASTA file with Ig germline D genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHD.fa for heavy chain repertoire.
--jgenes <filename>
FASTA file with Ig germline J genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHJ.fa for heavy chain repertoire and <igtools_installation_directory>/data/human_ig_germline_genes/human_IGKJ.fa for light chain repertoire.
--db-type imgt or reg
Type of database. The default value is 'imgt', which means that headers of the FASTA files with V, D, and J genes follow the IMGT format, for example:
>M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+0=296| | |
CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCC
CTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACAT
GGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
In this case, the gene segment name given after the first '|' symbol (IGHV1-18*01) is used in output files containing V(D)J recombination (see Section 3.5 for more details).
If your database is not in IMGT format, please specify the 'reg' value for this option. In this case, the entire header is used as the gene segment name.
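The header convention described above can be sketched in Python. This is only an illustration of the rule, not IgSimulator's own code: the function name gene_segment_name and its db_type argument are hypothetical and simply mirror the two --db-type values. For 'reg', the sketch assumes that the whole header line (without the leading '>') is the gene segment name.

```python
def gene_segment_name(header, db_type="imgt"):
    """Return the gene segment name for one FASTA header line.

    Hypothetical helper: 'imgt' takes the field after the first '|',
    'reg' takes the whole header text (assumption for illustration).
    """
    text = header.lstrip(">").strip()
    if db_type == "imgt":
        fields = text.split("|")
        if len(fields) < 2:
            raise ValueError("header has no '|' separator: %r" % header)
        return fields[1]
    return text


imgt_header = (">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|"
               "296 nt|1| | | | |296+0=296| | |")
print(gene_segment_name(imgt_header))                    # IGHV1-18*01
print(gene_segment_name(">my_v_gene_1", db_type="reg"))  # my_v_gene_1
```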
3.3. Advanced options:
--skip-drawing
skips visualization of statistics for merged reads. Default value is false.
--help
prints help.
3.4. Examples
To simulate a heavy chain data set with 100 base sequences, ~500 mutated sequences, and ~1500 sequences in the final repertoire, together with the corresponding simulated Illumina library, run the following command:
./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 -o ig_simulator_test
If you also want to specify paths to V/D/J germline gene files instead of using the default IMGT database, run:
./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 \
--vgenes <path_to_your_vgenes.fasta> --dgenes <path_to_your_dgenes.fasta> --jgenes <path_to_your_jgenes.fasta> -o ig_simulator_test
3.5. Output files
The IgSimulator tool creates a working directory (whose name is specified using option -o) and writes the following files there:
- Files with sequences
- final_repertoire.fasta - FASTA file with simulated antibody repertoire that will be used as reference for Illumina library simulation.
- paired_reads1.fq - FASTQ file with left reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.
- paired_reads2.fq - FASTQ file with right reads constructed using ART read simulator. Reads correspond to simulated Illumina MiSeq library.
- merged_reads.fastq - FASTQ file constructed by merging the left and right read files. This file is expected to be the input for the IgRepertoireConstruction tool.
- reads_vdj_recombination.txt contains information about V(D)J recombination for each read from merged_reads.fastq file. Example of reads_vdj_recombination.txt file is given below:
34_merged_read_antibody_20_multiplicity_1_copy_1-1/1 IGHV3-48*02;IGHD3-16*02;IGHJ6*02
53_merged_read_antibody_34_multiplicity_1_copy_1-1/1 IGHV4-28*02;IGHD3-10*01;IGHJ6*03
59_merged_read_antibody_37_multiplicity_2_copy_2-1/1 IGHV4-28*02;IGHD3-10*01;IGHJ6*03
8_merged_read_antibody_9_multiplicity_4_copy_1-1/1 IGHV3-66*01;IGHD3-22*01;IGHJ6*03
Files with statistics of the simulated repertoire:
- base_sequences.fasta contains sequences of base repertoire.
- base_frequencies.txt contains frequencies of base sequences.
- mutated_sequences.fasta contains sequences of mutated repertoire.
- mutated_frequencies.txt contains frequencies of mutated sequences.
- shm_positions.txt contains information about all introduced somatic hypermutations.
Each line of this file corresponds to one mutation and includes two tab-separated fields: 'mutation position' and 'sequence length'.
- repertoire_vdj_recombination.txt contains information about V(D)J recombination for each constructed antibody. Example of repertoire_vdj_recombination.txt file is given below:
antibody_1 IGHV3-13*01;IGHD3-3*02;IGHJ4*02
antibody_2 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_3 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03
antibody_4 IGHV3-13*02;IGHD3-10*01;IGHJ5*01
Visualization of the statistics for the simulated repertoire:
- base_seq_lens.png - PNG file with a histogram of the base sequence length distribution.
If the number of base sequences (controlled by option --num-bases) is large enough, the distribution is expected to be normal.
This file is created based on statistics from base_repertoire.stats.
- base_seq_freqs.png - PNG file with a histogram of the base sequence frequency distribution.
This file is created based on statistics from base_multiplicities.txt.
- mutated_seq_freqs.png - PNG file with a histogram of the mutated sequence frequency distribution in the final repertoire.
This file is created based on statistics from mutated_multiplicities.txt.
- shm_positions.png - PNG file with a histogram of the distribution of relative positions of somatic hypermutations.
This file is created based on statistics from shm_positions.txt.
- paired_reads1.aln and paired_reads2.aln show alignments of paired-end reads to the reference repertoire.
These files are generated by the ART read simulator.
Files describing the ideal repertoire (see details in Section 4):
- ideal_repertoire.clusters.fasta - CLUSTERS.FASTA file corresponding to the ideal clusters for merged_reads.fastq.
- ideal_repertoire.rcm - RCM file corresponding to the ideal clusters for merged_reads.fastq. This file can be used as the ideal read-cluster map in the IgQUAST tool.
ig_simulator.log - full log of the IgSimulator run.
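The two V(D)J recombination files listed above (reads_vdj_recombination.txt and repertoire_vdj_recombination.txt) share the same line layout, so they can be summarized with a few lines of Python. The sketch below is an illustration only: the helper names parse_vdj_line and count_v_genes are hypothetical, and it assumes that the name and the gene list are separated by whitespace and that the V gene is always the first entry in the ';'-separated list, as in the examples shown.

```python
from collections import Counter


def parse_vdj_line(line):
    """Split one line into (name, [gene segments]).

    Assumes the name contains no whitespace and the gene segments
    are joined by ';' (as in the examples in this section).
    """
    name, segments = line.split(None, 1)
    return name, segments.strip().split(";")


def count_v_genes(lines):
    """Count how often each V gene (first segment) is used."""
    counts = Counter()
    for line in lines:
        if line.strip():
            _, segments = parse_vdj_line(line)
            counts[segments[0]] += 1
    return counts


example = [
    "antibody_1 IGHV3-13*01;IGHD3-3*02;IGHJ4*02",
    "antibody_2 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03",
    "antibody_3 IGHV4-30-4*02;IGHD4-17*01;IGHJ4*03",
    "antibody_4 IGHV3-13*02;IGHD3-10*01;IGHJ5*01",
]
for gene, count in sorted(count_v_genes(example).items()):
    print(gene, count)
```

To process a real file, the same function can be given an open file object, for example count_v_genes(open("repertoire_vdj_recombination.txt")).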
4. Antibody repertoire representation
We use two file formats to represent the repertoire for a set of reads: CLUSTERS.FASTA and RCM.
4.1. CLUSTERS.FASTA file format
CLUSTERS.FASTA is a FASTA file in which each sequence corresponds to a monoclonal antibody, and the header of each sequence contains the id and size of the corresponding cluster (the set of input reads related to the same monoclonal antibody):
>cluster___1___size___3
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
>cluster___2___size___2
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
>cluster___3___size___1
CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
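As an illustration of this header layout, a CLUSTERS.FASTA file can be read with a short Python generator. This is a sketch under the assumption that every header has exactly the form cluster___<id>___size___<size> shown above; the function names parse_cluster_header and read_clusters are hypothetical and not part of IgSimulator.

```python
def parse_cluster_header(header):
    """Return (cluster_id, size) from a 'cluster___<id>___size___<n>' header."""
    parts = header.lstrip(">").strip().split("___")
    if len(parts) != 4 or parts[0] != "cluster" or parts[2] != "size":
        raise ValueError("unexpected CLUSTERS.FASTA header: %r" % header)
    return int(parts[1]), int(parts[3])


def read_clusters(lines):
    """Yield (cluster_id, size, sequence) for each record.

    Sequences that span several lines are joined into one string.
    """
    cluster = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if cluster is not None:
                yield cluster[0], cluster[1], "".join(chunks)
            cluster = parse_cluster_header(line)
            chunks = []
        else:
            chunks.append(line)
    if cluster is not None:
        yield cluster[0], cluster[1], "".join(chunks)


example = [
    ">cluster___1___size___3",
    "CCCCTGCAATTAAAATTG",
    ">cluster___2___size___2",
    "CCCCTGCAATT",
]
for cluster_id, size, sequence in read_clusters(example):
    print(cluster_id, size, len(sequence))
```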
4.2. RCM file format
Every line of an RCM (read-cluster map) file contains a read name and the corresponding cluster id:
MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1
MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1
MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1
MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2
MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2
MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3
NOTE: ids in CLUSTERS.FASTA and RCM files should be consistent.
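One plausible way to check this consistency is to count the reads assigned to each cluster id in the RCM file and compare the counts with the sizes given in the CLUSTERS.FASTA headers. The Python sketch below assumes exactly that reading of "consistent"; the function names read_rcm and inconsistent_clusters, and the cluster_sizes dictionary (cluster id mapped to size), are hypothetical.

```python
from collections import Counter


def read_rcm(lines):
    """Return {read_name: cluster_id} from RCM lines.

    The cluster id is taken as the last whitespace-separated field,
    so read names containing spaces are still handled.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        read_name, cluster_id = line.rsplit(None, 1)
        mapping[read_name] = int(cluster_id)
    return mapping


def inconsistent_clusters(rcm, cluster_sizes):
    """Return sorted cluster ids whose RCM read count differs from the size."""
    observed = Counter(rcm.values())
    ids = set(observed) | set(cluster_sizes)
    return sorted(i for i in ids
                  if observed.get(i, 0) != cluster_sizes.get(i, 0))


rcm_lines = [
    "MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14374:28884 1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14393:28886 1",
    "MISEQ@:53:000000000-A2BMW:1:2114:16454:28882 2",
    "MISEQ@:53:000000000-A2BMW:1:2114:16426:28886 2",
    "MISEQ@:53:000000000-A2BMW:1:2114:15812:28886 3",
]
# Sizes as given in the CLUSTERS.FASTA example of Section 4.1.
cluster_sizes = {1: 3, 2: 2, 3: 1}
print(inconsistent_clusters(read_rcm(rcm_lines), cluster_sizes))
```

An empty list means that every cluster id appears in both files with matching read counts.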
5. Feedback and bug reports
Your comments, bug reports, and suggestions are very welcome.
They will help us to further improve IgSimulator.
If you have any trouble running IgSimulator, please send us the log file from the output directory.
Address for communications: igtools_support@googlegroups.com.