IgSimulator 1.0 manual

1. What is IgSimulator?
2. Installation
    2.1. Verifying your installation
3. IgSimulator
    3.1. Basic options
    3.2. Ig genes options
    3.3. Advanced options
    3.4. Examples
    3.5. Output files
4. Antibody repertoire representation
    4.1. CLUSTERS.FASTA file format
    4.2. RCM file format
5. Feedback and bug reports

1. What is IgSimulator?

IgSimulator is a tool for simulating antibody repertoires and Ig-Seq libraries. It is designed for testing and benchmarking tools that reconstruct Ig repertoires.

2. Installation

IgSimulator requires the following pre-installed dependencies:

To install IgSimulator, type:
    
    make
    

2.1. Verifying your installation

For testing purposes, IgSimulator comes with a toy data set.

► To try IgSimulator on the test data set, run:

    ./ig_simulator.py --test

If the installation is successful, you will find the following information at the end of the log:
    
    ======== IgSimulator ends

    Main output files:
    * Sequences of simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/repertoire.fasta
    * Simulated merged reads were written to <igtools_installation_directory>/ig_simulator_test/merged_reads.fastq
    * CLUSTERS.FA for simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/ideal_repertoire.clusters.fa
    * RCM for simulated repertoire were written to <igtools_installation_directory>/ig_simulator_test/ideal_repertoire.rcm

    Thank you for using IgSimulator!

    Log was written to <igtools_installation_directory>/ig_simulator_test/ig_repertoire_simulation.log
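
The presence of the files named in the log above can also be checked from a script. The following is a minimal sketch, not part of IgSimulator: the helper name missing_outputs is hypothetical, and the file names are copied from the log excerpt, so they are an assumption about what a successful run leaves in the output directory.

```python
import os
import tempfile

# File names copied from the log excerpt above (assumed, not verified
# against the IgSimulator source code).
MAIN_OUTPUT_FILES = (
    "repertoire.fasta",
    "merged_reads.fastq",
    "ideal_repertoire.clusters.fa",
    "ideal_repertoire.rcm",
    "ig_repertoire_simulation.log",
)


def missing_outputs(output_dir):
    """Return the expected file names that are absent from output_dir."""
    return [
        name
        for name in MAIN_OUTPUT_FILES
        if not os.path.isfile(os.path.join(output_dir, name))
    ]


# Demonstration on a temporary directory that holds only one of the files.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "repertoire.fasta"), "w").close()
    print(missing_outputs(tmp_dir))
```

For a real run, pass the output directory of the test run instead of the temporary directory; an empty list means that all listed files were found.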

    

3. IgSimulator

IgSimulator takes simulation parameters as input and constructs a reference repertoire (heavy chain by default), the corresponding Illumina library, and the ideal repertoire.

Command line:

    ./ig_simulator.py [options] --chain-type TYPE --num-bases N1 --num-mutated N2 --repertoire-size N3 -o <output_dir>

3.1. Basic options:

-o <output_dir>
output directory (required).

--num-bases <int>
number of base sequences (required).

--num-mutated <int>
expected number of mutated sequences (required).

--repertoire-size <int>
expected reference repertoire size (required).

--chain-type HC or LC
type of chain used for repertoire simulation: HC (heavy chain) or LC (light chain). Default value is 'HC'.

--test
runs the toy test data set (see Section 2.1). The test run is equivalent to the following command line:
    
    ./ig_simulator.py --num-bases 10 --num-mutated 50 --repertoire-size 1000 -o ig_repertoire_simulator_test 
    

3.2. Ig genes options:

--vgenes <filename>
FASTA file with Ig germline V genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHV.fa for heavy chain repertoire and <igtools_installation_directory>/data/human_ig_germline_genes/human_IGKV.fa for light chain repertoire.

--dgenes <filename>
FASTA file with Ig germline D genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHD.fa for heavy chain repertoire.

--jgenes <filename>
FASTA file with Ig germline J genes. Default value is <igtools_installation_directory>/data/human_ig_germline_genes/human_IGHJ.fa for heavy chain repertoire and <igtools_installation_directory>/data/human_ig_germline_genes/human_IGKJ.fa for light chain repertoire.

--db-type imgt or reg
Type of database. The default value, 'imgt', means that the headers of the FASTA files with V, D, and J genes follow the IMGT format, for example:
    
    >M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+0=296| | |
    CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCC
    CTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACAT
    GGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
    
In this case, the gene segment name given after the first '|' symbol (IGHV1-18*01) will be used in output files describing V(D)J recombination (see Section 3.5, Output files, for more details).
If your database is not in IMGT format, specify the 'reg' value for this option. In this case, the entire header of each sequence will be used as its gene segment name.
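
The naming rule for the two database types can be illustrated with a short Python sketch. This is not IgSimulator's own code: the function gene_segment_name is hypothetical, and it assumes that in 'reg' mode the name is the header text without the leading '>' character.

```python
def gene_segment_name(header, db_type="imgt"):
    """Return the gene segment name implied by a FASTA header line."""
    text = header.lstrip(">").strip()
    if db_type == "imgt":
        # IMGT headers are '|'-separated; the name is the second field,
        # i.e. the text after the first '|' symbol.
        fields = text.split("|")
        if len(fields) < 2:
            raise ValueError("header is not in IMGT format: " + header)
        return fields[1].strip()
    if db_type == "reg":
        # Assumption: the whole header (without '>') serves as the name.
        return text
    raise ValueError("db_type must be 'imgt' or 'reg'")


imgt_header = (
    ">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1"
    "| | | | |296+0=296| | |"
)
print(gene_segment_name(imgt_header))            # IGHV1-18*01
print(gene_segment_name(">my_v_gene_1", "reg"))  # my_v_gene_1
```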

3.3. Advanced options:

--skip-drawing
skips visualization of statistics for merged reads. Default value is false.

--help
prints help.

3.4. Examples

To simulate a heavy chain data set with 100 base sequences, ~500 mutated sequences, and ~1500 sequences in the final repertoire, together with the corresponding simulated Illumina library, run the following command:
    
    ./ig_simulator.py  --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 -o ig_simulator_test 
    
To additionally specify paths to V/D/J germline genes instead of using the default IMGT database (see Section 3.2), run:
    
    ./ig_simulator.py --chain-type HC --num-bases 100 --num-mutated 500 --repertoire-size 1500 \
        --vgenes <path_to_your_vgenes.fasta> --dgenes <path_to_your_dgenes.fasta> --jgenes <path_to_your_jgenes.fasta> -o ig_simulator_test
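
When many simulations are launched from a script, the command line can be assembled programmatically. The sketch below is an illustration only: the helper build_command is hypothetical, and it uses the option names documented in Sections 3.1 and 3.2 (--vgenes, --dgenes, --jgenes). It builds the argument list without running anything.

```python
import shlex


def build_command(chain_type, num_bases, num_mutated, repertoire_size,
                  output_dir, vgenes=None, dgenes=None, jgenes=None):
    """Build an argument list for ig_simulator.py from simulation parameters."""
    if chain_type not in ("HC", "LC"):
        raise ValueError("chain_type must be 'HC' or 'LC'")
    cmd = [
        "./ig_simulator.py",
        "--chain-type", chain_type,
        "--num-bases", str(num_bases),
        "--num-mutated", str(num_mutated),
        "--repertoire-size", str(repertoire_size),
    ]
    # Germline gene files are optional; omitted ones fall back to defaults.
    for flag, path in (("--vgenes", vgenes),
                       ("--dgenes", dgenes),
                       ("--jgenes", jgenes)):
        if path is not None:
            cmd += [flag, path]
    cmd += ["-o", output_dir]
    return cmd


cmd = build_command("HC", 100, 500, 1500, "ig_simulator_test")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run from the directory that contains ig_simulator.py.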
    

3.5. Output files

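IgSimulator creates the working directory (whose name was specified with the -o option) and writes its output files there. The main output files, as reported at the end of the log (see Section 2.1), are:

* repertoire.fasta: sequences of the simulated repertoire
* merged_reads.fastq: simulated merged reads
* ideal_repertoire.clusters.fa: CLUSTERS.FASTA file for the simulated repertoire (see Section 4.1)
* ideal_repertoire.rcm: RCM file for the simulated repertoire (see Section 4.2)
* ig_repertoire_simulation.log: log of the run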

4. Antibody repertoire representation

Two file formats are used to represent the repertoire constructed for a set of reads: CLUSTERS.FASTA and RCM.

4.1. CLUSTERS.FASTA file format

CLUSTERS.FASTA is a FASTA file in which each sequence corresponds to a monoclonal antibody, and the header of each sequence contains the id and size of the corresponding cluster (the set of input reads related to the same monoclonal antibody):
    
    >cluster___1___size___3
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGACG
    >cluster___2___size___2
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGG
    >cluster___3___size___1
    CCCCTGCAATTAAAATTGTTGACCACCTACATACCAAAGACGAGCGCCTTTACGCTTGCCTTTAGTACCTCGCAACGGCTGCGGAC
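
A file in this format can be read with a few lines of Python. The sketch below is an illustration, not IgSimulator code: the function parse_clusters_fasta is hypothetical, and it assumes that cluster ids and sizes are non-negative integers, as in the example above.

```python
import re

# Header layout taken from the example above: >cluster___<id>___size___<size>
HEADER_RE = re.compile(r"^>cluster___(\d+)___size___(\d+)$")


def parse_clusters_fasta(lines):
    """Map cluster id -> (size, sequence) from CLUSTERS.FASTA lines."""
    clusters = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError("unexpected header: " + line)
            current_id = int(match.group(1))
            clusters[current_id] = (int(match.group(2)), "")
        else:
            if current_id is None:
                raise ValueError("sequence line before any header")
            # Sequences may span several lines; concatenate them.
            size, sequence = clusters[current_id]
            clusters[current_id] = (size, sequence + line)
    return clusters


example = [
    ">cluster___1___size___3",
    "CCCCTGCAATT",
    "AAAATTG",
    ">cluster___2___size___2",
    "CCCCTGCA",
]
print(parse_clusters_fasta(example))
```

To parse a real file, pass an open file object, for example parse_clusters_fasta(open(path)).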
    

4.2. RCM file format

Every line of an RCM (read-cluster map) file contains a read name and the id of the corresponding cluster:
    
    MISEQ@:53:000000000-A2BMW:1:2114:14345:28882    1
    MISEQ@:53:000000000-A2BMW:1:2114:14374:28884    1
    MISEQ@:53:000000000-A2BMW:1:2114:14393:28886    1
    MISEQ@:53:000000000-A2BMW:1:2114:16454:28882    2
    MISEQ@:53:000000000-A2BMW:1:2114:16426:28886    2
    MISEQ@:53:000000000-A2BMW:1:2114:15812:28886    3
    

NOTE: ids in CLUSTERS.FASTA and RCM files should be consistent.
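
The consistency requirement can be checked with a short script. The sketch below is one possible interpretation, not IgSimulator code: it assumes that "consistent" means that the same cluster ids occur in both files and that the size in each CLUSTERS.FASTA header equals the number of reads assigned to that cluster in the RCM file. The function names parse_rcm and inconsistent_clusters are hypothetical.

```python
from collections import Counter


def parse_rcm(lines):
    """Map read name -> cluster id from RCM lines."""
    read_to_cluster = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # The cluster id is the last whitespace-separated field.
        read_name, cluster_id = line.rsplit(None, 1)
        read_to_cluster[read_name] = int(cluster_id)
    return read_to_cluster


def inconsistent_clusters(read_to_cluster, cluster_sizes):
    """Return sorted cluster ids whose RCM read count differs from the declared size."""
    counts = Counter(read_to_cluster.values())
    all_ids = set(counts) | set(cluster_sizes)
    return sorted(
        cid for cid in all_ids
        if counts.get(cid, 0) != cluster_sizes.get(cid, 0)
    )


rcm_lines = [
    "MISEQ@:53:000000000-A2BMW:1:2114:14345:28882    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14374:28884    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:14393:28886    1",
    "MISEQ@:53:000000000-A2BMW:1:2114:16454:28882    2",
    "MISEQ@:53:000000000-A2BMW:1:2114:16426:28886    2",
    "MISEQ@:53:000000000-A2BMW:1:2114:15812:28886    3",
]
# Sizes as declared in the CLUSTERS.FASTA example of Section 4.1.
declared_sizes = {1: 3, 2: 2, 3: 1}
print(inconsistent_clusters(parse_rcm(rcm_lines), declared_sizes))  # []
```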

5. Feedback and bug reports

Your comments, bug reports, and suggestions are very welcome. They will help us to further improve IgSimulator.

If you have any trouble running IgSimulator, please send us the log file from the output directory.

Address for communications: igtools_support@googlegroups.com.