Rep-seq or full-length sequencing of adaptive immune repertoires

Modern sequencing technologies (e.g., Illumina MiSeq) allow biologists to perform full-length scanning of adaptive immune repertoires. For example, a pair of overlapping paired-end Illumina MiSeq reads (2x250 or 2x300 bp) can cover the variable region of an antibody or TCR:

[Figure: antibody_sequencing]

Our tools for analysis of adaptive immune repertoires

Sequencing data of adaptive immune repertoires (Rep-seq) is the input to many immunological studies. We propose a set of tools for solving related immunoinformatics problems:

[Figure: pipeline]
Y-tools in a nutshell

Construction of full-length adaptive immune repertoires

Repertoire construction problem

Repertoire construction is a preliminary step of any immunological analysis based on Rep-seq reads. An accurately constructed adaptive immune repertoire (antibody or TCR) is needed to avoid erroneous analysis of natural variation and antibody abundances.

The repertoire construction problem can be formulated as a huge instance of the error-correction or read clustering problem:

[Figure: pipeline]

Input / output

A repertoire construction tool takes full-length Rep-seq reads as input and constructs a repertoire as a set of clusters. Each cluster represents a group of reads corresponding to identical antibody (or TCR) chains. Clusters in a repertoire are characterized by sequence and abundance. IgReC reports repertoires in CLUSTERS.FASTA and RCM formats.
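The idea of a cluster as a (sequence, abundance) pair can be illustrated with a minimal Python sketch. It assumes the simplest possible case, where reads are grouped only if their sequences are exactly identical, so it performs no error correction and is not IgReC's algorithm. The `Cluster` type and `naive_repertoire` function are hypothetical names, and the returned read-to-cluster map is only an in-memory analogue, not the actual RCM or CLUSTERS.FASTA file layout.

```python
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple


class Cluster(NamedTuple):
    """Hypothetical in-memory view of one repertoire cluster."""
    sequence: str   # sequence shared by all reads in the cluster
    abundance: int  # number of reads assigned to the cluster


def naive_repertoire(reads: Dict[str, str]) -> Tuple[List[Cluster], Dict[str, int]]:
    """Group reads with identical sequences.

    `reads` maps a read id to its sequence. Returns the clusters
    (most abundant first) and a map from read id to cluster index.
    """
    counts = Counter(reads.values())
    # Sort by decreasing abundance; break ties by sequence for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    clusters = [Cluster(sequence, count) for sequence, count in ordered]
    index = {cluster.sequence: i for i, cluster in enumerate(clusters)}
    read_to_cluster = {read_id: index[seq] for read_id, seq in reads.items()}
    return clusters, read_to_cluster
```

A real repertoire construction tool also has to merge reads that differ only by sequencing or amplification errors, which is what the steps described below address.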


IgReC

IgReC takes reads (paired-end or single) covering the variable region of antibodies as input and corrects sequencing and amplification errors. The algorithm performs the following steps:

VJ Finder

The main goals of the VJ Finder tool are:

  1. Finding and discarding contaminated reads.
  2. Cropping the remaining immunosequencing reads at the start position of the closest V gene segment and the end position of the closest J gene segment. VJ Finder also discards reads that cover only the V gene segment, since they cannot be unambiguously assigned to a cluster in a repertoire.

VJ Finder outputs information about the closest V and J gene segments in tab-separated format:

Read id  V start  V end  V score (% identity)  V id         J start  J end  J score (% identity)  J id
read1    1        296    100.0                 IGHV3-20*01  321      366    89.0                  IGHJ5*02
read2    1        294    98.64                 IGHV3-9*01   309      354    100.0                 IGHJ2*01
...      ...      ...    ...                   ...          ...      ...    ...                   ...
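Rows like those shown above can be read with Python's standard `csv` module. The sketch below is only an illustration: the column keys in `COLUMNS` are hypothetical names chosen here, the input is assumed to contain data rows without a header line, and the coordinates are assumed to be 1-based and inclusive, which the description above does not state.

```python
import csv
import io
from typing import Dict, Iterator, Union

# Hypothetical keys for the nine columns; the real file may label them differently.
COLUMNS = ["read_id", "v_start", "v_end", "v_score", "v_id",
           "j_start", "j_end", "j_score", "j_id"]

Row = Dict[str, Union[str, int, float]]


def parse_vj_rows(handle) -> Iterator[Row]:
    """Yield one typed record per tab-separated data row (no header assumed)."""
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row: Row = dict(zip(COLUMNS, fields))
        for key in ("v_start", "v_end", "j_start", "j_end"):
            row[key] = int(row[key])
        for key in ("v_score", "j_score"):
            row[key] = float(row[key])
        yield row


def crop_read(sequence: str, row: Row) -> str:
    """Crop a read from the V start to the J end.

    Assumes 1-based inclusive coordinates (an assumption of this sketch).
    """
    return sequence[row["v_start"] - 1:row["j_end"]]
```

For example, feeding `io.StringIO("read1\t1\t296\t100.0\tIGHV3-20*01\t321\t366\t89.0\tIGHJ5*02\n")` to `parse_vj_rows` yields a single record whose `v_id` is `IGHV3-20*01`.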

Constructing & clustering Hamming graph

IgReC constructs a Hamming graph to identify similar Rep-seq reads. Vertices of the Hamming graph represent unique reads. An edge connects read1 and read2 if the Hamming distance between them is relatively small.

After constructing the graph, IgReC launches DSF to find dense subgraphs. Dense subgraphs correspond to groups of highly similar antibodies:
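A naive way to build such a graph is to compare every pair of unique reads. The Python sketch below does exactly that; it is a quadratic-time illustration, not IgReC's implementation. The threshold `tau` stands in for "relatively small" and is a hypothetical parameter, and the sketch assumes that only reads of equal length are compared.

```python
from itertools import combinations
from typing import Dict, List, Set


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def build_hamming_graph(reads: List[str], tau: int) -> Dict[str, Set[str]]:
    """Return an adjacency map over unique reads.

    Two reads are connected if they have the same length and their
    Hamming distance is at most `tau` (a hypothetical threshold).
    """
    unique = sorted(set(reads))  # vertices are unique reads
    graph: Dict[str, Set[str]] = {seq: set() for seq in unique}
    for a, b in combinations(unique, 2):
        if len(a) == len(b) and hamming_distance(a, b) <= tau:
            graph[a].add(b)
            graph[b].add(a)
    return graph
```

All-pairs comparison does not scale to millions of reads, so a practical tool needs an indexing strategy to find candidate neighbours; the sketch is meant only to make the graph definition concrete.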

[Figure: hamming_graph_before_clusterization] Example of a connected component of a Hamming graph constructed from immunosequencing reads (human heavy-chain repertoire).
[Figure: hamming_graph_after_clusterization] Clustering of the Hamming graph performed by DSF. Each color corresponds to an antibody cluster in the repertoire.
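The DSF algorithm itself is not described here. As a much simpler stand-in, the Python sketch below splits a graph into connected components and measures how dense a group of vertices is, that is, the fraction of possible edges that are actually present. The function names and the adjacency-map representation are assumptions of this sketch; a dense subgraph finder would go further and split a sparse component into several dense groups.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def connected_components(graph: Graph) -> List[Set[Hashable]]:
    """Split an undirected graph into connected components (iterative DFS)."""
    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], {start}
        while stack:
            node = stack.pop()
            for neighbour in graph[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    stack.append(neighbour)
        seen |= component
        components.append(component)
    return components


def edge_density(graph: Graph, vertices: Set[Hashable]) -> float:
    """Fraction of possible edges present among `vertices` (1.0 for a clique)."""
    n = len(vertices)
    if n < 2:
        return 1.0  # a single vertex is trivially dense
    # Each internal edge is counted once from each endpoint.
    edges = sum(len(graph[v] & vertices) for v in vertices) // 2
    return edges / (n * (n - 1) / 2)
```

A component with low edge density is a candidate for further splitting, whereas a component with density close to 1.0 already looks like a single group of highly similar reads.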

Mass Spectra Analyzer

This step performs immunoproteogenomics analysis to validate the constructed repertoire using mass spectra. It takes an alignment of mass spectra against the constructed repertoire in mzIdentML 1.1 format (e.g., generated by MS-GF+) as input and outputs a set of metrics and plots that show the similarity between the constructed repertoire and the mass spectra:

[Figure: peptide_length_distribution] Histogram of the peptide length distribution.
[Figure: psm_coverage_histogram] Histogram of variable region coverage by PSMs. Grey bars correspond to the expected positions of CDRs.
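The two kinds of plots above rest on simple counts that can be sketched in Python. The sketch assumes each PSM has already been reduced to a `(start, end)` interval on the variable region, 0-based and half-open; this is a hypothetical representation chosen here, and parsing mzIdentML is not shown. The function names are likewise illustrative and not part of the tool.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, end exclusive


def peptide_length_histogram(peptides: List[Interval]) -> Dict[int, int]:
    """Map each peptide length to the number of peptides of that length."""
    return dict(Counter(end - start for start, end in peptides))


def position_coverage(region_length: int, peptides: List[Interval]) -> List[int]:
    """Count, for every position of the region, how many peptides cover it."""
    diff = [0] * (region_length + 1)  # difference array
    for start, end in peptides:
        start, end = max(start, 0), min(end, region_length)  # clip to region
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage, running = [], 0
    for delta in diff[:-1]:
        running += delta  # prefix sum turns differences into counts
        coverage.append(running)
    return coverage
```

Positions with zero coverage inside the expected CDR ranges would indicate parts of the variable region that the mass spectra do not support.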

Manual and citations

Citations

If you use our tools in your research, please cite our papers: