Motivation and purposes
Construction of an antibody repertoire is a preliminary step of any immunological analysis based on Repseq reads. In the last few years at least three tools were released to address this problem: IgRepertoireConstructor by Safonova et al (2015), MiXCR by Bolotin et al. (2015) and pRESTO by Vander Heiden et al. (2014). However, no conventional quality assessment methodology was proposed yet. Without such methodology we cannot compare tools and determine their limitations.
We developed IgQUAST (ImmunoGlobulin QUality ASsesstment Tool), a tool for benchmarking of fulllength adaptive immune repertoire construction tools and quality assessment of adaptive immune repertoires.
IgQUAST has the following features (explained in details in further sections):
 Referencebased and referencefree repertoire quality assessment;
 Overcorrection/undercorrection detection;
 Analysis of antibody abundances;
 Detection of sequencing technology and repertoire construction strategy artifacts.
Quality assessment scenarios
IgQUAST input:
 Initial Repseq read library;
 Analyzed adaptive immune repertoire constructed on this library;
 For referencebased analysis, the reference repertoire constructed on the same library.
 Repertoire sequences along with their abundances (the number of reads formed the cluster);
 Map from initial reads to the corresponding clusters (readtocluster map, RCM).
Note: IgQUAST benchmarks only the repertoire construction stage. Input reads should be already merged, filtered from contaminations, and cropped by V/J ends.
IgQUAST performs referencebased and referencefree analysis:
 During referencebased analysis the tool compares two input repertoires: the reference repertoire and the constructed repertoire.
The analysis is separated into two scenarios:
 Repertoiretorepertoire matching only uses repertoire sequences. The tool aligns each of two repertoires against the other one and computes sensitivity and precision metrics, detects error positions in erroneously constructed sequences, and compares reference and constructed abundances for ideally reconstructed sequences.
 Partitionbased analysis only uses partitions induced by the RCMs (readtocluster maps). The tool compares two partitions and computes partition similarity metrics (like Rand index). Also it computes cluster quality measures (like purity and discordance) and plots their distributions for both input repertoires.
 Referencefree analysis is performed on the constructed repertoire. The tool detects overestimated clusters in the repertoire using amplificationfree Poisson model. It also estimates error rate and error profile of the initial read library. The same analysis is performed on the reference repertoire if it is provided. All referencefree analysis requires both repertoire sequences and readtocluster map (RCM).
The output IgQUAST is a collection of plots along with numerical statistics that are described below. Brief report is outputted to screen and saved to text log file. All computed metrics along with additional information (e.g. the data to reproduce some of plots) are exported into JSON. Table below summarizes scenarios requirements and results.
Scenario  Requirements  Resultant output 
Repertoiretorepertoire matching  Constructed and reference repertoires in CLUSTER.FA format  Plots: Statistics: 
Partitionbased analysis  Initial input Repseq library, RCM files for the reference and the constructed repertoires  Plots: Statistics: 
Referencefree analysis  Initial input Repseq library, constructed repertoire RCM file 
Plots:

Certain scenarios are disabled by default since they are time and memoryconsuming. See IgQUAST manual for command line options description.
Reference repertoire
Reference repertoire is obtained either by repertoire simulation or by construction of the repertoire using the barcoding data and running repertoire construction in the blind (ignoring barcodes) mode.
Simulation
There are no special requirements for simulation software or/and simulated repertoire. However, we suggest using our IgSimulator tool that was specially designed for benchmarking adaptive immune repertoire construction strategies.
Molecular barcoding
Molecular identifiers (UMIs) or barcodes can be used for Bcell labeling. Such labeling allows one to reconstruct adaptive immune sequences using consensus by groups and, thus, provides one with a reference dataset for benchmarking.
Please note that currently available barcoding techniques are not perfect. Thus, using naive barcode assembly approach may lead to imprecise reference and biased analysis. The better option is to use a special tool designed for barcodeassisted antibody repertoire construction. We suggest using our BarIgReC tool for this.
Another approach to consider barcoding artifacts is to use specially designed semiartificial test datasets. Such datasets are obtained by utmost aggressive input read and barcode filtering: all suspicious barcodes and reads are omitted. Such an approach is definitely improper for reallife repertoire construction (a significant part of repertoire diversity is lost), but is appropriate for test dataset preparation.
Cluster abundance threshold
Distribution of antibody abundances is usually highly uneven. Most clusters in a repertoire have small size. Note that consensus computed on a small cluster is unreliable. Additionally, due to the dependence of amplification errors and/or special sequencing errors pattern even larger cluster could be unreliable.
Thus, we have to exclude small clusters from the analysis in order to make results more stable and interpretable. The threshold for reference cluster size (reference minsize) is a parameter of benchmarking procedure. IgQUAST uses 5 as a default value. See the paper for more information about this choice.
At the same time, during reallife repertoire construction one have to choose a threshold for constructed cluster size. Constructed cluster size threshold (constructed minsize) is a parameter of repertoire construction procedure. The selection is tricky since the larger threshold, the less detailed but more trustworthy result. One have to find the optimal (in some sense) threshold value, the concrete optimality criterion depends mostly on the final purpose of the research.
For different repertoire construction strategies optimal (in any sense) threshold values can be different. Due to this reason, IgQUAST computes statistics for various threshold values. However, for good repertoire construction tool that does not underestimate abundances (see the corresponding section for more information about abundance estimation) we can expect that optimal constructed minsize is equal or close to the reference minsize.
Repertoiretorepertoire matching
Let us describe the provided comparison between reference and constructed repertoire in terms of their sequences.
Constructedtoreference and referencetoconstructed distance distributions
For each sequence in the constructed repertoire IgQUAST finds the closest sequence in the reference repertoire and constructs histogram of distances. It also constructs the same histogram by mapping the reference repertoire to the constructed one.
IgQUAST constructs these histograms for different constructed minsize values (1, 3, 5, 10) and for fixed reference minsize (5 by default).
Referencetoconstructed distance distribution; reference minsize = constructed minsize = 5. 
Referencetoconstructed (sensitivity) and constructedtoreference (precision) distance distributions; reference minsize = 5 and various constructed minsizes. 
Note: one should necessarily inspect plots for different minsize values, since small clusters are probably imperfect and should be excluded from the analysis. However, the paramount pair of plots is the third one where constructed minsize = reference minsize. For a good repertoire construction tool we expect optimal constructed minsize to be close to reference minsize.
Plots above suggest that approx. 20% of repertoire sequences possess one error. We will investigate them in greater detail in the next section.
Reported statistics
Extra cluster is a cluster from the constructed repertoire with size more or equal to the constructed min size having not enough abundance (less than the reference minsize) or not presented in the reference repertoire. For the constructed minsize equal to the reference minsize, IgQUAST groups extra clusters by their reference abundance (if cluster is not presented in the reference repertoire, formally define its reference abundance as zero) and reports the frequencies. This diagnostic allows one to estimate the impact of correctly reconstructed lowabundant sequences with overestimated abundance.
For the example above the frequencies are the following:
Extra clusters with size == 4: 57
Extra clusters with size == 3: 63
Extra clusters with size == 2: 36
Extra clusters with size == 1: 75
Extra clusters with size == 0: 4784
That means only a small part of extra clusters are real reference clusters with overestimated size.
Other one are just erroneous.
Error position distribution
For “almost correct” (matched with only one error, i.e. corresponded to the second red bar in the left figure above) constructed clusters IgQUAST plots errorposition distribution (error profile). For this analysis IgQUAST uses constructed minsize equal to the reference minsize, that is 5 by default.
Such analysis can reveal sequencing technology or repertoire construction strategy artifacts.
Error position distribution for “almost correct” constructed sequences. Errors are concentrated in FR4 region. 
After FR4 correction errors are evenly distributed. 
Plot above reveals that errors are mostly concentrated in FR4. Additional analysis shows the key reason for this is extremely high error rate at the end of reads for the considered Repseq library. Fortunately, it could be easily fixed using V/D/J gene germline (e.g. IMGT) database. FR4 is a conservative region and we hopefully do not miss any variations.
Our alignment tool VJ Finder automatically fixes FR1 and FR4 during Repseq reads alignment. Thus, such artifacts is not an issue for IgReC.
Sensitivity and precision
For one predefined reference minsize value (5 by default) and various constructed minsize values IgQUAST finds ideally reconstructed (i.e. matched without errors) clusters and compute the following metrics:
 Precision — rate of ideal clusters among constructed clusters. This is a measure of repertoire clearness. The larger the precision metric, the less the number of false constructed clusters.
 Sensitivity — rate of ideal clusters among reference clusters. This is a measure of repertoire representativity. The larger the sensitivity metric, the more correct clusters were reconstructed.
Sensitivity and precision definition.  Composition of sensitivityprecision plots for three different repertoire construction tools. 
The larger constructed minsize, the less sensitivity (since constructed repertoire is shrinking), and typically the more precision (since larger clusters are more reliable). Therefore, selection of constructed minsize is a tradeoff between sensitivity and precision.
Sensitivityprecision curve (similar to ROCcurve) depicted above helps one to compare different tools. For example above IgReC (blue curve) dominates for all reasonable threshold values. Consequently, IgReC is the best option in this particular case.
Reported statistics
IgQUAST reports: sensitivity and precision for the constructed minsize equal to the reference minsize;
 sensitivity and precision for the optimal (in terms of sum of sensitivity and precision) constructed minsize value along with the optimal value itself;
 area under the sensitivityprecision curve (AUC). AUC could be used is numeric onevalue quality metric.
Abundance distributions
Repertoire clusters are also characterized by their abundances, i.e., cluster sizes. IgReC plots superposed abundance histograms.
Distributions of abundances for both input repertoires.  The same plot in logarithmic scale. 
Note: the histograms are more informative in logscale due to high unevenness of antibody abundance distribution.
The plots above reveal that the constructed repertoire contains about 4 times more small clusters. A repertoire construction tool often fails to fix highly corrupted reads and such reads typically become singletons. Thus, the most part of small constructed clusters have size 1.
Cluster abundances scatterplot
Repertoire sequences are also characterized by their abundances. For sequences presented in both constructed and reference repertoires IgQUAST compares abundances.
A repertoire construction tool often fails to fix highly corrupted reads. Such nonfixed reads typically become singletons. Therefore, abundances of corresponding clusters are commonly underestimated.
According to our experiments, the dependence between reference and constructed cluster size is almost linear. The linear coefficient depends on error rate and repertoire construction strategy. We can use this coefficient to compare different repertoire construction tools.
For certain clusters constructed abundance is much higher than the reference abundance. We suppose that such clusters are overcorrected, i.e., erroneously glued with other (probably smaller) clusters. Thus, their abundance is overestimated. The plot depicts potentially overcorrected clusters by red points, while other clusters are shown by blue ones.
Repertoire construction tool fails to fix highly corrupted reads. Green cluster abundancy is underestimated. Nonfixed reads become singletons.  Cluster abundances scatterplot for a poorly constructed repertoire. Abundances are highly underestimated; also a lot of overcorrected (red) clusters. 
Reported statistics
IgQUAST reports median abundance rate (the median of abundances rates) — an estimation of abundance underestimation coefficient. This metric could be used for repertoire construction tools benchmarking. For the example above median abundance rate equals to 0.83.Partitionbased analysis
Let us describe the provided comparison between reference and constructed repertoire in terms of their clustering partitions. That includes conventional clustering similarity measures and distributions of cluster purity and discordance.
Clustering similarity measures
IgQUAST reports:
 Jaccard index,
 FowlkesMallows index,
 Rand index,
 adjusted Rand index.
Cluster purity and discordance distributions
For each constucted cluster C IgQUAST computes purity and discordance — the rate of the first and second most presented reference cluster in C respectively (see the figure below). For reference clusters purity and discordance are defined symmetrically.
Constructed purity and discordance are the measures of the cluster overcorrection. If the constructed cluster is impure then it is probably combined from several erroneously joined reference clusters.
Reference purity and discordance are measures of the cluster undercorrection. If the reference cluster is impure then it is probably erroneously split into several constructed clusters.
Purity and discordance definition for reference and constructed clusters. Reads 7 and 8 (red) are called discordant for the constructed repertoire, since they vote for the second popular cluster. 
IgQUAST plots histograms of purity and discordance distributions for reference and constructed clusters. As already discussed above, small clusters should be excluded from analysis. Thus, IgQUAST also plots histograms considering large (≥ 5 by default) clusters only.
Plots below demonstrate the difference between poorly constructed and accurate repertoires.
Distribution of purity for large (≥ 5) reference clusters for poorly constructed (left) and accurate (right) repertoire. Left repertoire contain more pure clusters, but also more impure ones. 
Note: even if a constructed cluster is pure it does not necessarily imply that the cluster is correct. A pure cluster may be an erroneously separated part of a bigger cluster. One should control both reference and construction purity.
Distribution of purity of large (≥ 5) reference clusters for poorly constructed (left) and accurate (right) repertoire. Left repertoire is undercorrected: there are a lot of impure reference clusters, i.e., clusters that are erroneously split. 
Reported statistics
IgQUAST exports the number of pure/impure reference and constructed clusters (purity threshold is 95%) along with the number of discordant reads in for the reference and the constructed repertoires.
Referencefree analysis
Referencefree analysis allows one to estimate quality of the constructed repertoire and the input library. If the reference repertoire is provided, the same analysis is performed to it.
Distribution of #errors in reads. Errorrate estimation
For each initial read corresponding to large cluster (≥ 5 by default) IgQUAST computes the number of corrected errors as the Hamming distance from the corresponding repertoire sequence. Then it constructs histogram and estimates error rate (average number of sequencings errors per read) in Poisson amplificationfree model. The histogram is plotted along with a histogram of Poisson distribution with parameter equal to the estimated error rate.
Distribution of #errors in reads for the constructed (left) and the reference (right) repertoire for simulated Repseq dataset. The actual error rate is 2. 
Example above suggests that for the constructed repertoire the distribution is far from Poisson (the tail is cut). The reason is that repertoire construction algorithm commonly fails to fix highly corrupted reads, therefore, high distances are improbable. At the same time, error rate is estimated correctly (the actual value is 2).
For the reference repertoire distance distribution is almost perfect Poisson since the data was simulated according to Poisson model, i.e. we suppose that sequencing errors are (1) rare and (2) independent. Note that in case of intensive amplification (2) may be not fulfilled.
Reported statistics
IgQUAST reports the estimated error rate. Note that the estimation is reliable only if the repertoire is constructed properly. Therefore, using the reference repertoire provides more adequate estimation.
Error and discordance profiles
For each large cluster (≥ 5 by default) for each sequence position IgQUAST computes consensus and finds the numbers of occurrences of the most popular letter and of the second popular letter. Then, for each position it computes error rate (one minus the rate of the most popular letter) and discordance (the rate of the second popular letter). IgQUAST plots error rate and discordance profiles, that help to detect overcorrection. Erroneously joined similar clusters provide peaks at the plot on the positions, where cluster sequences are different (see the example below).
Error profiles for repertoires constructed by IgReC in aggressive mode (left) and default mode (right). Left repertoire is overcorrected. Peaks on the left plot correspond to positions with undetected differences. 
Discordance profiles for the same repertoires. Discordance profiles demonstrates the same behavior as error profiles. At the time, discordance may be more informative if one wants two separate two close clusters. 
IgQUAST also constructs profiles for 5 largest clusters separately, since in overcorrected repertoire largest clusters almost sure are overcorrected.
Note: for highly accurate (or reference) repertoire error profile is an estimation of sequencing error profile of the initial Repseq library.
Scatterplot of max error over position against cluster size
For each cluster IgQUAST constructs error profile (as described in the previous section) and finds the position with maximal error. Then IgQUAST constructs scatterplot of the maximal number of errors over position against cluster size. Clusters with relatively large maximal number of errors are untrustworthy (shown by red points).Scatterplots of maximal error position along sequence against cluster size for repertoires constructed by IgReC in aggressive (left) and default (right) mode. Red points correspond to potentially overcorrected clusters with extremely high number of errors in one sequence position. Dashed lines show 95% confidence interval in Poisson amplificationfree model. 
IDs of potentially overcorrected clusters are exported into JSON report file. Note that Poisson model may be imprecise in case of intensive amplification.
Feedback and bug reports
Your comments, bug reports, and suggestions are very welcome. They will help us to further improve IgQUAST.If you have any trouble running IgQUAST, please provide us the log file from the output directory.
Address for communications: igtools_support@googlegroups.com.
Manuals and citations
Manuals
IgQUAST manual can be found here.Citations
Alexander Shlemov, Sergey Bankevich, Andrey Bzikadze, Dmitriy M. Chudakov, Yana Safonova, and Pavel A. Pevzner. Reconstructing antibody repertoires from errorprone immunosequencing datasets (submitted)
To produce IgQUAST input for the paper we used tool parameters listed here