Most existing tools, however, support a limited description of the complexity of tandem repeats using a single motif, such as in GangSTR 15 and adVNTR 8 , leaving the variation in motif sequences unexplored.
While ExpansionHunter 9 allows the repeat structure to be defined by a regular expression, it is mostly restricted to STR genotyping and has not been extended to VNTRs.
LRS assemblies have megabase scale contiguity and accurate consensus sequences 16 , 17 that may be used to detect VNTR variation. LRS assemblies have been used to improve VNTR analysis with SRS when used as population-specific references that add sequences missing from the reference and improve alignments 19 , Furthermore, the genotype represents the presence of a known variant, and does not reveal the spectrum of copy number variation that exists in tandem repeat sequences Repeat length estimation in tools specialized for tandem repeat genotyping allows more biological meaningful analyses 7 , 24 , The hypervariability of VNTRs prevents a single assembly from serving as an optimal reference.
Instead, to improve both alignment and genotyping, multiple assemblies may be combined into a pangenome graph 21 , 23 , 26 , 27 composed of sequence-labeled vertices connected by edges such that haplotypes correspond to paths in the graph. Sequences shared between haplotypes are stored in the same vertex, and genetic variation is represented by the structure of the graph.
A conceptually similar construct is the repeat graph 28 , with sequences repeated multiple times in a genome represented by the same vertex. Graph analysis has been used to encode the elementary duplication structure of a genome 29 and for multiple-sequence alignment of repetitive sequences with shuffled domains 30 , making them well-suited to represent VNTRs that differ in both repeat count and composition.
The most straight-forward approach that combines a pangenome graph and a repeat graph is a de Bruijn graph, and was the basis of one of the earliest representations of a pangenome by the Cortex method 31 , The de Bruijn graph has a vertex for every distinct sequence of length k in a genome k- mer , and an edge connecting every two consecutive k -mers, thus k -mers occurring in multiple genomes or in multiple times in the same genome are stored by the same vertex.
While the Cortex method stores entire genomes in a de Bruijn graph, we construct a separate locus-RPGG for each VNTR and store a genome as the collection of locus-RPGGs, which deviates from the definition of a de Bruijn graph because the same k -mer may be stored in multiple vertices.
We find loci that demonstrate population structure with respect to the inferred lengths of VNTR sequences, and importantly to discover 8, loci that show differential motif usage between populations. We used public LRS data for 19 individuals with diverse genetic backgrounds, including genomes from individual genome projects 33 , 34 , structural variation studies 14 , and diversity panel sequencing 22 Fig.
Each genome was sequenced by either PacBio contiguous long read, or high-fidelity sequencing between 16 and fold coverage along with matched 22—fold Illumina sequencing Table 1. This data reflects a wide range of technology revisions, sequencing depth, and data type, however subsequent steps were taken to ensure accuracy of RPGG through locus redundancy and SRS alignments.
We developed a pipeline that partitions LRS reads by haplotype based on phased heterozygous SNVs and assembles haplotypes separately by chromosome. For other datasets, long-read data were used to phase SNVs. Reads from each chromosome and haplotype were independently assembled using the Flye assembler 36 for a diploid of 0. Total graph size built from GRCh38 and an average genome are also shown.
The alignments may be either used to select which loci may be accurately mapped in the RPGG using SRS that match the assemblies red , or may be used to estimate lengths on sample datasets blue.
Source data are provided as a Source Data file. During locus assignment, danbing-tk expands boundaries and merges loci to ensure boundaries of all VNTRs are well-defined and harmonized across genomes Methods Fig.
The RPGGs are a collection of independently constructed bi-directional de Bruijn graphs of each VNTR locus and flanking bases from the haplotype-resolved assemblies. The RPGG differs from a standard bi-directional de Bruijn graph because a k -mer may be repeated in multiple subgraphs. The alignment of a read to an RPGG may be defined by the path in the RPGG with a sequence label that has the minimum edit distance to the read among all possible paths.
While methods exist to find alignments that do not reuse cycles 38 , others allow alignment to cyclic graphs but with high computational costs when applied to RPGG 27 or are limited to alignment in STR regions 9. Efficient alignment with cycles is a more challenging problem recently solved by GraphAligner 39 to map long reads to pangenome graphs.
Using danbing-tk, When reads from the entire genome are considered, for The graph pruning step is the primary cause of missed alignments, and affects on average 2, loci per assembly. On real data, danbing-tk required If SRS reads from a genome were sequenced without bias, sampled uniformly, and mapped without error to the RPGG, the count of a k -mer in a locus mapped by an SRS sample should scale by a factor of read depth with the sum of the count of the k -mer from the locus of both assembled haplotypes for the same genome.
The quality of alignment aln- r 2 and sequencing bias were measured by comparing the k -mer counts from the 19 matched Illumina and LRS genomes Fig. Specifically, Loci with false mapping but retained in the final set have lower but still decent length-prediction accuracy 0. The limited sequence diversity provided by repeat-GRCh38 at this locus failed to recruit reads that map to paths existing in the RPGG but missing or only partially represented in repeat-GRCh A linear fit between the k -mers from mapped reads and the ground truth assemblies shows that there is a fold gain in slope, or measured read depth, when using RPGG compared to repeat-GRCh38 Fig.
The alignment quality is measured by the r 2 of a linear fit between the k -mer counts from the ground truth assemblies and from the mapped reads Methods.
Loci with alignment quality less than 0. To visualize paths with less mapped reads, k -mer counts are clipped at left , middle , and right , respectively, with the maximal k -mer count of each graph being , , and , respectively. New genomes with arbitrary combinations of motifs and copy numbers in VNTRs should still align to an RPGG as long as the motifs are represented in the graph.
SRS reads from the missing genome were mapped into the RPGG, and the estimated locus lengths were compared to the average diploid lengths of corresponding loci in the missing LRS assembly. As the SRS datasets used in this study during pangenome construction were collected from a wide variety of studies with different biases, there was no consistent LSB in either repetitive or nonrepetitive regions for samples from different sequencing runs Supplementary Figs.
However, principal component analysis PCA of repetitive and nonrepetitive regions showed highly similar projection patterns Supplementary Fig. Leveraging this finding, a set of nonrepetitive control regions were used to estimate the LSB of an unseen SRS sample Methods , giving a median length-prediction accuracy of 0. The read depth of a repetitive region correlates to the locus length when aligning short reads to a linear reference genome. However, estimation of VNTR length from read depth has an accuracy of 0.
Boxes span from the lower quartile to the upper quartile, with horizontal lines indicating the median. Whiskers extend to points that are within 1. Loci are ordered along the x -axis by genotyping accuracy in repeat-GRCh The red dotted line indicates the baseline without any improvement. The fraction of reads from these datasets that align to the RPGG ranges from 1. Consistent with the finding in previous leave-one-out analysis, genomes from the same study cluster together in the PCA plot of LSB, and so within each dataset and locus, k -mer counts from SRS reads normalized by sequencing depth were used to compare VNTR content across genomes.
The 1KGP samples contain individuals from African When comparing the average population length to the global average length, Each point is an individual. For each locus, the proportion of variance explained by the most informative k -mer in the EAS is shown for the EAS and AFR populations on the x - and y -axes, respectively.
Points are colored by the difference in normalized k -mer counts, with red and blue indicating k -mers more abundant in EAS and AFR populations, respectively. Edges are colored if the k -mer count is biased toward a certain population. VNTR loci that are unstable may undergo hyper-expansion and are implicated as a mechanism of multiple diseases 4. Alignment to an RPGG provides information about motif usage in addition to estimates of VNTR length because genomes with different motif composition will align to different vertices in the graph.
To detect differential motif usage, we searched for loci with a k -mer that was more frequent in one population than another and not simply explained by a difference in locus length, comparing African AFR and East Asian populations for maximal genetic diversity. Lasso regression against locus length was used to find the k -mer with the most variance explained VEX in EAS genomes, denoted as the most informative k -mer mi-kmer.
A top example of these loci with r 2 at least 0. As the danbing-tk length estimates showed population genetic patterns expected for human diversity, we assessed whether danbing-tk alignments could detect VNTR variation with functional impact. Genomes from the GTEx project were mapped into the RPGG to discover loci that have an effect on nearby gene expression in a length-dependent manner. Similar to the population analysis, the k -mer dosage was used as a proxy for locus length. The quantile-quantile plot shows the observed P- value of each association test two-sided t -test versus the P -value drawn from the expected uniform distribution.
The eVNTRs have the potential to yield insight to disease. The protein product of ERAP2 , or endoplasmic reticulum aminopeptidase 2, is a zinc metalloaminopeptidase involving in the process of Class I MHC mediated antigen presentation and innate immune response.
Abnormal expansion of the VNTR might increase autoimmune disease risk through reducing ERAP2 expression, leaving longer and more antigenic peptides, yet potentially higher fitness against virus infection Deletion of this gene is linked to Koolen-de Vries syndrome 47 , and the locus is associated with Parkinson disease Previous commentaries have proposed that variation in VNTR loci may represent a component of undiagnosed disease and missing heritability 49 , which has remained difficult to profile even with whole-genome sequencing To address this, we have proposed an approach that combines multiple genomes into a pangenome graph that represents the repeat structure of a population.
This is supported by the software, danbing-tk and associated RPGG. We used danbing-tk to generate a pangenome from 19 haplotype-resolved assemblies, and applied it to detect VNTR variation across populations and to discover eQTL. For example, with our approach, we are able to detect loci with differential motif usage between populations, which could be difficult to characterize using an approach such as multiple-sequence alignment of VNTR sequences from assembled genomes.
There are several caveats to our approach. Datasets combined from disparate sequencing runs with batch effects will affect dosage estimates. In contrast to other pangenome approaches 27 , 38 , danbing-tk does not keep track of a reference e.
Furthermore, because it is often not possible to reconstruct a unique path in an RPGG, only counts of mapped reads are reported rather than the order of traversal of the RPGG.
An additional caveat of our approach is that genotype is calculated as a continuum of k -mer dosage rather than discrete units, prohibiting direct calculation of linkage-disequilibrium for fine-scale mapping The rich data Supplementary Data 7 — 9 provided by danbing-tk and pangenome analysis provide the basis for additional association studies.
While most analysis in this study focused on the diversity of VNTR length or association of length and expression, it is possible to query differential motif usage using the RPGG. The ability to detect motifs that have differential usage between populations brings the possibility of detecting differential motif usage between cases and controls in association studies. This can help distinguish stabilizing versus fragile motifs 52 , or resolve some of the problem of missing heritability by discovering new associations between motif and disease Finally, this work is a part of ongoing pangenome graph analysis 53 , 54 , and represents an approach to generating pangenome graphs in loci that have difficult multiple-sequence alignments or degenerate graph topologies.
Additional methods may be developed to harmonize danbing-tk RPGGs with genome-wide pangenome graphs constructed from other methods. TRF 37 v4. The scope of this work focuses on VNTRs that cannot be resolved by typical short-read sequencing methods.
To maintain data quality, VNTR loci that could not be assigned homology were removed from datasets. A misannotation of VNTR boundaries can cause erroneous length estimates. The algorithm maintains an invariant: the flanking sequence in any of the haplotypes does not share k -mers with the VNTR regions from all haplotypes.
Haplotypes with distance between adjacent loci inconsistent with the majority of haplotypes are removed. The second data structure enables graph threading by storing a bi-directional de Bruijn graph for each locus. The third data structure is used for counting k -mers originating from VNTRs. The read mapping algorithm maps each pair of Illumina paired-end reads to a unique VNTR locus in three phases: 1 In the k -mer set mapping phase, the read pair is converted to a pair of canonical k -mer multisets.
The VNTR locus with the highest count of intersected k -mers is detected with the first data structure. To account for sequencing and assembly errors, the algorithm is allowed to edit a limited number of nucleotides in a read if no matching k -mer is found in the graph.
The read pair is determined feasible to map to a VNTR locus if the number of mapped k -mers is above an empirical threshold. Finally, the read mapping algorithm returns the k -mer counts for all loci as mapped by SRS reads. Alignment timing was conducted on an Intel Xeon Ev2 2. Pan-genome representation provides a more thorough description of VNTR diversity and reduces reference allele bias, which effectively improves the quality of read mapping and downstream analysis.
We ran the read mapping algorithm with error correction disabled so as to detect k -mers unsupported by SRS reads. The three data structures were updated by deleting all unsupported k -mers for each locus. By referencing the mapping relation of VNTR loci across individuals, we encoded the variability of each VNTR locus by merging the three data structures across individuals.
To evaluate the quality of the haplotype assemblies and the performance of the read mapping algorithm, VNTR k -mer counts in the original assemblies were regressed against those mapped from SRS reads. The r 2 of the linear fit was used as the alignment quality score referred to as aln- r 2. To measure alignment quality in the pan-genome setting, only the k -mer set derived from the genotyped individual was retained as the input for regression.
Several of the commercial STR human identity kits have been optimized for automated DNA sequencers which enables data collection and analysis by sophisticated software. Commercial kits for VNTR analysis currently are not available. Sixteen post-transplant samples from four allogeneic BMT patients are included in this correlation study.
Genomic DNA was isolated from peripheral blood or bone marrow. After amplification, 2. DNA was diluted to 0. Each reaction contained 6. Thermocycling conditions were specific for the VNTR to be amplified. Southern analysis was performed by standard techniques. Digested DNA was electrophoresed, transferred to a nylon solid support, and hybridized with a radiolabeled probe complementary to an informative VNTR repeat. Results were visualized by autoradiography. Probe pTHH59 15 was used for Southern analysis.
Peak area ratios see below of informative donor and recipient alleles are formulated to express the DNA percentage. Specific formulas for calculating the percentage of donor or recipient DNA are based on whether the informative loci are heterozygous or homozygous and whether the donor and recipient share any alleles. Because recipient DNA is usually the minor component in post-transplant samples, calculation of the percentage of recipient DNA is usually performed.
The shared allele is omitted from the calculation:. The calculated percentages of recipient DNA were then plotted vs the known percentages. Curves were fit by linear regression analysis. Standard curves typically show little deviation from linearity see Results.
For post-transplant sample STR analysis in this study, the percentage of recipient DNA is interpolated from the standard curve. Because replicates were not run, the precision of the assay is not defined and the reported values are considered to be estimates.
Standard curves were not generated in this study for VNTR analysis based on an assumption of linearity. They consist of small peaks that are one repeat unit shorter than the main amplified peak.
In cases where an informative allele in a stutter position must be used for post-transplant analysis, contribution from the stutter peak is compensated for by subtraction of a percentage of the main peak area.
This percentage is determined by analysis of the relevant stutter peak in the pre-transplant specimen. Patients are designated by letters A to D. Serial samples were examined for all four patients. These genotypes identified informative loci that distinguish the recipient from the donor. Loci are considered to be informative if they show one or more unique alleles for recipient or donor. Patient A had a total of nine informative alleles four recipient, five donor in three loci. Patient B had a total of four informative alleles two recipient, two donor in two loci.
Patient C had a total of five informative alleles two recipient, three donor in three loci. Patient D had a total of six informative alleles three recipient, three donor in two loci. Thirteen of the 16 samples were quantitated at two independent loci. Three samples were quantitated at a single locus. Values for the same sample determined at independent loci agree closely which demonstrates the internal consistency of the STR multiplex PCR.
No stutter peak interference was present in sample B or C. Overall, STR kit performance and internal consistency for the analysis of patient samples are excellent. For the STR analysis, a standard curve was generated by mixing known ratios of donor and pre-transplant recipient genomic DNA see Methods.
The linearity of these standard curves validates the use of STR analysis for quantitation of mixtures of donor and recipient DNA. Standard curves for mixtures of donor and pre-transplant recipient DNA. Known percentages of pre-transplant recipient DNA are plotted against values calculated from STR analysis at informative loci. Curves are fit by linear regression. Tandem repeats are interspersed throughout the human genome.
Some sequences are found at only one site -- a single locus -- in the human genome. For many tandem repeats, the number of repeated units vary between individuals.
Such loci are termed VNTRs.
0コメント