|
AMOS Assembly Forensics
Adam Phillippy,
Michael Schatz,
Mihai Pop

Since the initial "draft" sequence of the human genome was released in 2001, it has become clear that it was not an entirely accurate reconstruction of the genome. Despite significant advances in sequencing and assembly since then, genome sequencing continues to be an inexact process. Genome finishing and validation have remained largely manual and expensive, and consequently many genomes are presented as draft assemblies. Draft assemblies are of unknown quality and potentially contain significant mis-assemblies, such as collapsed repeats, sequence excision, or artificial rearrangements. Too often these assemblies are judged only by contig size, with larger contigs preferred without regard to quality, because it has been difficult to gauge large-scale assembly quality. Our new automated software pipeline, amosvalidate, addresses this deficiency and automatically detects mis-assemblies using a battery of known and novel assembly quality metrics. Instead of focusing on a single assembly characteristic, as other validation approaches have done, our approach draws its power from multiple sources of evidence. amosvalidate statistically analyzes mate-pair orientations and separations, repeat content, depth-of-coverage, correlated polymorphisms in the read alignments, and read alignment breakpoints to identify structurally suspicious regions of the assembly. The suspicious regions identified by individual metrics are then clustered and combined to identify, with high confidence, regions that are mis-assembled. This approach is necessary for accurately detecting mis-assemblies because each individual characteristic has unavoidable natural variation, but considered together they have greatly increased analytical power. Furthermore, our pipeline can easily be adjusted to analyze assemblies from new sequencing technologies where some inputs, such as base pair quality or mate pairs, are unreliable or not available.
Our validation pipeline provides a robust measure of assembly quality that goes beyond the simple measures commonly reported. Evaluation of the pipeline has shown it to be highly sensitive for mis-assembly detection, and has revealed mis-assemblies in both draft and finished genomes. This is particularly troubling as scientists move away from the “gene by gene” paradigm and attempt to understand the global organization of genomes. Without a correct genome sequence, or even a clear understanding of the errors present, such studies may draw incorrect conclusions. Our goals are to help scientists locate mis-assembled regions of an assembly and to help them correct those regions by focusing their efforts where they are needed most. amosvalidate is compatible with many common assembly formats and is released as open source.

Pipeline stages:
1. Matepair Analysis
2. Correlated SNPs Analysis
3. Read Coverage Analysis
4. Breakpoint Analysis
5. Repeat K-mer Analysis
6. Feature Combiner

Related tools: Hawkeye, MUMmer

1. Matepair Happiness
Matepairs from a double-barreled shotgun sequencing library should be oriented towards each other, and their distance apart in the assembly should match the library's size distribution. The tool asmQC looks for regions where multiple matepairs are mis-oriented or the insert coverage is low. Either can indicate a rearrangement mis-assembly. The tool cestat-cov computes a per-library statistic, called the CE statistic, at every position in the assembly. The CE statistic indicates how well the mates spanning a position match the library's distribution. If the mates are consistently closer together than expected at a given position, as would occur in a collapsed repeat or an excision from the assembly, the statistic will have a large negative value (ce < -4). If the inserts are consistently larger than expected, such as from a repeat copy number expansion or other insertion event, the statistic will have a large positive value (ce > 4).
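To make the thresholds concrete, here is a minimal sketch of how a CE-style statistic could be computed at one position. It assumes the statistic is a z-score of the local mean insert size against the library mean, a formula this page does not spell out. The function names, parameters, and example library values are illustrative and are not taken from cestat-cov.

```python
import math


def ce_statistic(local_insert_sizes, lib_mean, lib_stdev):
    """Z-score of the local mean insert size against the library mean.

    Assumed form: (local_mean - lib_mean) / (lib_stdev / sqrt(n)),
    where n is the number of inserts spanning the position.
    Returns None when no inserts span the position.
    """
    n = len(local_insert_sizes)
    if n == 0 or lib_stdev <= 0:
        return None
    local_mean = sum(local_insert_sizes) / n
    return (local_mean - lib_mean) / (lib_stdev / math.sqrt(n))


def classify(ce, threshold=4.0):
    """Label a position using the ce < -4 / ce > 4 cutoffs from the text."""
    if ce is None:
        return "no-data"
    if ce < -threshold:
        return "compression"  # mates closer than expected
    if ce > threshold:
        return "expansion"    # mates farther apart than expected
    return "ok"


# Hypothetical library: mean 3000 bp, standard deviation 500 bp.
compressed = ce_statistic([2400] * 16, lib_mean=3000, lib_stdev=500)
print(compressed, classify(compressed))
```

Because the standard error shrinks with the number of spanning inserts, a modest but consistent shift across many mates can cross the threshold, while a single unusual insert cannot.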
cestat-cov output file: asm.ce.feat
asmQC output is written directly to the bank

2. Correlated SNP Detection
Correlated SNPs are positions in the genome where most of the reads have one base, but multiple other reads agree on a different base. Unlike sequencing errors, which occur at random, these correlated discrepancies can indicate the presence of a mis-assembly. In a haploid bacterial genome, for example, correlated SNPs nearly always indicate that two copies of a near-identical repeat have been collapsed into a single copy. In diploid or polyploid genomes, they can indicate a collapsed repeat, or positions where the homologous chromosomes disagree. If their frequency is higher than expected biologically, it is strong evidence for a collapsed repeat.
analyzeSNPs output file: asm.snps
clusterSNPs output file: asm.snp.feat

3. Read Coverage
If the libraries have been constructed using a random shearing process, the reads should uniformly cover the genome at the average depth of coverage. Regions where the coverage is deeper than expected can indicate a collapsed repeat.
analyze-read-depth output file: asm.depth.feat

4. Singleton Breakpoint Analysis
After an assembly is complete, there can be reads left over, called singletons, that are not placed in the assembly. These reads are often from contaminant DNA or otherwise low-quality sequence and can be safely ignored. However, some types of mis-assembly produce singletons in which a portion of the read aligns well to the contig but the rest of the read, past the mis-assembly junction, does not. If multiple reads all follow the same pattern of partially aligning up to the same position, this is strong evidence for a mis-assembly.
listReadPlacedStatus output file: asm.singletons
casm-breaks output file: asm.break.fea
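The idea of several singletons breaking at the same position can be sketched as a simple clustering step. The input layout (one contig and breakpoint coordinate per partially aligned singleton), the function name, and the tolerance and min_reads parameters are all assumptions made for illustration. This is not how casm-breaks is implemented.

```python
from collections import defaultdict


def cluster_breakpoints(breaks, tolerance=5, min_reads=2):
    """Group alignment breakpoints that fall at nearly the same position.

    breaks: iterable of (contig_id, position) pairs, one per singleton
            whose alignment to the contig stops partway through the read.
    Returns (contig_id, first_pos, last_pos, read_count) for every group
    supported by at least min_reads reads.
    """
    by_contig = defaultdict(list)
    for contig, pos in breaks:
        by_contig[contig].append(pos)

    clusters = []
    for contig in sorted(by_contig):
        positions = sorted(by_contig[contig])
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= tolerance:
                # Close enough to the previous breakpoint: same cluster.
                prev = pos
                count += 1
            else:
                if count >= min_reads:
                    clusters.append((contig, start, prev, count))
                start = prev = pos
                count = 1
        if count >= min_reads:
            clusters.append((contig, start, prev, count))
    return clusters


# Hypothetical breakpoints: three reads stop near position 1000 of c1.
example = [("c1", 1000), ("c1", 1002), ("c1", 1003),
           ("c1", 5000), ("c2", 40), ("c2", 300)]
print(cluster_breakpoints(example))
```

Isolated breakpoints are dropped, which matches the observation that a lone singleton is usually contamination or low-quality sequence rather than evidence of a junction.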
5. Repeat K-mer Analysis
Almost all mis-assemblies are caused by repeats, so it is useful to find the locations of the repeats in an assembly, and especially the locations of collapsed or expanded repeats. We developed a new metric, called normalized k-mer analysis, that can discover collapsed or expanded repeats. A k-mer is a k-length substring of a longer sequence. Using a sliding window across a sequence, we can catalog all k-mers and count the number of occurrences of each. Call K_r the set of k-mers in the reads, and K_c the set of k-mers in the contig consensus sequences. A normalized k-mer count, K*, is the number of times a given k-mer q occurs in K_r divided by the number of times q occurs in K_c. This simple statistic can reveal which repeats have been mis-assembled. For example, each k-mer within a 2-copy repeat will be present in K_r about 2 times the depth of coverage. If the repeat also occurs in 2 copies in the assembly, then those k-mers will all be present twice in K_c, and K* will be equal to the depth of coverage. If, however, the repeat was collapsed and occurs only once, then the count in K_c will be 1 across the repeat, and K* will be equal to 2 times the average depth of coverage.
count-kmers output file: asm.22.n22mers
kmer-cov output file: asm.nkmer.feat

6. Feature Combiner
The above metrics can find many different types of mis-assemblies, but each is limited in the types of mis-assembly it can find. Furthermore, normal statistical variation may introduce false positives in the analysis. For example, flagging every insert whose size is more than 2 standard deviations below the library mean will flag about 2.5% of the inserts, even though the vast majority are correct. Instead we use a feature combiner to collect all of the evidence for a mis-assembly and output regions where multiple mis-assembly features are present. This lets users focus their attention on the regions that are most likely to be mis-assembled. All of the features are loaded into the bank, and will then be visible within Hawkeye for further inspection.
suspiciousfeat2region output file: asm.suspicious.feat |
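The combining step described above can be sketched as interval merging: overlapping features on the same contig are merged into one region, and only regions backed by several features are reported. The tuple layout, the feature labels ("ce", "snp", "depth", "break", "kmer"), and the min_features and max_gap parameters are hypothetical. This illustrates the idea only and does not describe suspiciousfeat2region itself.

```python
def combine_features(features, min_features=2, max_gap=0):
    """Merge overlapping features per contig; keep well-supported regions.

    features: iterable of (contig, start, end, kind) tuples.
    Returns a list of (contig, start, end, kinds) tuples, where kinds
    lists the labels of all features merged into that region.
    """
    regions = []
    current = None  # region being extended: [contig, start, end, kinds]
    for contig, start, end, kind in sorted(features):
        if current and current[0] == contig and start <= current[2] + max_gap:
            # Feature overlaps (or nearly touches) the open region.
            current[2] = max(current[2], end)
            current[3].append(kind)
        else:
            if current:
                regions.append(current)
            current = [contig, start, end, [kind]]
    if current:
        regions.append(current)
    # Report only regions with multiple pieces of evidence.
    return [tuple(r) for r in regions if len(r[3]) >= min_features]


# Hypothetical features from the individual analyses.
example = [
    ("c1", 100, 200, "ce"), ("c1", 150, 300, "snp"),
    ("c1", 1000, 1100, "depth"),
    ("c2", 10, 50, "kmer"), ("c2", 40, 60, "break"), ("c2", 55, 90, "ce"),
]
for region in combine_features(example):
    print(region)
```

A stricter variant could require several distinct feature kinds rather than several features, so that, for instance, two depth features alone would not qualify.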