Date: 2014-10-24
Time: 15:30-16:30
Location: BURN 1205
Abstract:
Next-generation sequencing (NGS) is a technology revolutionizing genetics and biology. Compared with the older Sanger sequencing method, the throughput is astounding and has fostered a slew of innovative sequencing applications. Unfortunately, the error rates are also higher, complicating many downstream analyses. For example, de novo assembly of genomes is less accurate and slower when reads include many errors. We develop a probabilistic model for NGS reads that can detect and correct errors without a reference genome, while flexibly modeling and estimating the error properties of the sequencing machine. It uses a penalized likelihood to enforce our prior belief that the k-mer spectrum (the collection of length-k substrings observed in the reads) generated from a genome is sparse when k is sufficiently large. The model formalizes core ideas that are used in many ad hoc algorithmic approaches to error correction. We show that our method can detect and remove more errors from sequencing reads than existing methods. Though our method carries a higher computational burden than the best algorithmic approaches, the probabilistic approach is extensible, flexible, and well positioned to support downstream statistical analysis of the increasing volume of sequence data.
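To make the k-mer spectrum idea concrete, here is a minimal Python sketch (not the speaker's probabilistic model; function names and the count threshold are illustrative assumptions). It counts all length-k substrings across a set of reads and flags rare k-mers, which under the sparsity assumption are likely to contain sequencing errors:

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def flag_likely_errors(spectrum, threshold=2):
    """k-mers seen fewer than `threshold` times are likely error-containing.

    (Illustrative heuristic only; the talk's method replaces this hard
    cutoff with a penalized-likelihood model.)
    """
    return {kmer for kmer, count in spectrum.items() if count < threshold}

# Toy example: the third read carries a substitution error (G -> T).
reads = ["ACGTGCA", "ACGTGCA", "ACGTTCA"]
spec = kmer_spectrum(reads, k=4)
errors = flag_likely_errors(spec)
# k-mers overlapping the error (e.g. "CGTT") appear only once and are flagged.
```

True genomic k-mers recur across overlapping reads, while each error tends to spawn k-mers seen only once, which is why a sparse spectrum signals correctable errors.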
Speaker:
Karin Dorman is an Associate Professor in the Departments of Statistics and of Genetics, Development and Cell Biology, and part of the Bioinformatics & Computational Biology interdepartmental program at Iowa State University, Ames, Iowa.