Ben Langmead


Pan-genomic methods for fighting reference bias



DNA and RNA sequencing data analysis often begins with aligning sequencing reads to a reference genome, with the reference represented as a linear string of bases.  But linearity leads to reference bias, a tendency to miss or misreport alignments containing non-reference alleles, which can confound downstream analyses.  This is a major problem in the field; we do not want diagnostics and therapeutics to be differentially effective depending on a patient's genetic background.

Meanwhile, recent advances in Bioinformatics and Computer Science allow us to index and align sequencing reads to references that include many population variants.  I will describe this journey from the early days of efficient genome indexing, continuing through two recent complementary breakthroughs in indexing graph-shaped references (Wheeler graphs) and references that include many genomes (r-index).  I will emphasize recent results showing how to optimize simple and complex pan-genome representations for effective avoidance of reference bias.  Much of this work is collaborative with Travis Gagie, Christina Boucher, Alan Kuhnle and others.

Ben Langmead is an Associate Professor of Computer Science at Johns Hopkins University.  He earned a Ph.D. in Computer Science from the University of Maryland in 2012.  His group seeks to make high-throughput biological datasets easy for biomedical researchers to use.  The group studies and applies ideas from sequence alignment, text indexing, statistics and parallel programming.  His group has released several widely software tools (e.g. Bowtie, Bowtie 2) addressing common problems in genomics.  His current work involves methods and software for efficient sequencing data analysis, methods and resources for making it easy to reuse and query archived sequencing datasets, and methods for alleviating bias and characterizing uncertainty in the read alignment problem.  He is the recipient of a Genome Biology award (2010), Sloan Research Fellowship (2014), a National Science Foundation CAREER award (2014) and the Benjamin Franklin award for contributions to open access (2016).

Nadia Pisanti


ReadMapping on Pan-Genomes with Degenerate Texts


The "Elastic Degenerate String Matching" (EDSM) problem was defined in CPM 2017 as that of finding an occurrence of a pattern P of length m in an ED-text T.
A D-text (degenerate text) is a string that actually represents a set of similar and aligned strings by collapsing common fragments into a standard linear string, and representing variants with sets of alternative substrings. When such substrings are not bound to have the same size, then we talk about elastic D-strings (ED-strings).
A (E)D-texts can be used to represent Pan-Genomes.
In CPM 2017 we gave an O(nm^2+N) time on-line algorithm for EDSM, where n is the length of T and N is its size, defined as the total number of letters. A fundamental toolkit of our algorithm is the O(m^2+N) time solution of the (later called) "Active Prefixes" (AP) problem".
In CPM 2018, Aoyama et al. gave a O(m^{1.5} \sqrt{log m}+N) solution for AP leading to a O(nm^{1.5} \sqrt{log m}+N) time solution for EDSM using our CPM 2017 algorithmic framework. The natural open problem thus became whether the 1.5 exponent could furtherly be decreased. In our ICALP 2019, we prove several properties that answer this and other questions, by succesfully combining Fast Fourier Transform and properties of string periodicity.
I will give an overview of these results as well as some related one that extend the framework.


Nadia Pisanti received her PhD  in Computer Science in 2002 at the University of Pisa, where she is currently associate professor.

She has worked or been long lasting visiting scientist at INRIA, Paris VI, Pasteur Institute, CNRS, Paris Est,  Paris Nord, Leiden University, CWI Amsterdam, King’s College London, and Academy Science in Budapest.

She is/was in the PC of ~90 international conferences, 5 times as PC-chair, and she currently joins the Editorial Board of two International journals.

Her research activity is mainly on the design of algorithms for the analysis of genomic and transcriptomic data.

She co-authored ~35 high quality journal papers, and other ~80 publications among books, book chapters and conference papers, as well as several software tools.

She joins or has joined several national and international projects on algorithmic and bioinformatics research topics. She currently leads an italian project, and joins the EU ITN ALPACA project on Computational PanGenomics.