September 4, 2025
HDR (habilitation à diriger des recherches) defense of Antoine LIMASSET, Thursday, September 4, 2025 at 9:00 a.m. in the Atrium, ESPRIT building, CRIStAL laboratory.
Habilitation title: Nucleic Acid Data Indexation: Novel Enhancements
Jury members:
Reviewer: Claire Lemaitre, Research Director, IRISA, Inria, Rennes
Reviewer: Raluca Uricaru, Associate Professor, LaBRI, Bordeaux
Reviewer: Simon Puglisi, Professor, Department of Computer Science, University of Helsinki
Examiner: Laurent Jacob, Research Director, LCQB, Sorbonne Université, Paris
Examiner: Jean-Stéphane Varré, Professor, CRIStAL, Université de Lille
Guarantor: Mikaël Salson, Professor, CRIStAL, Université de Lille
Abstract:
This habilitation thesis presents a body of research on the indexation of nucleic acid data, exploring a spectrum of data structures designed to manage and analyze the massive datasets produced by modern sequencing technologies. In contemporary bioinformatics, the exponential growth of sequence data has shifted the primary challenge from acquiring data to storing, searching, and interpreting it efficiently. This work traces a personal research trajectory through a hierarchy of representations for sequence data, from raw reads to k-mers, contigs, and, ultimately, compressed fingerprints and sketches. The core argument is that by progressively abstracting sequence information and accepting trade-offs between precision and scale, we can develop powerful computational tools to address previously intractable biological questions.
The document details a portfolio of original contributions, starting with methods for improving the fidelity of raw sequencing data (ELECTOR, CONSENT) and progressing to highly optimized data structures for exact k-mer indexing (BLIGHT, LPHASH). It further explores how refined assembly graphs can serve as powerful intermediate structures for tasks like error correction (BCOOL) and haplotype-aware assembly (BWISE).
A key focus is the analysis of massive data collections, comprising thousands or millions of datasets, where the work transitions from exact representations to probabilistic methods. This is exemplified by tools like PAC and REINDEER2, which enable rapid presence/absence or abundance queries across enormous sequence archives using partitioned, compressed indexes. These methods allow for searching entire databases like GenBank or large clinical cohorts for specific genetic signatures, moving beyond single-sample or reference-based analysis. The REINDEER2 tool illustrates this large-scale approach, providing a method to efficiently index k-mer abundances across thousands of RNA-seq datasets, enabling new avenues for biomarker discovery by querying the full transcriptional diversity present in large patient cohorts.
The thesis concludes with a forward-looking perspective on future challenges and opportunities. It advocates for leveraging delta-compression for burgeoning data repositories, integrating genomic foundation models as algorithmic components, shifting computational paradigms towards GPU-based architectures, and expanding the fundamental alphabet of sequence analysis to more richly encode biological information like epigenetic modifications.
Atrium room, ESPRIT building, CRIStAL laboratory, Villeneuve d'Ascq