HDR defense of Camille Marchet

September 4, 2025

Camille MARCHET will defend her HDR (habilitation to supervise research) on Thursday, September 4, 2025, starting at 1:30 pm in the Atrium of the ESPRIT building, CRIStAL laboratory.

Habilitation title: A feeling for the index: k-mer data-structures for data reuse in large scale genomics and transcriptomics

Jury members:

  • Reviewer: Sarah Djebali, Research Scientist, IRSD, INSERM U1220, Toulouse

  • Reviewer: Élodie Laine, Professor, CQSB CNRS UMR 7238; INSERM 1284, IBPS, Sorbonne Université, Paris

  • Reviewer: Sven Rahmann, Professor, Chair of Algorithmic Bioinformatics at Saarland University

  • Examiner: Christina Boucher, Professor, University of Florida

  • Examiner: Guy Perrière, Research Director, LBBE CNRS UMR 5558, Université Claude Bernard Lyon 1

  • Guarantor: Rémi Bardenet, Research Director, CRIStAL UMR 9189 CNRS, Université de Lille

Abstract:

This habilitation thesis explores algorithmic strategies for indexing large-scale biological sequence datasets comprising billions of objects and terabytes to petabytes of raw data. The work focuses on DNA and RNA as textual input data and draws on several years of personal research, centered at the CRIStAL laboratory, on the challenges of designing structures that organize k-mer sets, that is, sets of short, fixed-length substrings of sequences. By extracting k-mers at every possible position, DNA and RNA sequences are tokenized into sets that conserve relevant biological information, enabling scalable and efficient analysis.
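As a purely illustrative aside, the tokenization of a sequence into its k-mer set can be sketched in a few lines of Python. This is a minimal sketch, not one of the data structures presented in the thesis: the function name `kmer_set` and the toy sequence are assumptions made for this example, and real tools store k-mers far more compactly than a plain set of strings.

```python
def kmer_set(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of `sequence`.

    Hypothetical helper for illustration only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over every possible start position.
    # If the sequence is shorter than k, the range is empty and so is the set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Toy example: a 7-base sequence has 5 window positions for k = 3,
# but the k-mer "ACG" occurs twice, so the set holds 4 distinct k-mers.
print(sorted(kmer_set("ACGTACG", 3)))  # ['ACG', 'CGT', 'GTA', 'TAC']
```

Because the result is a set, repeated k-mers collapse into a single entry and positional information is discarded.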

As sequencing technologies produce exponentially growing volumes of RNA and DNA data, the need for efficient, scalable, and interpretable data structures becomes central to enabling meaningful analysis. This thesis presents a structured overview of existing k-mer representation families, from De Bruijn graphs to Burrows-Wheeler-transform-inspired methods, emphasizing their computational properties and trade-offs. It introduces several original contributions, including a static, fast, and memory-efficient dictionary, as well as a dynamic structure that leverages textual regularities to support optimized set operations.

Additionally, I detail methods for handling multi-sample k-mer sets (sets of k-mer sets), leading to REINDEER, a tool specifically optimized for RNA abundance indexing across thousands of datasets. Practical applications in clinical research contexts, such as leukemia studies, illustrate the real-world impact of these innovations.

The discussion concludes with challenges related to integrating such structures into existing and future international genomic repositories. I advocate for a broader perspective on data structure research, designing tools that remain accessible to a wide user community through smart queries, which in turn push the boundaries of current data structure design.

Room: Atrium, ESPRIT building, CRIStAL laboratory, Villeneuve d'Ascq
