CRIStAL - Centre de Recherche en Informatique et Automatique de Lille

HDR defense of Camille Marchet

4 September 2025

HDR defense of Camille MARCHET, Thursday 4 September 2025, starting at 1:30 pm in the Atrium, ESPRIT building, CRIStAL laboratory.

Habilitation title: A feeling for the index: k-mer data-structures for data reuse in large scale genomics and transcriptomics

Jury:

Reviewer: Sarah Djebali, Research Scientist (Chargée de recherche), IRSD, INSERM U1220, Toulouse
Reviewer: Élodie Laine, Professor, CQSB CNRS UMR 7238; INSERM 1284, IBPS, Sorbonne Université, Paris
Reviewer: Sven Rahmann, Professor, Chair of Algorithmic Bioinformatics at Saarland University
Examiner: Christina Boucher, Professor, University of Florida
Examiner: Guy Perrière, Research Director (Directeur de recherche), LBBE CNRS UMR 5558, Université Claude Bernard Lyon 1
Guarantor (garant): Rémi Bardenet, Research Director (Directeur de recherche), CRIStAL UMR 9189 CNRS, Université de Lille

Abstract:

This habilitation thesis explores algorithmic strategies for indexing large-scale biological sequence datasets comprising billions of objects and terabytes to petabytes of raw data. The work treats DNA and RNA sequences as textual data and draws on several years of personal research, centered at the CRIStAL laboratory, on the challenges of designing structures that organize k-mer sets, that is, sets of short, fixed-length substrings of sequences. By extracting k-mers at every possible position, DNA and RNA sequences are tokenized into sets that conserve relevant biological information, enabling scalable and efficient analysis.
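As an illustration of the tokenization described above, the following minimal sketch collects the k-mers found at every position of a sequence into a set. The function name `kmers` and the example sequence are hypothetical and chosen only for this illustration; the sketch uses a plain Python set and is not one of the data structures presented in the thesis.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of `sequence`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k over every valid start position.
    # A sequence shorter than k yields an empty set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


# Hypothetical toy sequence: 6 bases give 6 - 3 + 1 = 4 windows of size 3.
example = kmers("ACGTAC", 3)
```

A set discards both the position and the multiplicity of each k-mer, which is why a repeated k-mer appears only once in the result.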

As sequencing technologies produce exponentially growing volumes of RNA and DNA data, the need for efficient, scalable, and interpretable data structures becomes central to enabling meaningful analysis. This thesis presents a structured overview of existing k-mer representation families, from De Bruijn graphs to Burrows-Wheeler-transform-inspired methods, emphasizing their computational properties and trade-offs. It introduces several original contributions, including a static, fast, and memory-efficient dictionary, as well as a dynamic structure that leverages textual regularities to support optimized set operations.

Additionally, I detail methods for handling multi-sample k-mer sets (sets of k-mer sets), leading to REINDEER, a tool specifically optimized for RNA abundance indexing across thousands of datasets. Practical applications in clinical research contexts, such as leukemia studies, illustrate the real-world impact of these innovations.
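To make the idea of indexing k-mer abundances across several samples concrete, here is a toy sketch that maps each k-mer to its count in every sample and answers a query sequence with per-sample counts. It is an assumption-laden illustration built on ordinary Python dictionaries: the names `count_kmers`, `build_index`, and `query`, and the sample data, are hypothetical, and this is not REINDEER's implementation, whose internals the abstract does not describe.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer occurrence over a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def build_index(samples, k):
    """Build a toy index: k-mer -> {sample name: abundance}.

    `samples` maps a sample name to its list of sequences (hypothetical input).
    """
    index = {}
    for name, sequences in samples.items():
        for kmer, count in count_kmers(sequences, k).items():
            index.setdefault(kmer, {})[name] = count
    return index


def query(index, sample_names, sequence, k):
    """For each sample, list the abundance of each k-mer of `sequence`.

    A k-mer absent from a sample is reported with abundance 0.
    """
    query_kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return {
        name: [index.get(kmer, {}).get(name, 0) for kmer in query_kmers]
        for name in sample_names
    }
```

For instance, with the hypothetical samples `{"s1": ["ACGTACGT"], "s2": ["ACGTTT"]}` and k = 4, the k-mer `ACGT` is stored with abundance 2 in `s1` and 1 in `s2`. A dictionary of dictionaries like this one would not scale to thousands of datasets, which is the kind of limitation that dedicated structures are designed to address.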

The discussion concludes with challenges related to integrating such structures into existing and future international genomic repositories. I advocate for a broader perspective on data structure research, designing tools that remain accessible to a wide user community through smart queries, which in turn push the boundaries of current data structure design.

Location: Atrium room, ESPRIT building, CRIStAL laboratory, Villeneuve d'Ascq
