This thesis is devoted to the theoretical analysis of unsupervised learning problems involving highly dependent time-series. Two fundamental problems are considered, namely, the problem of change point estimation and time-series clustering. The problems are considered in an extremely general framework, where the data are assumed to be generated by arbitrary, unknown stationary ergodic process distributions. This is one of the weakest assumptions in statistics, as it is more general than the parametric and model-based settings, and it subsumes most of the non-parametric frameworks considered for this class of problems. These assumptions typically have the premise that each time-series consists of independent and identically distributed observations or that it satisfies certain mixing conditions. For each of the considered problems, novel nonparametric methods are proposed, and are shown to be asymptotically consistent in this general framework. Statistical analysis in the stationary ergodic framework is extremely challenging. In general for this class of processes, rates of convergence (even of frequencies to respective probabilities) are provably impossible to obtain. As a result, given a pair of samples generated independently by stationary ergodic process distributions, it is provably impossible to distinguish between the case where they are generated by the same process or by two different ones. This in turn, implies that such problems as time-series clustering with unknown number of clusters, or change point detection, cannot possibly admit consistent solutions. Thus, a challenging task is to discover the problem formulations which admit consistent solutions in this general framework. The main contribution of this thesis is to constructively demonstrate that despite these theoretical impossibility results, natural formulations of the considered problems exist which admit consistent solutions in this framework. Specifically, natural formulations of change-point estimation and time-series clustering are proposed, and efficient algorithms are provided, which are shown to be asymptotically consistent under the assumption that the process distributions are stationary ergodic. This includes the demonstration of the fact that the correct number of change points can be found, without the need to impose stronger assumptions on the process distributions. The results presented in this work lay down the theoretical foundations for the analysis of sequential data in a broad range of real-world applications.
Directeur de Thèse : Daniil RYABKO Rapporteurs : Francis BACH, Jean-Philippe VERT Membres du Jury: Olivier CATONI, Patrick GALLINARI, Christophe BIERNACKI