Algorithmic Statistics: Forty Years Later.
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how the statistics works and why some statistical models are better than others. After this notion of a "good model" is introduced, a natural question arises: it is possible that for some piece of data there is no good model? If yes, how often these bad ("non-stochastic") data appear "in real life"? Another, more technical motivation comes from algorithmic information theory. In this theory a notion of complexity of a finite object (=amount of information in this object) is introduced; it assigns to every object some number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistic provides a more fine-grained classification: for each finite object some curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve. In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information (Kolmogorov complexity) theory.
A survey of main results in algorithmic statistics
We present a new structural lemma for deterministic con- text free languages. From the first sight, it looks like a pumping lemma, because it is also based on iteration properties, but it has significant distinctions that makes it much easier to apply. The structural lemma is a combinatorial analogue of KC-DCF-Lemma (based on Kolmogorov complexity), presented by Li and Vit ́anyi in 1995 and corrected by Glier in 2003. The structural lemma allows not only to prove that a language is not a DCFL, but discloses the structure of DCFLs Myhill-Nerode classes.
The paper [Harry Buhrman, Michal Kouck ́, Nikolay Vereshcha- y gin. Randomized Individual Communication Complexity. IEEE Con- ference on Computational Complexity 2008: 321-331] considered com- munication complexity of the following problem. Alice has a bi- nary string x and Bob a binary string y, both of length n, and they want to compute or approximate Kolmogorov complexity C(x|y) of x conditional to y. It is easy to show that deterministic communica- tion complexity of approximating C(x|y) with precision α is at least n − 2α − O(1). The above referenced paper asks what is random- ized communication complexity of this problem and shows that for r- round randomized protocols its communication complexity is at least Ω((n/α)1/r ). In this paper, for some positive ε, we show the lower bound 0.99n for (worst case) communication length of any random- ized protocol that with probability at least 0.01 approximates C(x|y) with precision εn for all input pairs.
The halting problem is undecidable --- but can it be solved for "most" inputs? This natural question was considered in a number of papers, in different settings. We revisit their results and show that most of them can be easily proven in a natural framework of optimal machines (considered in algorithmic information theory) using the notion of Kolmogorov complexity. We also consider some related questions about this framework and about asymptotic properties of the halting problem. In particular, we show that the fraction of terminating programs cannot have a limit, and all limit points are Martin-Löf random reals. We then consider mass problems of finding an approximate solution of halting problem and probabilistic algorithms for them, proving both positive and negative results. We consider the fraction of terminating programs that require a long time for termination, and describe this fraction using the busy beaver function. We also consider approximate versions of separation problems, and revisit Schnorr's results about optimal numberings showing how they can be generalized.