Investigators: Daniel Andrés Díaz-Pachón (PI) and J. Sunil Rao (co-PI).
The capacity to acquire and store information has increased at an unprecedented rate during the 21st century, driven by high-performance computing (HPC). However, techniques for reducing the enormous amounts of data being collected have advanced much more slowly.
According to Vapnik, the old paradigm of classical parametric statistics developed in the 1930s (based on linear regression on a few input variables, the normal distribution, and the maximum likelihood principle) failed when computer analyses appeared in the 1960s. The situation worsened in at least three respects at the turn of the century with the appearance of big data. First, although large samples were desirable under the old paradigm so that the central limit theorem could be applied, modern big data sources are heavily contaminated with noise. Second, traditional methods were not easily adapted to situations in which the number of variables exceeds the sample size. Third, the strong emphasis on mean-based inference proved insufficient because the expectation is not robust: big datasets in high dimensions tend to have several clusters in different subspaces of the original space, so that mean estimation becomes meaningless. This has led to what some are calling the arrival of “the modal age of statistics”.
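The third point can be illustrated with a small one-dimensional sketch. It is not part of the proposed methodology: the two-cluster sample, the cluster centers, the bin width, and the helper names `histogram_mode` and `count_near` are all assumptions chosen for illustration, and the histogram-based mode estimate is a deliberately crude stand-in for a real mode-hunting procedure.

```python
import math
import random
import statistics
from collections import Counter


def histogram_mode(values, bin_width=0.5):
    """Crude mode estimate: center of the most populated histogram bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    best_bin, _ = counts.most_common(1)[0]
    return (best_bin + 0.5) * bin_width


def count_near(values, point, radius=1.0):
    """Number of observations within `radius` of `point`."""
    return sum(1 for v in values if abs(v - point) <= radius)


# Hypothetical data: two well-separated clusters of unequal size.
rng = random.Random(0)
sample = [rng.gauss(-4.0, 1.0) for _ in range(600)]
sample += [rng.gauss(6.0, 1.0) for _ in range(400)]

mean = statistics.fmean(sample)
mode = histogram_mode(sample)

# The mean falls in the gap between the clusters, where almost no data lie;
# the mode estimate falls inside the larger cluster.
print(f"mean = {mean:.2f}, observations within 1.0 of it: {count_near(sample, mean)}")
print(f"mode = {mode:.2f}, observations within 1.0 of it: {count_near(sample, mode)}")
```

Under these assumptions the sample mean lands between the two clusters, in a region that contains almost no observations, whereas the mode estimate sits where the data actually concentrate.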
For over a decade, we have developed a research program in local mode hunting in both the supervised and unsupervised settings. One of our first results, Local Sparse Bump Hunting (LSBH), is a supervised divide-and-conquer strategy designed for high-dimensional correlated data problems, which earlier approaches, including the Patient Rule Induction Method (PRIM), could not handle. LSBH was applied with great success to identify a larger diversity of colon cancers with differential survival than traditional clinical staging systems could identify. We then developed fastPRIM, an algorithm that detects modes very quickly and accurately for unimodal symmetric distributions, especially when the space is rotated in the direction of the principal components. This provided a substantial improvement over unsupervised versions of PRIM.